Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yang Liu

GCA-3D: Towards Generalized and Consistent Domain Adaptation of 3D Generators

Dec 20, 2024

Hengjia Li, Yang Liu, Yibo Zhao, Haoran Cheng, Yang Yang, Linxuan Xia, Zekai Luo, Qibo Qiu, Boxi Wu, Tu Zheng(+2 more)

Abstract:Recently, 3D generative domain adaptation has emerged to adapt the pre-trained generator to other domains without collecting massive datasets and camera pose distributions. Typically, they leverage large-scale pre-trained text-to-image diffusion models to synthesize images for the target domain and then fine-tune the 3D model. However, they suffer from the tedious pipeline of data generation, which inevitably introduces pose bias between the source domain and synthetic dataset. Furthermore, they are not generalized to support one-shot image-guided domain adaptation, which is more challenging due to the more severe pose bias and additional identity bias introduced by the single image reference. To address these issues, we propose GCA-3D, a generalized and consistent 3D domain adaptation method without the intricate pipeline of data generation. Different from previous pipeline methods, we introduce multi-modal depth-aware score distillation sampling loss to efficiently adapt 3D generative models in a non-adversarial manner. This multi-modal loss enables GCA-3D in both text prompt and one-shot image prompt adaptation. Besides, it leverages per-instance depth maps from the volume rendering module to mitigate the overfitting problem and retain the diversity of results. To enhance the pose and identity consistency, we further propose a hierarchical spatial consistency loss to align the spatial structure between the generated images in the source and target domain. Experiments demonstrate that GCA-3D outperforms previous methods in terms of efficiency, generalization, pose accuracy, and identity consistency.

Via

Access Paper or Ask Questions

Crabs: Consuming Resrouce via Auto-generation for LLM-DoS Attack under Black-box Settings

Dec 18, 2024

Yuanhe Zhang, Zhenhong Zhou, Wei Zhang, Xinyue Wang, Xiaojun Jia, Yang Liu, Sen Su

Figure 1 for Crabs: Consuming Resrouce via Auto-generation for LLM-DoS Attack under Black-box Settings

Figure 2 for Crabs: Consuming Resrouce via Auto-generation for LLM-DoS Attack under Black-box Settings

Figure 3 for Crabs: Consuming Resrouce via Auto-generation for LLM-DoS Attack under Black-box Settings

Figure 4 for Crabs: Consuming Resrouce via Auto-generation for LLM-DoS Attack under Black-box Settings

Abstract:Large Language Models (LLMs) have demonstrated remarkable performance across diverse tasks. LLMs continue to be vulnerable to external threats, particularly Denial-of-Service (DoS) attacks. Specifically, LLM-DoS attacks aim to exhaust computational resources and block services. However, prior works tend to focus on performing white-box attacks, overlooking black-box settings. In this work, we propose an automated algorithm designed for black-box LLMs, called Auto-Generation for LLM-DoS Attack (AutoDoS). AutoDoS introduces DoS Attack Tree and optimizes the prompt node coverage to enhance effectiveness under black-box conditions. Our method can bypass existing defense with enhanced stealthiness via semantic improvement of prompt nodes. Furthermore, we reveal that implanting Length Trojan in Basic DoS Prompt aids in achieving higher attack efficacy. Experimental results show that AutoDoS amplifies service response latency by over 250 $\times \uparrow$, leading to severe resource consumption in terms of GPU utilization and memory usage. Our code is available at \url{https://github.com/shuita2333/AutoDoS}.

* 20 pages, 7 figures, 11 tables

Via

Access Paper or Ask Questions

Defending LVLMs Against Vision Attacks through Partial-Perception Supervision

Dec 17, 2024

Qi Zhou, Tianlin Li, Qing Guo, Dongxia Wang, Yun Lin, Yang Liu, Jin Song Dong

Figure 1 for Defending LVLMs Against Vision Attacks through Partial-Perception Supervision

Figure 2 for Defending LVLMs Against Vision Attacks through Partial-Perception Supervision

Figure 3 for Defending LVLMs Against Vision Attacks through Partial-Perception Supervision

Figure 4 for Defending LVLMs Against Vision Attacks through Partial-Perception Supervision

Abstract:Recent studies have raised significant concerns regarding the vulnerability of Large Vision Language Models (LVLMs) to maliciously injected or perturbed input images, which can mislead their responses. Existing defense methods show that such vision attacks are sensitive to image modifications especially cropping, using majority voting across responses of modified images as corrected responses. However, these modifications often result in partial images and distort the semantics, which reduces response quality on clean images after voting. Instead of directly using responses from partial images for voting, we investigate using them to supervise the LVLM's responses to the original images. We propose a black-box, training-free method called DPS (Defense through Partial-Perception Supervision). In this approach, the model is prompted using the responses generated by a model that perceives only a partial image. With DPS, the model can adjust its response based on partial image understanding when under attack, while confidently maintaining its original response for clean input. Our findings show that the weak model can supervise the strong model: when faced with an attacked input, the strong model becomes less confident and adjusts its response based on the weak model's partial understanding, effectively defending against the attack. With clean input, it confidently maintains its original response. Empirical experiments show our method outperforms the baseline, cutting the average attack success rate by 76.3% across six datasets on three popular models.

Via

Access Paper or Ask Questions

Neural-Network-Driven Reward Prediction as a Heuristic: Advancing Q-Learning for Mobile Robot Path Planning

Dec 17, 2024

Yiming Ji, Kaijie Yun, Yang Liu, Zongwu Xie, Hong Liu

Figure 1 for Neural-Network-Driven Reward Prediction as a Heuristic: Advancing Q-Learning for Mobile Robot Path Planning

Figure 2 for Neural-Network-Driven Reward Prediction as a Heuristic: Advancing Q-Learning for Mobile Robot Path Planning

Figure 3 for Neural-Network-Driven Reward Prediction as a Heuristic: Advancing Q-Learning for Mobile Robot Path Planning

Figure 4 for Neural-Network-Driven Reward Prediction as a Heuristic: Advancing Q-Learning for Mobile Robot Path Planning

Abstract:Q-learning is a widely used reinforcement learning technique for solving path planning problems. It primarily involves the interaction between an agent and its environment, enabling the agent to learn an optimal strategy that maximizes cumulative rewards. Although many studies have reported the effectiveness of Q-learning, it still faces slow convergence issues in practical applications. To address this issue, we propose the NDR-QL method, which utilizes neural network outputs as heuristic information to accelerate the convergence process of Q-learning. Specifically, we improved the dual-output neural network model by introducing a start-end channel separation mechanism and enhancing the feature fusion process. After training, the proposed NDR model can output a narrowly focused optimal probability distribution, referred to as the guideline, and a broadly distributed suboptimal distribution, referred to as the region. Subsequently, based on the guideline prediction, we calculate the continuous reward function for the Q-learning method, and based on the region prediction, we initialize the Q-table with a bias. We conducted training, validation, and path planning simulation experiments on public datasets. The results indicate that the NDR model outperforms previous methods by up to 5\% in prediction accuracy. Furthermore, the proposed NDR-QL method improves the convergence speed of the baseline Q-learning method by 90\% and also surpasses the previously improved Q-learning methods in path quality metrics.

Via

Access Paper or Ask Questions

What External Knowledge is Preferred by LLMs? Characterizing and Exploring Chain of Evidence in Imperfect Context

Dec 17, 2024

Zhiyuan Chang, Mingyang Li, Xiaojun Jia, Junjie Wang, Yuekai Huang, Qing Wang, Yihao Huang, Yang Liu

Figure 1 for What External Knowledge is Preferred by LLMs? Characterizing and Exploring Chain of Evidence in Imperfect Context

Figure 2 for What External Knowledge is Preferred by LLMs? Characterizing and Exploring Chain of Evidence in Imperfect Context

Figure 3 for What External Knowledge is Preferred by LLMs? Characterizing and Exploring Chain of Evidence in Imperfect Context

Figure 4 for What External Knowledge is Preferred by LLMs? Characterizing and Exploring Chain of Evidence in Imperfect Context

Abstract:Incorporating external knowledge into large language models (LLMs) has emerged as a promising approach to mitigate outdated knowledge and hallucination in LLMs. However, external knowledge is often imperfect. In addition to useful knowledge, external knowledge is rich in irrelevant or misinformation in the context that can impair the reliability of LLM responses. This paper focuses on LLMs' preferred external knowledge in imperfect contexts when handling multi-hop QA. Inspired by criminal procedural law's Chain of Evidence (CoE), we characterize that knowledge preferred by LLMs should maintain both relevance to the question and mutual support among knowledge pieces. Accordingly, we propose an automated CoE discrimination approach and explore LLMs' preferences from their effectiveness, faithfulness and robustness, as well as CoE's usability in a naive Retrieval-Augmented Generation (RAG) case. The evaluation on five LLMs reveals that CoE enhances LLMs through more accurate generation, stronger answer faithfulness, better robustness against knowledge conflict, and improved performance in a popular RAG case.

* 12 pages, 4 figures

Via

Access Paper or Ask Questions

SpatialMe: Stereo Video Conversion Using Depth-Warping and Blend-Inpainting

Dec 16, 2024

Jiale Zhang, Qianxi Jia, Yang Liu, Wei Zhang, Wei Wei, Xin Tian

Abstract:Stereo video conversion aims to transform monocular videos into immersive stereo format. Despite the advancements in novel view synthesis, it still remains two major challenges: i) difficulty of achieving high-fidelity and stable results, and ii) insufficiency of high-quality stereo video data. In this paper, we introduce SpatialMe, a novel stereo video conversion framework based on depth-warping and blend-inpainting. Specifically, we propose a mask-based hierarchy feature update (MHFU) refiner, which integrate and refine the outputs from designed multi-branch inpainting module, using feature update unit (FUU) and mask mechanism. We also propose a disparity expansion strategy to address the problem of foreground bleeding. Furthermore, we conduct a high-quality real-world stereo video dataset -- StereoV1K, to alleviate the data shortage. It contains 1000 stereo videos captured in real-world at a resolution of 1180 x 1180, covering various indoor and outdoor scenes. Extensive experiments demonstrate the superiority of our approach in generating stereo videos over state-of-the-art methods.

Via

Access Paper or Ask Questions

Quantization of Climate Change Impacts on Renewable Energy Generation Capacity: A Super-Resolution Recurrent Diffusion Model

Dec 16, 2024

Xiaochong Dong, Jun Dan, Yingyun Sun, Yang Liu, Xuemin Zhang, Shengwei Mei

Figure 1 for Quantization of Climate Change Impacts on Renewable Energy Generation Capacity: A Super-Resolution Recurrent Diffusion Model

Figure 2 for Quantization of Climate Change Impacts on Renewable Energy Generation Capacity: A Super-Resolution Recurrent Diffusion Model

Figure 3 for Quantization of Climate Change Impacts on Renewable Energy Generation Capacity: A Super-Resolution Recurrent Diffusion Model

Figure 4 for Quantization of Climate Change Impacts on Renewable Energy Generation Capacity: A Super-Resolution Recurrent Diffusion Model

Abstract:Driven by global climate change and the ongoing energy transition, the coupling between power supply capabilities and meteorological factors has become increasingly significant. Over the long term, accurately quantifying the power generation capacity of renewable energy under the influence of climate change is essential for the development of sustainable power systems. However, due to interdisciplinary differences in data requirements, climate data often lacks the necessary hourly resolution to capture the short-term variability and uncertainties of renewable energy resources. To address this limitation, a super-resolution recurrent diffusion model (SRDM) has been developed to enhance the temporal resolution of climate data and model the short-term uncertainty. The SRDM incorporates a pre-trained decoder and a denoising network, that generates long-term, high-resolution climate data through a recurrent coupling mechanism. The high-resolution climate data is then converted into power value using the mechanism model, enabling the simulation of wind and photovoltaic (PV) power generation capacity on future long-term scales. Case studies were conducted in the Ejina region of Inner Mongolia, China, using fifth-generation reanalysis (ERA5) and coupled model intercomparison project (CMIP6) data under two climate pathways: SSP126 and SSP585. The results demonstrate that the SRDM outperforms existing generative models in generating super-resolution climate data. For the Ejina region, under a high-emission pathway, the annual utilization hours of wind power are projected to decrease by 2.82 hours/year, while those for PV power are projected to decrease by 0.26 hours/year. Furthermore, the research highlights the estimation biases introduced when low-resolution climate data is used for power conversion.

Via

Access Paper or Ask Questions

Exploring Enhanced Contextual Information for Video-Level Object Tracking

Dec 15, 2024

Ben Kang, Xin Chen, Simiao Lai, Yang Liu, Yi Liu, Dong Wang

Figure 1 for Exploring Enhanced Contextual Information for Video-Level Object Tracking

Figure 2 for Exploring Enhanced Contextual Information for Video-Level Object Tracking

Figure 3 for Exploring Enhanced Contextual Information for Video-Level Object Tracking

Figure 4 for Exploring Enhanced Contextual Information for Video-Level Object Tracking

Abstract:Contextual information at the video level has become increasingly crucial for visual object tracking. However, existing methods typically use only a few tokens to convey this information, which can lead to information loss and limit their ability to fully capture the context. To address this issue, we propose a new video-level visual object tracking framework called MCITrack. It leverages Mamba's hidden states to continuously record and transmit extensive contextual information throughout the video stream, resulting in more robust object tracking. The core component of MCITrack is the Contextual Information Fusion module, which consists of the mamba layer and the cross-attention layer. The mamba layer stores historical contextual information, while the cross-attention layer integrates this information into the current visual features of each backbone block. This module enhances the model's ability to capture and utilize contextual information at multiple levels through deep integration with the backbone. Experiments demonstrate that MCITrack achieves competitive performance across numerous benchmarks. For instance, it gets 76.6% AUC on LaSOT and 80.0% AO on GOT-10k, establishing a new state-of-the-art performance. Code and models are available at https://github.com/kangben258/MCITrack.

* This paper was accepted by AAAI2025

Via

Access Paper or Ask Questions

MambaPro: Multi-Modal Object Re-Identification with Mamba Aggregation and Synergistic Prompt

Dec 14, 2024

Yuhao Wang, Xuehu Liu, Tianyu Yan, Yang Liu, Aihua Zheng, Pingping Zhang, Huchuan Lu

Figure 1 for MambaPro: Multi-Modal Object Re-Identification with Mamba Aggregation and Synergistic Prompt

Figure 2 for MambaPro: Multi-Modal Object Re-Identification with Mamba Aggregation and Synergistic Prompt

Figure 3 for MambaPro: Multi-Modal Object Re-Identification with Mamba Aggregation and Synergistic Prompt

Figure 4 for MambaPro: Multi-Modal Object Re-Identification with Mamba Aggregation and Synergistic Prompt

Abstract:Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by utilizing complementary image information from different modalities. Recently, large-scale pre-trained models like CLIP have demonstrated impressive performance in traditional single-modal object ReID tasks. However, they remain unexplored for multi-modal object ReID. Furthermore, current multi-modal aggregation methods have obvious limitations in dealing with long sequences from different modalities. To address above issues, we introduce a novel framework called MambaPro for multi-modal object ReID. To be specific, we first employ a Parallel Feed-Forward Adapter (PFA) for adapting CLIP to multi-modal object ReID. Then, we propose the Synergistic Residual Prompt (SRP) to guide the joint learning of multi-modal features. Finally, leveraging Mamba's superior scalability for long sequences, we introduce Mamba Aggregation (MA) to efficiently model interactions between different modalities. As a result, MambaPro could extract more robust features with lower complexity. Extensive experiments on three multi-modal object ReID benchmarks (i.e., RGBNT201, RGBNT100 and MSVR310) validate the effectiveness of our proposed methods. The source code is available at https://github.com/924973292/MambaPro.

* This work is accepted by AAAI2025. More modifications may be performed

Via

Access Paper or Ask Questions

DeMo: Decoupled Feature-Based Mixture of Experts for Multi-Modal Object Re-Identification

Dec 14, 2024

Yuhao Wang, Yang Liu, Aihua Zheng, Pingping Zhang

Abstract:Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by combining complementary information from multiple modalities. Existing multi-modal object ReID methods primarily focus on the fusion of heterogeneous features. However, they often overlook the dynamic quality changes in multi-modal imaging. In addition, the shared information between different modalities can weaken modality-specific information. To address these issues, we propose a novel feature learning framework called DeMo for multi-modal object ReID, which adaptively balances decoupled features using a mixture of experts. To be specific, we first deploy a Patch-Integrated Feature Extractor (PIFE) to extract multi-granularity and multi-modal features. Then, we introduce a Hierarchical Decoupling Module (HDM) to decouple multi-modal features into non-overlapping forms, preserving the modality uniqueness and increasing the feature diversity. Finally, we propose an Attention-Triggered Mixture of Experts (ATMoE), which replaces traditional gating with dynamic attention weights derived from decoupled features. With these modules, our DeMo can generate more robust multi-modal features. Extensive experiments on three multi-modal object ReID benchmarks fully verify the effectiveness of our methods. The source code is available at https://github.com/924973292/DeMo.

* This work is accepted by AAAI2025. More motifications may be performed

Via

Access Paper or Ask Questions