Jingkuan Song

Alleviating Hallucinations in Large Vision-Language Models through Hallucination-Induced Optimization

May 24, 2024

Beitao Chen, Xinyu Lyu, Lianli Gao, Jingkuan Song, Heng Tao Shen

Abstract: Although Large Visual Language Models (LVLMs) have demonstrated exceptional abilities in understanding multimodal data, they invariably suffer from hallucinations, leading to a disconnect between the generated text and the corresponding images. Almost all current visual contrastive decoding methods attempt to mitigate these hallucinations by introducing visual uncertainty information that appropriately widens the contrastive logits gap between hallucinatory and targeted ones. However, due to the uncontrollable nature of the global visual uncertainty, they struggle to precisely induce the hallucinatory tokens, which severely limits their effectiveness in mitigating hallucinations and may even lead to the generation of undesired hallucinations. To tackle this issue, we conduct a theoretical analysis to promote the effectiveness of contrastive decoding. Building on this insight, we introduce a novel optimization strategy named Hallucination-Induced Optimization (HIO). This strategy seeks to amplify the contrast between hallucinatory and targeted tokens relying on a fine-tuned theoretical preference model (i.e., Contrary Bradley-Terry Model), thereby facilitating efficient contrastive decoding to alleviate hallucinations in LVLMs. Extensive experimental research demonstrates that our HIO strategy can effectively reduce hallucinations in LVLMs, outperforming state-of-the-art methods across various benchmarks.

* 10 pages. arXiv admin note: text overlap with arXiv:2311.16922 by other authors
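To make the idea of widening the logit gap concrete, here is a minimal sketch of a generic contrastive decoding step. It is not the HIO implementation: the combination rule `(1 + alpha) * base - alpha * induced`, the `alpha` value, the three-token vocabulary, and the logit values are all assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(base_logits, induced_logits, alpha=1.0):
    """Combine two sets of logits as (1 + alpha) * base - alpha * induced.

    base_logits: logits from the original model (assumed input).
    induced_logits: logits from a model or input that favors hallucinatory
        tokens (assumed input; how it is obtained is outside this sketch).
    """
    return [(1 + alpha) * b - alpha * h
            for b, h in zip(base_logits, induced_logits)]


# Toy vocabulary (hypothetical): index 0 = target token, 1 = hallucinatory token.
base = [2.0, 1.9, 0.5]     # the base model barely prefers the target
induced = [1.0, 3.0, 0.5]  # the hallucination-favoring logits prefer token 1
adjusted = contrastive_logits(base, induced, alpha=1.0)

print("gap before:", base[0] - base[1])
print("gap after: ", adjusted[0] - adjusted[1])
print("probabilities after:", softmax(adjusted))
```

The more strongly the second set of logits favors the hallucinatory token, the more that token is suppressed after the subtraction, which is why a contrast signal that induces hallucinations precisely is useful.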

Text-Video Retrieval with Global-Local Semantic Consistent Learning

May 21, 2024

Haonan Zhang, Pengpeng Zeng, Lianli Gao, Jingkuan Song, Yihang Duan, Xinyu Lyu, Hengtao Shen

Abstract: Adapting large-scale image-text pre-training models, e.g., CLIP, to the video domain represents the current state-of-the-art for text-video retrieval. The primary approaches involve transferring text-video pairs to a common embedding space and leveraging cross-modal interactions on specific entities for semantic alignment. Though effective, these paradigms entail prohibitive computational costs, leading to inefficient retrieval. To address this, we propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL), which capitalizes on latent shared semantics across modalities for text-video retrieval. Specifically, we introduce a parameter-free global interaction module to explore coarse-grained alignment. Then, we devise a shared local interaction module that employs several learnable queries to capture latent semantic concepts for learning fine-grained alignment. Furthermore, an Inter-Consistency Loss (ICL) is devised to accomplish the concept alignment between the visual query and corresponding textual query, and an Intra-Diversity Loss (IDL) is developed to repulse the distribution within visual (textual) queries to generate more discriminative concepts. Extensive experiments on five widely used benchmarks (i.e., MSR-VTT, MSVD, DiDeMo, LSMDC, and ActivityNet) substantiate the superior effectiveness and efficiency of the proposed method. Remarkably, our method achieves comparable performance with SOTA as well as being nearly 220 times faster in terms of computational cost. Code is available at: https://github.com/zchoi/GLSCL.

* 9 pages
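The two query losses described above can be illustrated with a toy formulation. This is only a sketch under assumptions: the abstract does not give the loss definitions, so the cosine-based forms below, the function names `inter_consistency` and `intra_diversity`, and the 2-D example vectors are hypothetical stand-ins rather than the definitions used in the GLSCL code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def inter_consistency(visual_queries, textual_queries):
    """Assumed form: pull the i-th visual query toward the i-th textual query."""
    pairs = list(zip(visual_queries, textual_queries))
    return sum(1.0 - cosine(v, t) for v, t in pairs) / len(pairs)


def intra_diversity(queries):
    """Assumed form: penalize positive similarity between distinct queries
    of the same modality, pushing them toward different concepts."""
    n = len(queries)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(max(0.0, cosine(queries[i], queries[j]))
               for i, j in pairs) / len(pairs)


# Hypothetical 2-D queries: two concepts per modality.
visual = [[1.0, 0.0], [0.0, 1.0]]
textual = [[2.0, 0.0], [0.0, 3.0]]

print("inter-consistency:", inter_consistency(visual, textual))
print("intra-diversity (distinct):", intra_diversity(visual))
print("intra-diversity (collapsed):", intra_diversity([[1.0, 0.0], [1.0, 0.0]]))
```

In this toy setting the first loss is lowest when matching visual and textual queries point in the same direction, and the second is lowest when queries within one modality do not overlap.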

RoScenes: A Large-scale Multi-view 3D Dataset for Roadside Perception

May 17, 2024

Xiaosu Zhu, Hualian Sheng, Sijia Cai, Bing Deng, Shaopeng Yang, Qiao Liang, Ken Chen, Lianli Gao, Jingkuan Song, Jieping Ye

Abstract: We introduce RoScenes, the largest multi-view roadside perception dataset, which aims to shed light on the development of vision-centric Bird's Eye View (BEV) approaches for more challenging traffic scenes. The highlights of RoScenes include a significantly large perception area, full scene coverage and crowded traffic. More specifically, our dataset achieves a surprising 21.13M 3D annotations within 64,000 $m^2$. To relieve the expensive costs of roadside 3D labeling, we present a novel BEV-to-3D joint annotation pipeline to efficiently collect such a large volume of data. After that, we organize a comprehensive study for current BEV methods on RoScenes in terms of effectiveness and efficiency. Tested methods suffer from the vast perception area and variation of sensor layout across scenes, resulting in performance levels falling below expectations. To this end, we propose RoBEV that incorporates feature-guided position embedding for effective 2D-3D feature assignment. With its help, our method outperforms the state of the art by a large margin without extra computational overhead on the validation set. Our dataset and devkit will be made available at https://github.com/xiaosu-zhu/RoScenes.

* Technical report. 32 pages, 21 figures, 13 tables. https://github.com/xiaosu-zhu/RoScenes

EchoReel: Enhancing Action Generation of Existing Video Diffusion Models

Mar 18, 2024

Jianzhi Liu, Junchen Zhu, Lianli Gao, Jingkuan Song

Abstract: Recent large-scale video datasets have facilitated the generation of diverse open-domain videos by Video Diffusion Models (VDMs). Nonetheless, the efficacy of VDMs in assimilating complex knowledge from these datasets remains constrained by their inherent scale, leading to suboptimal comprehension and synthesis of numerous actions. In this paper, we introduce EchoReel, a novel approach to augment the capability of VDMs in generating intricate actions by emulating motions from pre-existing videos, which are readily accessible from databases or online repositories. EchoReel seamlessly integrates with existing VDMs, enhancing their ability to produce realistic motions without compromising their fundamental capabilities. Specifically, the Action Prism (AP) is introduced to distill motion information from reference videos, which requires training on only a small dataset. Leveraging the knowledge from pre-trained VDMs, EchoReel incorporates new action features into VDMs through the additional layers, eliminating the need for any further fine-tuning of untrained actions. Extensive experiments demonstrate that EchoReel is not merely replicating the whole content from references, and it significantly improves the generation of realistic actions, even in situations where existing VDMs might directly fail.

* 22 pages, 10 figures

CoIN: A Benchmark of Continual Instruction tuNing for Multimodel Large Language Model

Mar 13, 2024

Cheng Chen, Junchen Zhu, Xu Luo, Hengtao Shen, Lianli Gao, Jingkuan Song

Abstract: Instruction tuning represents a prevalent strategy employed by Multimodal Large Language Models (MLLMs) to align with human instructions and adapt to new tasks. Nevertheless, MLLMs encounter the challenge of adapting to users' evolving knowledge and demands. Therefore, how to retain existing skills while acquiring new knowledge needs to be investigated. In this paper, we present a comprehensive benchmark, namely Continual Instruction tuNing (CoIN), to assess existing MLLMs in the sequential instruction tuning paradigm. CoIN comprises 10 commonly used datasets spanning 8 task categories, ensuring a diverse range of instructions and tasks. Besides, the trained model is evaluated from two aspects: Instruction Following and General Knowledge, which assess the alignment with human intention and knowledge preserved for reasoning, respectively. Experiments on CoIN demonstrate that current powerful MLLMs still suffer from catastrophic forgetting, and that the failure in intention alignment, rather than knowledge forgetting, is mainly responsible. To this end, we introduce MoELoRA to MLLMs, which is effective in retaining the previous instruction alignment. Experimental results consistently show that this method reduces forgetting on CoIN.

Training-Free Semantic Video Composition via Pre-trained Diffusion Model

Jan 17, 2024

Jiaqi Guo, Sitong Su, Junchen Zhu, Lianli Gao, Jingkuan Song

Abstract: The video composition task aims to integrate specified foregrounds and backgrounds from different videos into a harmonious composite. Current approaches, predominantly trained on videos with adjusted foreground color and lighting, struggle to address deep semantic disparities beyond superficial adjustments, such as domain gaps. Therefore, we propose a training-free pipeline employing a pre-trained diffusion model imbued with semantic prior knowledge, which can process composite videos with broader semantic disparities. Specifically, we process the video frames in a cascading manner and handle each frame in two processes with the diffusion model. In the inversion process, we propose Balanced Partial Inversion to obtain generation initial points that balance reversibility and modifiability. Then, in the generation process, we further propose Inter-Frame Augmented attention to augment foreground continuity across frames. Experimental results reveal that our pipeline successfully ensures the visual harmony and inter-frame coherence of the outputs, demonstrating efficacy in managing broader semantic disparities.

Context-based Transfer and Efficient Iterative Learning for Unbiased Scene Graph Generation

Dec 29, 2023

Qishen Chen, Xinyu Lyu, Haonan Zhang, Pengpeng Zeng, Lianli Gao, Jingkuan Song

Abstract: Unbiased Scene Graph Generation (USGG) aims to address biased predictions in SGG. To that end, data transfer methods are designed to convert coarse-grained predicates into fine-grained ones, mitigating imbalanced distribution. However, they overlook contextual relevance between transferred labels and subject-object pairs, such as the unsuitability of 'eating' for 'woman-table'. Furthermore, they typically involve a two-stage process with significant computational costs, starting with pre-training a model for data transfer, followed by training from scratch using transferred labels. Thus, we introduce a plug-and-play method named CITrans, which iteratively trains SGG models with progressively enhanced data. First, we introduce Context-Restricted Transfer (CRT), which imposes subject-object constraints within predicates' semantic space to achieve fine-grained data transfer. Subsequently, Efficient Iterative Learning (EIL) iteratively trains models and progressively generates enhanced labels which are consistent with the model's learning state, thereby accelerating the training process. Finally, extensive experiments show that CITrans achieves state-of-the-art results with high efficiency.

ProS: Prompting-to-simulate Generalized knowledge for Universal Cross-Domain Retrieval

Dec 19, 2023

Kaipeng Fang, Jingkuan Song, Lianli Gao, Pengpeng Zeng, Zhi-Qi Cheng, Xiyao Li, Heng Tao Shen

Abstract: The goal of Universal Cross-Domain Retrieval (UCDR) is to achieve robust performance in generalized test scenarios, wherein data may belong to strictly unknown domains and categories during training. Recently, pre-trained models with prompt tuning have shown strong generalization capabilities and attained noteworthy achievements in various downstream tasks, such as few-shot learning and video-text retrieval. However, applying them directly to UCDR may not be sufficient to handle both domain shift (i.e., adapting to unfamiliar domains) and semantic shift (i.e., transferring to unknown categories). To this end, we propose Prompting-to-Simulate (ProS), the first method to apply prompt tuning for UCDR. ProS employs a two-step process to simulate Content-aware Dynamic Prompts (CaDP) which can impact models to produce generalized features for UCDR. Concretely, in the Prompt Units Learning stage, we introduce two Prompt Units to individually capture domain and semantic knowledge in a mask-and-align way. Then, in the Context-aware Simulator Learning stage, we train a Content-aware Prompt Simulator under simulated test scenarios to produce the corresponding CaDP. Extensive experiments conducted on three benchmark datasets show that our method achieves new state-of-the-art performance without bringing excessive parameters. Our method is publicly available at https://anonymous.4open.science/r/ProS

F3-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis

Dec 06, 2023

Sitong Su, Jianzhi Liu, Lianli Gao, Jingkuan Song

Abstract: Recently, Text-to-Video (T2V) synthesis has undergone a breakthrough by training transformers or diffusion models on large-scale datasets. Nevertheless, inferring such large models incurs huge costs. Previous inference acceleration works either require costly retraining or are model-specific. To address this issue, instead of retraining we explore the inference process of two mainstream T2V models using transformers and diffusion models. The exploration reveals the redundancy in temporal attention modules of both models, which are commonly utilized to establish temporal relations among frames. Consequently, we propose a training-free and generalized pruning strategy called F3-Pruning to prune redundant temporal attention weights. Specifically, when aggregate temporal attention values are ranked below a certain ratio, corresponding weights will be pruned. Extensive experiments on three datasets using a classic transformer-based model CogVideo and a typical diffusion-based model Tune-A-Video verify the effectiveness of F3-Pruning in inference acceleration, quality assurance and broad applicability.
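The ranking rule in the abstract (prune where the aggregate temporal attention value ranks below a certain ratio) can be sketched as follows. This is an illustration under assumptions, not the F3-Pruning code: the function name `select_pruned`, the module names, the score values, and the choice to rank whole modules by a single pre-computed score are all hypothetical.

```python
def select_pruned(aggregate_scores, ratio):
    """Return the names whose aggregate score ranks in the lowest `ratio` fraction.

    aggregate_scores: dict mapping a temporal attention module name to its
        aggregate attention value (how the value is aggregated is assumed
        to be decided elsewhere).
    ratio: fraction of modules to prune, between 0 and 1.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    ranked = sorted(aggregate_scores, key=aggregate_scores.get)  # ascending
    num_pruned = int(len(ranked) * ratio)
    return set(ranked[:num_pruned])


# Hypothetical aggregate scores for four temporal attention modules.
scores = {
    "block0.temporal_attn": 0.42,
    "block1.temporal_attn": 0.05,
    "block2.temporal_attn": 0.31,
    "block3.temporal_attn": 0.02,
}
pruned = select_pruned(scores, ratio=0.5)
kept = set(scores) - pruned

print("pruned:", sorted(pruned))
print("kept:  ", sorted(kept))
```

Because the selection only needs scores collected at inference time, a rule of this shape requires no retraining, which matches the training-free claim in the abstract.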

Make-A-Storyboard: A General Framework for Storyboard with Disentangled and Merged Control

Dec 06, 2023

Sitong Su, Litao Guo, Lianli Gao, Heng Tao Shen, Jingkuan Song

Abstract: Story Visualization aims to generate images aligned with story prompts, reflecting the coherence of storybooks through visual consistency among characters and scenes. However, current approaches exclusively concentrate on characters and neglect the visual consistency among contextually correlated scenes, resulting in independent character images without inter-image coherence. To tackle this issue, we propose a new presentation form for Story Visualization called Storyboard, inspired by film-making, as illustrated in Fig. 1. Specifically, a Storyboard unfolds a story into visual representations scene by scene. Within each scene in Storyboard, characters engage in activities at the same location, necessitating both visually consistent scenes and characters. For Storyboard, we design a general framework coined as Make-A-Storyboard that applies disentangled control over the consistency of contextually correlated characters and scenes and then merges them to form harmonized images. Extensive experiments demonstrate 1) Effectiveness: the method is effective in story alignment, character consistency, and scene correlation; 2) Generalization: our method can be seamlessly integrated into mainstream Image Customization methods, empowering them with the capability of story visualization.
