Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Chenliang Xu

Do More Details Always Introduce More Hallucinations in LVLM-based Image Captioning?

Jun 18, 2024

Mingqian Feng, Yunlong Tang, Zeliang Zhang, Chenliang Xu

Abstract:Large Vision-Language Models (LVLMs) excel in integrating visual and linguistic contexts to produce detailed content, facilitating applications such as image captioning. However, using LVLMs to generate descriptions often faces the challenge of object hallucination (OH), where the output text misrepresents actual objects in the input image. While previous studies attribute the occurrence of OH to the inclusion of more details, our study finds technical flaws in existing metrics, leading to unreliable evaluations of models and conclusions about OH. This has sparked a debate on the question: Do more details always introduce more hallucinations in LVLM-based image captioning? In this paper, we address this debate by proposing a novel decoding strategy, Differentiated Beam Decoding (DBD), along with a reliable new set of evaluation metrics: CLIP-Precision, CLIP-Recall, and CLIP-F1. DBD decodes the wealth of information hidden in visual input into distinct language representations called unit facts in parallel. This decoding is achieved via a well-designed differential score that guides the parallel search and candidate screening. The selected unit facts are then aggregated to generate the final caption. Our proposed metrics evaluate the comprehensiveness and accuracy of image captions by comparing the embedding groups of ground-truth image regions and generated text partitions. Extensive experiments on the Visual Genome dataset validate the effectiveness of our approach, demonstrating that it produces detailed descriptions while maintaining low hallucination levels.

Via

Access Paper or Ask Questions

Learning to Transform Dynamically for Better Adversarial Transferability

May 23, 2024

Rongyi Zhu, Zeliang Zhang, Susan Liang, Zhuo Liu, Chenliang Xu

Figure 1 for Learning to Transform Dynamically for Better Adversarial Transferability

Figure 2 for Learning to Transform Dynamically for Better Adversarial Transferability

Figure 3 for Learning to Transform Dynamically for Better Adversarial Transferability

Figure 4 for Learning to Transform Dynamically for Better Adversarial Transferability

Abstract:Adversarial examples, crafted by adding perturbations imperceptible to humans, can deceive neural networks. Recent studies identify the adversarial transferability across various models, \textit{i.e.}, the cross-model attack ability of adversarial samples. To enhance such adversarial transferability, existing input transformation-based methods diversify input data with transformation augmentation. However, their effectiveness is limited by the finite number of available transformations. In our study, we introduce a novel approach named Learning to Transform (L2T). L2T increases the diversity of transformed images by selecting the optimal combination of operations from a pool of candidates, consequently improving adversarial transferability. We conceptualize the selection of optimal transformation combinations as a trajectory optimization problem and employ a reinforcement learning strategy to effectively solve the problem. Comprehensive experiments on the ImageNet dataset, as well as practical tests with Google Vision and GPT-4V, reveal that L2T surpasses current methodologies in enhancing adversarial transferability, thereby confirming its effectiveness and practical significance. The code is available at https://github.com/RongyiZhu/L2T.

* accepted as a poster in CVPR 2024

Via

Access Paper or Ask Questions

V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning

Apr 18, 2024

Hang Hua, Yunlong Tang, Chenliang Xu, Jiebo Luo

Figure 1 for V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning

Figure 2 for V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning

Figure 3 for V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning

Figure 4 for V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning

Abstract:Video summarization aims to create short, accurate, and cohesive summaries of longer videos. Despite the existence of various video summarization datasets, a notable limitation is their limited amount of source videos, which hampers the effective fine-tuning of advanced large vision-language models (VLMs). Additionally, most existing datasets are created for video-to-video summarization, overlooking the contemporary need for multimodal video content summarization. Recent efforts have been made to expand from unimodal to multimodal video summarization, categorizing the task into three sub-tasks based on the summary's modality: video-to-video (V2V), video-to-text (V2T), and a combination of video and text summarization (V2VT). However, the textual summaries in previous multimodal datasets are inadequate. To address these issues, we introduce Instruct-V2Xum, a cross-modal video summarization dataset featuring 30,000 diverse videos sourced from YouTube, with lengths ranging from 40 to 940 seconds and an average summarization ratio of 16.39\%. Each video summary in Instruct-V2Xum is paired with a textual summary that references specific frame indexes, facilitating the generation of aligned video and textual summaries. In addition, we propose a new video summarization framework named V2Xum-LLM. V2Xum-LLM, specifically V2Xum-LLaMA in this study, is the first framework that unifies different video summarization tasks into one large language model's (LLM) text decoder and achieves task-controllable video summarization with temporal prompts and task instructions. Experiments show that V2Xum-LLaMA outperforms strong baseline models on multiple video summarization tasks. Furthermore, we propose an enhanced evaluation metric for V2V and V2VT summarization tasks.

Via

Access Paper or Ask Questions

AVicuna: Audio-Visual LLM with Interleaver and Context-Boundary Alignment for Temporal Referential Dialogue

Mar 24, 2024

Yunlong Tang, Daiki Shimada, Jing Bi, Chenliang Xu

Figure 1 for AVicuna: Audio-Visual LLM with Interleaver and Context-Boundary Alignment for Temporal Referential Dialogue

Figure 2 for AVicuna: Audio-Visual LLM with Interleaver and Context-Boundary Alignment for Temporal Referential Dialogue

Figure 3 for AVicuna: Audio-Visual LLM with Interleaver and Context-Boundary Alignment for Temporal Referential Dialogue

Figure 4 for AVicuna: Audio-Visual LLM with Interleaver and Context-Boundary Alignment for Temporal Referential Dialogue

Abstract:In everyday communication, humans frequently use speech and gestures to refer to specific areas or objects, a process known as Referential Dialogue (RD). While prior studies have investigated RD through Large Language Models (LLMs) or Large Multimodal Models (LMMs) in static contexts, the exploration of Temporal Referential Dialogue (TRD) within audio-visual media remains limited. Two primary challenges hinder progress in this field: (1) the absence of comprehensive, untrimmed audio-visual video datasets with precise temporal annotations, and (2) the need for methods to integrate complex temporal auditory and visual cues effectively. To address these challenges, we introduce a novel framework to generate PU-VALOR, an extensive audio-visual dataset comprising over 114,000 untrimmed videos with accurate temporal demarcations. We also present AVicuna, featuring an Audio-Visual Tokens Interleaver (AVTI) that ensures the temporal alignment of audio-visual information. Additionally, we develop the A5-222K dataset, encompassing more than 200,000 audio-text pairings, to facilitate the audio and text alignments. Our experiments demonstrate that AVicuna can effectively handle TRD in audio-visual videos and achieve state-of-the-art performance on various audio-visual video understanding tasks, particularly in untrimmed videos. We further investigate the optimal audio-interleaving rate for interleaved audio-visual inputs, which maximizes performance on the Audio-Visual Event Dense Localization task.

Via

Access Paper or Ask Questions

Adaptive Super Resolution For One-Shot Talking-Head Generation

Mar 23, 2024

Luchuan Song, Pinxin Liu, Guojun Yin, Chenliang Xu

Abstract:The one-shot talking-head generation learns to synthesize a talking-head video with one source portrait image under the driving of same or different identity video. Usually these methods require plane-based pixel transformations via Jacobin matrices or facial image warps for novel poses generation. The constraints of using a single image source and pixel displacements often compromise the clarity of the synthesized images. Some methods try to improve the quality of synthesized videos by introducing additional super-resolution modules, but this will undoubtedly increase computational consumption and destroy the original data distribution. In this work, we propose an adaptive high-quality talking-head video generation method, which synthesizes high-resolution video without additional pre-trained modules. Specifically, inspired by existing super-resolution methods, we down-sample the one-shot source image, and then adaptively reconstruct high-frequency details via an encoder-decoder module, resulting in enhanced video clarity. Our method consistently improves the quality of generated videos through a straightforward yet effective strategy, substantiated by quantitative and qualitative evaluations. The code and demo video are available on: \url{https://github.com/Songluchuan/AdaSR-TalkingHead/}.

* 5 pages, 3 figures

Via

Access Paper or Ask Questions

Forward Learning for Gradient-based Black-box Saliency Map Generation

Mar 22, 2024

Zeliang Zhang, Mingqian Feng, Jinyang Jiang, Rongyi Zhu, Yijie Peng, Chenliang Xu

Figure 1 for Forward Learning for Gradient-based Black-box Saliency Map Generation

Figure 2 for Forward Learning for Gradient-based Black-box Saliency Map Generation

Figure 3 for Forward Learning for Gradient-based Black-box Saliency Map Generation

Figure 4 for Forward Learning for Gradient-based Black-box Saliency Map Generation

Abstract:Gradient-based saliency maps are widely used to explain deep neural network decisions. However, as models become deeper and more black-box, such as in closed-source APIs like ChatGPT, computing gradients become challenging, hindering conventional explanation methods. In this work, we introduce a novel unified framework for estimating gradients in black-box settings and generating saliency maps to interpret model decisions. We employ the likelihood ratio method to estimate output-to-input gradients and utilize them for saliency map generation. Additionally, we propose blockwise computation techniques to enhance estimation accuracy. Extensive experiments in black-box settings validate the effectiveness of our method, demonstrating accurate gradient estimation and explainability of generated saliency maps. Furthermore, we showcase the scalability of our approach by applying it to explain GPT-Vision, revealing the continued relevance of gradient-based explanation methods in the era of large, closed-source, and black-box models.

Via

Access Paper or Ask Questions

Discover and Mitigate Multiple Biased Subgroups in Image Classifiers

Mar 20, 2024

Zeliang Zhang, Mingqian Feng, Zhiheng Li, Chenliang Xu

Figure 1 for Discover and Mitigate Multiple Biased Subgroups in Image Classifiers

Figure 2 for Discover and Mitigate Multiple Biased Subgroups in Image Classifiers

Figure 3 for Discover and Mitigate Multiple Biased Subgroups in Image Classifiers

Figure 4 for Discover and Mitigate Multiple Biased Subgroups in Image Classifiers

Abstract:Machine learning models can perform well on in-distribution data but often fail on biased subgroups that are underrepresented in the training data, hindering the robustness of models for reliable applications. Such subgroups are typically unknown due to the absence of subgroup labels. Discovering biased subgroups is the key to understanding models' failure modes and further improving models' robustness. Most previous works of subgroup discovery make an implicit assumption that models only underperform on a single biased subgroup, which does not hold on in-the-wild data where multiple biased subgroups exist. In this work, we propose Decomposition, Interpretation, and Mitigation (DIM), a novel method to address a more challenging but also more practical problem of discovering multiple biased subgroups in image classifiers. Our approach decomposes the image features into multiple components that represent multiple subgroups. This decomposition is achieved via a bilinear dimension reduction method, Partial Least Square (PLS), guided by useful supervision from the image classifier. We further interpret the semantic meaning of each subgroup component by generating natural language descriptions using vision-language foundation models. Finally, DIM mitigates multiple biased subgroups simultaneously via two strategies, including the data- and model-centric strategies. Extensive experiments on CIFAR-100 and Breeds datasets demonstrate the effectiveness of DIM in discovering and mitigating multiple biased subgroups. Furthermore, DIM uncovers the failure modes of the classifier on Hard ImageNet, showcasing its broader applicability to understanding model bias in image classifiers. The code is available at https://github.com/ZhangAIPI/DIM.

* CVPR 2024. Code is available at https://github.com/ZhangAIPI/DIM

Via

Access Paper or Ask Questions

Approximated Likelihood Ratio: A Forward-Only and Parallel Framework for Boosting Neural Network Training

Mar 18, 2024

Zeliang Zhang, Jinyang Jiang, Zhuo Liu, Susan Liang, Yijie Peng, Chenliang Xu

Figure 1 for Approximated Likelihood Ratio: A Forward-Only and Parallel Framework for Boosting Neural Network Training

Figure 2 for Approximated Likelihood Ratio: A Forward-Only and Parallel Framework for Boosting Neural Network Training

Figure 3 for Approximated Likelihood Ratio: A Forward-Only and Parallel Framework for Boosting Neural Network Training

Figure 4 for Approximated Likelihood Ratio: A Forward-Only and Parallel Framework for Boosting Neural Network Training

Abstract:Efficient and biologically plausible alternatives to backpropagation in neural network training remain a challenge due to issues such as high computational complexity and additional assumptions about neural networks, which limit scalability to deeper networks. The likelihood ratio method offers a promising gradient estimation strategy but is constrained by significant memory consumption, especially when deploying multiple copies of data to reduce estimation variance. In this paper, we introduce an approximation technique for the likelihood ratio (LR) method to alleviate computational and memory demands in gradient estimation. By exploiting the natural parallelism during the backward pass using LR, we further provide a high-performance training strategy, which pipelines both the forward and backward pass, to make it more suitable for the computation on specialized hardware. Extensive experiments demonstrate the effectiveness of the approximation technique in neural network training. This work underscores the potential of the likelihood ratio method in achieving high-performance neural network training, suggesting avenues for further exploration.

Via

Access Paper or Ask Questions

OSCaR: Object State Captioning and State Change Representation

Feb 28, 2024

Nguyen Nguyen, Jing Bi, Ali Vosoughi, Yapeng Tian, Pooyan Fazli, Chenliang Xu

Figure 1 for OSCaR: Object State Captioning and State Change Representation

Figure 2 for OSCaR: Object State Captioning and State Change Representation

Figure 3 for OSCaR: Object State Captioning and State Change Representation

Figure 4 for OSCaR: Object State Captioning and State Change Representation

Abstract:The capability of intelligent models to extrapolate and comprehend changes in object states is a crucial yet demanding aspect of AI research, particularly through the lens of human interaction in real-world settings. This task involves describing complex visual environments, identifying active objects, and interpreting their changes as conveyed through language. Traditional methods, which isolate object captioning and state change detection, offer a limited view of dynamic environments. Moreover, relying on a small set of symbolic words to represent changes has restricted the expressiveness of language. To address these challenges, in this paper, we introduce the Object State Captioning and State Change Representation (OSCaR) dataset and benchmark. OSCaR consists of 14,084 annotated video segments with nearly 1,000 unique objects from various egocentric video collections. It sets a new testbed for evaluating multimodal large language models (MLLMs). Our experiments demonstrate that while MLLMs show some skill, they lack a full understanding of object state changes. The benchmark includes a fine-tuned model that, despite initial capabilities, requires significant improvements in accuracy and generalization ability for effective understanding of these changes. Our code and dataset are available at https://github.com/nguyennm1024/OSCaR.

* 11 pages, 6 figures, 5 tables

Via

Access Paper or Ask Questions

Efficiently Leveraging Linguistic Priors for Scene Text Spotting

Feb 27, 2024

Nguyen Nguyen, Yapeng Tian, Chenliang Xu

Abstract:Incorporating linguistic knowledge can improve scene text recognition, but it is questionable whether the same holds for scene text spotting, which typically involves text detection and recognition. This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models. This allows the model to capture the relationship between characters in the same word. Additionally, we introduce a technique to generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning. As a result, the newly created text distributions are more informative than pure one-hot encoding, leading to improved spotting and recognition performance. Our method is simple and efficient, and it can easily be integrated into existing auto-regressive-based approaches. Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words. It significantly improves both state-of-the-art scene text spotting and recognition pipelines, achieving state-of-the-art results on several benchmarks.

* 10 pages

Via

Access Paper or Ask Questions