Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Huu-Thien Tran

Directed-Tokens: A Robust Multi-Modality Alignment Approach to Large Language-Vision Models

Aug 19, 2025

Thanh-Dat Truong, Huu-Thien Tran, Tran Thai Son, Bhiksha Raj, Khoa Luu

Abstract:Large multimodal models (LMMs) have gained impressive performance due to their outstanding capability in various understanding tasks. However, these models still suffer from some fundamental limitations related to robustness and generalization due to the alignment and correlation between visual and textual features. In this paper, we introduce a simple but efficient learning mechanism for improving the robust alignment between visual and textual modalities by solving shuffling problems. In particular, the proposed approach can improve reasoning capability, visual understanding, and cross-modality alignment by introducing two new tasks: reconstructing the image order and the text order into the LMM's pre-training and fine-tuning phases. In addition, we propose a new directed-token approach to capture visual and textual knowledge, enabling the capability to reconstruct the correct order of visual inputs. Then, we introduce a new Image-to-Response Guided loss to further improve the visual understanding of the LMM in its responses. The proposed approach consistently achieves state-of-the-art (SoTA) performance compared with prior LMMs on academic task-oriented and instruction-following LMM benchmarks.

Via

Access Paper or Ask Questions

BIMA: Bijective Maximum Likelihood Learning Approach to Hallucination Prediction and Mitigation in Large Vision-Language Models

May 30, 2025

Huu-Thien Tran, Thanh-Dat Truong, Khoa Luu

Figure 1 for BIMA: Bijective Maximum Likelihood Learning Approach to Hallucination Prediction and Mitigation in Large Vision-Language Models

Figure 2 for BIMA: Bijective Maximum Likelihood Learning Approach to Hallucination Prediction and Mitigation in Large Vision-Language Models

Figure 3 for BIMA: Bijective Maximum Likelihood Learning Approach to Hallucination Prediction and Mitigation in Large Vision-Language Models

Figure 4 for BIMA: Bijective Maximum Likelihood Learning Approach to Hallucination Prediction and Mitigation in Large Vision-Language Models

Abstract:Large vision-language models have become widely adopted to advance in various domains. However, developing a trustworthy system with minimal interpretable characteristics of large-scale models presents a significant challenge. One of the most prevalent terms associated with the fallacy functions caused by these systems is hallucination, where the language model generates a response that does not correspond to the visual content. To mitigate this problem, several approaches have been developed, and one prominent direction is to ameliorate the decoding process. In this paper, we propose a new Bijective Maximum Likelihood Learning (BIMA) approach to hallucination mitigation using normalizing flow theories. The proposed BIMA method can efficiently mitigate the hallucination problem in prevailing vision-language models, resulting in significant improvements. Notably, BIMA achieves the average F1 score of 85.06% on POPE benchmark and remarkably reduce CHAIRS and CHAIRI by 7.6% and 2.6%, respectively. To the best of our knowledge, this is one of the first studies that contemplates the bijection means to reduce hallucination induced by large vision-language models.

* CVPRW 2025, 8 pages, 4 figures

Via

Access Paper or Ask Questions

MEX: Memory-efficient Approach to Referring Multi-Object Tracking

Feb 19, 2025

Huu-Thien Tran, Phuoc-Sang Pham, Thai-Son Tran, Khoa Luu

Abstract:Referring Multi-Object Tracking (RMOT) is a relatively new concept that has rapidly gained traction as a promising research direction at the intersection of computer vision and natural language processing. Unlike traditional multi-object tracking, RMOT identifies and tracks objects and incorporates textual descriptions for object class names, making the approach more intuitive. Various techniques have been proposed to address this challenging problem; however, most require the training of the entire network due to their end-to-end nature. Among these methods, iKUN has emerged as a particularly promising solution. Therefore, we further explore its pipeline and enhance its performance. In this paper, we introduce a practical module dubbed Memory-Efficient Cross-modality -- MEX. This memory-efficient technique can be directly applied to off-the-shelf trackers like iKUN, resulting in significant architectural improvements. Our method proves effective during inference on a single GPU with 4 GB of memory. Among the various benchmarks, the Refer-KITTI dataset, which offers diverse autonomous driving scenes with relevant language expressions, is particularly useful for studying this problem. Empirically, our method demonstrates effectiveness and efficiency regarding HOTA tracking scores, substantially improving memory allocation and processing speed.

* 6 pages, 6 figures, 2024 International Conference on Advanced Technologies for Communications (ATC), Signal Processing Track

Via

Access Paper or Ask Questions