Abstract:Large vision-language models have become widely adopted to advance in various domains. However, developing a trustworthy system with minimal interpretable characteristics of large-scale models presents a significant challenge. One of the most prevalent terms associated with the fallacy functions caused by these systems is hallucination, where the language model generates a response that does not correspond to the visual content. To mitigate this problem, several approaches have been developed, and one prominent direction is to ameliorate the decoding process. In this paper, we propose a new Bijective Maximum Likelihood Learning (BIMA) approach to hallucination mitigation using normalizing flow theories. The proposed BIMA method can efficiently mitigate the hallucination problem in prevailing vision-language models, resulting in significant improvements. Notably, BIMA achieves the average F1 score of 85.06% on POPE benchmark and remarkably reduce CHAIRS and CHAIRI by 7.6% and 2.6%, respectively. To the best of our knowledge, this is one of the first studies that contemplates the bijection means to reduce hallucination induced by large vision-language models.
Abstract:Referring Multi-Object Tracking (RMOT) is a relatively new concept that has rapidly gained traction as a promising research direction at the intersection of computer vision and natural language processing. Unlike traditional multi-object tracking, RMOT identifies and tracks objects and incorporates textual descriptions for object class names, making the approach more intuitive. Various techniques have been proposed to address this challenging problem; however, most require the training of the entire network due to their end-to-end nature. Among these methods, iKUN has emerged as a particularly promising solution. Therefore, we further explore its pipeline and enhance its performance. In this paper, we introduce a practical module dubbed Memory-Efficient Cross-modality -- MEX. This memory-efficient technique can be directly applied to off-the-shelf trackers like iKUN, resulting in significant architectural improvements. Our method proves effective during inference on a single GPU with 4 GB of memory. Among the various benchmarks, the Refer-KITTI dataset, which offers diverse autonomous driving scenes with relevant language expressions, is particularly useful for studying this problem. Empirically, our method demonstrates effectiveness and efficiency regarding HOTA tracking scores, substantially improving memory allocation and processing speed.