Mahyar Khayatkhoei

MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs

Feb 24, 2025

Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, Filip Ilievski

Abstract:Multimodal Large Language Models (MLLMs) have experienced rapid progress in visual recognition tasks in recent years. Given their potential integration into many critical applications, it is important to understand the limitations of their visual perception. In this work, we study whether MLLMs can perceive small visual details as effectively as large ones when answering questions about images. We observe that their performance is very sensitive to the size of the visual subject of the question, and further show that this effect is in fact causal by conducting an intervention study. Next, we study the attention patterns of MLLMs when answering visual questions, and intriguingly find that they consistently know where to look, even when they provide the wrong answer. Based on these findings, we then propose training-free visual intervention methods that leverage the internal knowledge of any MLLM itself, in the form of attention and gradient maps, to enhance its perception of small visual details. We evaluate our proposed methods on two widely-used MLLMs and seven visual question answering benchmarks and show that they can significantly improve MLLMs' accuracy without requiring any training. Our results elucidate the risk of applying MLLMs to visual recognition tasks concerning small details and indicate that visual intervention using the model's internal state is a promising direction to mitigate this risk.

* Published as a conference paper at ICLR 2025. Code at: https://github.com/saccharomycetes/mllms_know
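The abstract above describes using a model's own attention maps to guide a visual intervention. As a rough illustration only, the sketch below crops an image around the window of a 2D attention map with the largest total weight. It is not the authors' implementation (see the linked repository for that): the function name `attention_guided_crop`, the `crop_frac` parameter, and the assumption that a patch-level attention map has already been extracted from the model are all hypothetical.

```python
import numpy as np


def attention_guided_crop(image, attn, crop_frac=0.5):
    """Crop `image` around the highest-attention window of `attn`.

    image: array of shape (H, W, C).
    attn: non-negative array of shape (h, w), e.g. one value per image patch
          (assumed to be extracted from the model beforehand).
    crop_frac: hypothetical parameter, window size as a fraction of each side.
    Returns the cropped image and the pixel box (top, left, bottom, right).
    """
    H, W = image.shape[:2]
    h, w = attn.shape
    win_h = max(1, int(round(h * crop_frac)))
    win_w = max(1, int(round(w * crop_frac)))

    # Summed-area table so every window sum costs O(1).
    sat = np.zeros((h + 1, w + 1))
    sat[1:, 1:] = attn.cumsum(axis=0).cumsum(axis=1)

    # Attention mass of every win_h x win_w window, indexed by its top-left cell.
    sums = (
        sat[win_h:, win_w:]
        - sat[:-win_h, win_w:]
        - sat[win_h:, :-win_w]
        + sat[:-win_h, :-win_w]
    )
    r, c = np.unravel_index(np.argmax(sums), sums.shape)

    # Map attention-grid coordinates back to pixel coordinates.
    top, left = r * H // h, c * W // w
    bottom, right = (r + win_h) * H // h, (c + win_w) * W // w
    return image[top:bottom, left:right], (top, left, bottom, right)


# Toy usage: attention concentrated in the bottom-right quarter of the grid.
demo_image = np.zeros((8, 8, 3))
demo_attn = np.zeros((4, 4))
demo_attn[2:4, 2:4] = 1.0
demo_crop, demo_box = attention_guided_crop(demo_image, demo_attn)
```

In a pipeline of this kind, the crop would then be passed back to the MLLM together with the original question; how the attention map is obtained and how the crop is combined with the full image are design choices this sketch leaves open.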

Look, Learn and Leverage (L$^3$): Mitigating Visual-Domain Shift and Discovering Intrinsic Relations via Symbolic Alignment

Aug 30, 2024

Hanchen Xie, Jiageng Zhu, Mahyar Khayatkhoei, Jiazhi Li, Wael AbdAlmageed

Abstract:Modern deep learning models have demonstrated outstanding performance on discovering the underlying mechanisms when both visual appearance and intrinsic relations (e.g., causal structure) data are sufficient, such as Disentangled Representation Learning (DRL), Causal Representation Learning (CRL) and Visual Question Answering (VQA) methods. However, the generalization ability of these models is challenged when the visual domain shifts and the relations data is absent during finetuning. To address this challenge, we propose a novel learning framework, Look, Learn and Leverage (L$^3$), which decomposes the learning process into three distinct phases and systematically utilizes class-agnostic segmentation masks as the common symbolic space to align visual domains. Thus, a relations discovery model can be trained on the source domain, and when the visual domain shifts and the intrinsic relations are absent, the pretrained relations discovery model can be directly reused and maintain satisfactory performance. Extensive performance evaluations are conducted on three different tasks: DRL, CRL and VQA, and show outstanding results on all three tasks, which reveals the advantages of L$^3$.

* 17 pages, 9 figures, 6 tables

An Investigation on The Position Encoding in Vision-Based Dynamics Prediction

Aug 27, 2024

Jiageng Zhu, Hanchen Xie, Jiazhi Li, Mahyar Khayatkhoei, Wael AbdAlmageed

Abstract:Despite the success of vision-based dynamics prediction models, which predict object states by utilizing RGB images and simple object descriptions, they are challenged by environment misalignments. Although the literature has demonstrated that unifying visual domains with both environment context and object abstract, such as semantic segmentation and bounding boxes, can effectively mitigate the visual domain misalignment challenge, discussions have focused on the abstract of environment context, and the insight of using the bounding box as the object abstract is under-explored. Furthermore, we notice that, as empirical results in the literature show, even when the visual appearance of objects is removed, object bounding boxes alone, instead of being directly fed into the network, can indirectly provide sufficient position information via the Region of Interest Pooling operation for dynamics prediction. However, previous literature overlooked how such position information is implicitly encoded in the dynamics prediction model. Thus, in this paper, we provide detailed studies to investigate the process and necessary conditions for encoding position information into output features when using the bounding box as the object abstract. Furthermore, we study the limitation of solely using object abstracts, namely that the dynamics prediction performance will be jeopardized when the environment context varies.

* 13 pages, 4 tables, and 3 figures. Accepted to ECCV2024 eXCV workshop

ManiFPT: Defining and Analyzing Fingerprints of Generative Models

Feb 29, 2024

Hae Jin Song, Mahyar Khayatkhoei, Wael AbdAlmageed

Abstract:Recent works have shown that generative models leave traces of their underlying generative process on the generated samples, broadly referred to as fingerprints of a generative model, and have studied their utility in detecting synthetic images from real ones. However, the extent to which these fingerprints can distinguish between various types of synthetic images and help identify the underlying generative process remains under-explored. In particular, the very definition of a fingerprint remains unclear, to our knowledge. To that end, in this work, we formalize the definition of artifact and fingerprint in generative models, propose an algorithm for computing them in practice, and finally study its effectiveness in distinguishing a large array of different generative models. We find that using our proposed definition can significantly improve the performance on the task of identifying the underlying generative process from samples (model attribution) compared to existing methods. Additionally, we study the structure of the fingerprints, and observe that it is very predictive of the effect of different design choices on the generative process.

* Accepted to CVPR 2024

Exploring Perceptual Limitation of Multimodal Large Language Models

Feb 12, 2024

Jiarui Zhang, Jinyi Hu, Mahyar Khayatkhoei, Filip Ilievski, Maosong Sun

Abstract:Multimodal Large Language Models (MLLMs) have recently shown remarkable perceptual capability in answering visual questions; however, little is known about the limits of their perception. In particular, while prior works have provided anecdotal evidence of MLLMs' sensitivity to object size, this phenomenon and its underlying causes have not been explored comprehensively. In this work, we quantitatively study the perception of small visual objects in several state-of-the-art MLLMs and reveal a pervasive limitation in answering questions about small objects in images. Next, we identify four independent factors that can contribute to this limitation -- object quality, size, distractors, and location -- and conduct controlled intervention studies to measure the effect of each factor on MLLMs' perception. In particular, we find that lower object quality and smaller object size can both independently reduce MLLMs' ability to answer visual questions. More surprisingly, we find that the location of the object in the image and the presence of visual distractors can also significantly reduce MLLMs' question answering accuracy. Our study provides a better understanding of the perceptual limitation of MLLMs and contributes new evaluation protocols for analyzing the perception of future MLLMs. To facilitate further investigations, we release our code and data.

* 14 pages, 14 figures, 3 tables

Unsupervised Multimodal Deepfake Detection Using Intra- and Cross-Modal Inconsistencies

Nov 28, 2023

Mulin Tian, Mahyar Khayatkhoei, Joe Mathai, Wael AbdAlmageed

Abstract:Deepfake videos present an increasing threat to society with potentially negative impact on criminal justice, democracy, and personal safety and privacy. Meanwhile, detecting deepfakes at scale remains a very challenging task that often requires labeled training data from existing deepfake generation methods. Further, even the most accurate supervised deepfake detection methods do not generalize to deepfakes generated using new generation methods. In this paper, we introduce a novel unsupervised approach for detecting deepfake videos by measuring intra- and cross-modal consistency among multimodal features; specifically visual, audio, and identity features. The fundamental hypothesis behind the proposed detection method is that since deepfake generation attempts to transfer the facial motion of one identity to another, these methods will eventually encounter a trade-off between motion and identity that inevitably leads to detectable inconsistencies. We validate our method through extensive experimentation, demonstrating the existence of significant intra- and cross-modal inconsistencies in deepfake videos, which can be effectively utilized to detect them with high accuracy. Our proposed method is scalable because it does not require pristine samples at inference, generalizable because it is trained only on real data, and explainable since it can pinpoint the exact location of modality inconsistencies, which are then verifiable by a human expert.

* 11 pages, 3 figures, 2 tables

SABAF: Removing Strong Attribute Bias from Neural Networks with Adversarial Filtering

Nov 16, 2023

Jiazhi Li, Mahyar Khayatkhoei, Jiageng Zhu, Hanchen Xie, Mohamed E. Hussein, Wael AbdAlmageed

Abstract:Ensuring a neural network is not relying on protected attributes (e.g., race, sex, age) for prediction is crucial in advancing fair and trustworthy AI. While several promising methods for removing attribute bias in neural networks have been proposed, their limitations remain under-explored. To that end, in this work, we mathematically and empirically reveal the limitation of existing attribute bias removal methods in the presence of strong bias and propose a new method that can mitigate this limitation. Specifically, we first derive a general non-vacuous information-theoretical upper bound on the performance of any attribute bias removal method in terms of the bias strength, revealing that they are effective only when the inherent bias in the dataset is relatively weak. Next, we derive a necessary condition for the existence of any method that can remove attribute bias regardless of the bias strength. Inspired by this condition, we then propose a new method using an adversarial objective that directly filters out protected attributes in the input space while maximally preserving all other attributes, without requiring any specific target label. The proposed method achieves state-of-the-art performance in both strong and moderate bias settings. We provide extensive experiments on synthetic, image, and census datasets to verify the derived theoretical bound and its consequences in practice, and evaluate the effectiveness of the proposed method in removing strong attribute bias.

* 35 pages, 18 figures, 32 tables. This work is an extended version of our paper (arXiv:2310.04955). Code will be released at https://github.com/jiazhi412/strong_attribute_bias

Visual Cropping Improves Zero-Shot Question Answering of Multimodal Large Language Models

Oct 24, 2023

Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, Filip Ilievski

Abstract:Multimodal Large Language Models (LLMs) have recently achieved promising zero-shot accuracy on visual question answering (VQA) -- a fundamental task affecting various downstream applications and domains. Given the great potential for the broad use of these models, it is important to investigate their limitations in dealing with different image and question properties. In this work, we investigate whether multimodal LLMs can perceive small details as well as large details in images. In particular, we show that their zero-shot accuracy in answering visual questions is very sensitive to the size of the visual subject of the question, declining up to $46\%$ with size. Furthermore, we show that this effect is causal by observing that human visual cropping can significantly mitigate their sensitivity to size. Inspired by the usefulness of human cropping, we then propose three automatic visual cropping methods as inference time mechanisms to improve the zero-shot performance of multimodal LLMs. We study their effectiveness on four popular VQA datasets, and a subset of the VQAv2 dataset tailored towards fine visual details. Our findings suggest that multimodal LLMs should be used with caution in detail-sensitive VQA applications, and that visual cropping is a promising direction to improve their zero-shot performance. Our code and data are publicly available.

* 11 pages, 4 figures, 4 tables

Information-Theoretic Bounds on The Removal of Attribute-Specific Bias From Neural Networks

Oct 08, 2023

Jiazhi Li, Mahyar Khayatkhoei, Jiageng Zhu, Hanchen Xie, Mohamed E. Hussein, Wael AbdAlmageed

Abstract:Ensuring a neural network is not relying on protected attributes (e.g., race, sex, age) for predictions is crucial in advancing fair and trustworthy AI. While several promising methods for removing attribute bias in neural networks have been proposed, their limitations remain under-explored. In this work, we mathematically and empirically reveal an important limitation of attribute bias removal methods in the presence of strong bias. Specifically, we derive a general non-vacuous information-theoretical upper bound on the performance of any attribute bias removal method in terms of the bias strength. We provide extensive experiments on synthetic, image, and census datasets to verify the theoretical bound and its consequences in practice. Our findings show that existing attribute bias removal methods are effective only when the inherent bias in the dataset is relatively weak, thus cautioning against the use of these methods in smaller datasets where strong attribute bias can occur, and advocating the need for methods that can overcome this limitation.

Shadow Datasets, New challenging datasets for Causal Representation Learning

Aug 11, 2023

Jiageng Zhu, Hanchen Xie, Jianhua Wu, Jiazhi Li, Mahyar Khayatkhoei, Mohamed E. Hussein, Wael AbdAlmageed

Abstract:Discovering causal relations among semantic factors is an emergent topic in representation learning. Most causal representation learning (CRL) methods are fully supervised, which is impractical due to costly labeling. To resolve this restriction, weakly supervised CRL methods were introduced. To evaluate CRL performance, four existing datasets, Pendulum, Flow, CelebA(BEARD) and CelebA(SMILE), are utilized. However, existing CRL datasets are limited to simple graphs with few generative factors. Thus, we propose two new datasets with a larger number of diverse generative factors and more sophisticated causal graphs. In addition, for the current real datasets, CelebA(BEARD) and CelebA(SMILE), the originally proposed causal graphs are not aligned with the dataset distributions. Thus, we propose modifications to them.
