Shivang Chopra

MedMoE: Modality-Specialized Mixture of Experts for Medical Vision-Language Understanding

Jun 11, 2025

Shivang Chopra, Gabriela Sanchez-Rodriguez, Lingchao Mao, Andrew J Feola, Jing Li, Zsolt Kira

Abstract:Different medical imaging modalities capture diagnostic information at varying spatial resolutions, from coarse global patterns to fine-grained localized structures. However, most existing vision-language frameworks in the medical domain apply a uniform strategy for local feature extraction, overlooking the modality-specific demands. In this work, we present MedMoE, a modular and extensible vision-language processing framework that dynamically adapts visual representation based on the diagnostic context. MedMoE incorporates a Mixture-of-Experts (MoE) module conditioned on the report type, which routes multi-scale image features through specialized expert branches trained to capture modality-specific visual semantics. These experts operate over feature pyramids derived from a Swin Transformer backbone, enabling spatially adaptive attention to clinically relevant regions. This framework produces localized visual representations aligned with textual descriptions, without requiring modality-specific supervision at inference. Empirical results on diverse medical benchmarks demonstrate that MedMoE improves alignment and retrieval performance across imaging modalities, underscoring the value of modality-specialized visual representations in clinical vision-language systems.
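
The routing idea in this abstract can be illustrated with a generic soft mixture-of-experts step. This is a minimal sketch and not the MedMoE architecture: the gate table, the report-type names, and the two-expert setup are all hypothetical, and the abstract does not say how the gate is parameterized.

```python
import math

# Hypothetical gate table: one row of logits per report type, one logit per
# expert. In a trained model these would be learned; the values are made up.
GATE_LOGITS_BY_REPORT_TYPE = {
    "report_type_a": [2.0, 0.0],
    "report_type_b": [0.0, 2.0],
}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_output(gate_logits, expert_outputs):
    """Combine per-expert feature vectors using softmax gate weights."""
    weights = softmax(gate_logits)
    dim = len(expert_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]


if __name__ == "__main__":
    # Two toy expert outputs for the same image features.
    experts = [[1.0, 0.0], [0.0, 1.0]]
    for report_type, logits in GATE_LOGITS_BY_REPORT_TYPE.items():
        print(report_type, mixture_output(logits, experts))
```

A soft gate keeps every expert differentiable during training; a hard top-1 gate would be an equally plausible design, and the abstract does not state which one is used.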

Mimicking or Reasoning: Rethinking Multi-Modal In-Context Learning in Vision-Language Models

Jun 09, 2025

Chengyue Huang, Yuchen Zhu, Sichen Zhu, Jingyun Xiao, Moises Andrade, Shivang Chopra, Zsolt Kira

Abstract:Vision-language models (VLMs) are widely assumed to exhibit in-context learning (ICL), a property similar to that of their language-only counterparts. While recent work suggests VLMs can perform multimodal ICL (MM-ICL), studies show they often rely on shallow heuristics -- such as copying or majority voting -- rather than true task understanding. We revisit this assumption by evaluating VLMs under distribution shifts, where support examples come from a dataset different from the query. Surprisingly, performance often degrades with more demonstrations, and models tend to copy answers rather than learn from them. To investigate further, we propose a new MM-ICL with Reasoning pipeline that augments each demonstration with a generated rationale alongside the answer. We conduct extensive and comprehensive experiments on both perception- and reasoning-required datasets with open-source VLMs ranging from 3B to 72B and proprietary models such as Gemini 2.0. We conduct controlled studies varying shot count, retrieval method, rationale quality, and distribution. Our results show limited performance sensitivity across these factors, suggesting that current VLMs do not effectively utilize demonstration-level information as intended in MM-ICL.

FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering

May 27, 2025

Chengyue Huang, Brisa Maneechotesuwan, Shivang Chopra, Zsolt Kira

Abstract:Visual question answering (VQA) systems face significant challenges when adapting to real-world data shifts, especially in multi-modal contexts. While robust fine-tuning strategies are essential for maintaining performance across in-distribution (ID) and out-of-distribution (OOD) scenarios, current evaluation settings are primarily unimodal or particular to some types of OOD, offering limited insight into the complexities of multi-modal contexts. In this work, we propose a new benchmark FRAMES-VQA (Fine-Tuning Robustness across Multi-Modal Shifts in VQA) for evaluating robust fine-tuning for VQA tasks. We utilize ten existing VQA benchmarks, including VQAv2, IV-VQA, VQA-CP, OK-VQA and others, and categorize them into ID, near and far OOD datasets covering uni-modal, multi-modal and adversarial distribution shifts. We first conduct a comprehensive comparison of existing robust fine-tuning methods. We then quantify the distribution shifts by calculating the Mahalanobis distance using uni-modal and multi-modal embeddings extracted from various models. Further, we perform an extensive analysis to explore the interactions between uni- and multi-modal shifts as well as modality importance for ID and OOD samples. These analyses offer valuable guidance on developing more robust fine-tuning methods to handle multi-modal distribution shifts. The code is available at https://github.com/chengyuehuang511/FRAMES-VQA .

* Accepted to CVPR 2025
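
The abstract quantifies distribution shift with the Mahalanobis distance over embeddings. The sketch below shows the distance itself on toy 2-D vectors, assuming the ID statistics are a sample mean and covariance. The helper names and the small ridge term are assumptions for illustration; the abstract does not give the benchmark's embedding models or estimation details.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance(rows, mean, ridge=1e-6):
    """Sample covariance matrix; the ridge term (an assumption) avoids singularity."""
    n, d = len(rows), len(mean)
    cov = [[0.0] * d for _ in range(d)]
    for row in rows:
        diff = [x - m for x, m in zip(row, mean)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += diff[i] * diff[j]
    for i in range(d):
        for j in range(d):
            cov[i][j] /= n - 1
        cov[i][i] += ridge
    return cov


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def mahalanobis(x, mean, cov):
    """sqrt((x - mean)^T cov^{-1} (x - mean))."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    z = solve(cov, diff)
    return math.sqrt(sum(d * zi for d, zi in zip(diff, z)))


if __name__ == "__main__":
    id_embeddings = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
    mu = mean_vector(id_embeddings)
    sigma = covariance(id_embeddings, mu)
    print("near sample:", mahalanobis([1.5, 1.0], mu, sigma))
    print("far sample:", mahalanobis([6.0, 6.0], mu, sigma))
```

Averaging this distance over an OOD dataset's embeddings gives one scalar per dataset, which is one plausible way to order datasets from near to far OOD.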

Directional Gradient Projection for Robust Fine-Tuning of Foundation Models

Feb 21, 2025

Chengyue Huang, Junjiao Tian, Brisa Maneechotesuwan, Shivang Chopra, Zsolt Kira

Abstract:Robust fine-tuning aims to adapt large foundation models to downstream tasks while preserving their robustness to distribution shifts. Existing methods primarily focus on constraining and projecting the current model towards the pre-trained initialization based on the magnitudes between fine-tuned and pre-trained weights, which often require extensive hyper-parameter tuning and can sometimes result in underfitting. In this work, we propose Directional Gradient Projection (DiGraP), a novel layer-wise trainable method that incorporates directional information from gradients to bridge regularization and multi-objective optimization. Besides demonstrating our method on image classification, as another contribution we generalize this area to the multi-modal evaluation settings for robust fine-tuning. Specifically, we first bridge the uni-modal and multi-modal gap by performing analysis on Image Classification reformulated Visual Question Answering (VQA) benchmarks and further categorize ten out-of-distribution (OOD) VQA datasets by distribution shift types and degree (i.e., near versus far OOD). Experimental results show that DiGraP consistently outperforms existing baselines across Image Classification and VQA tasks with discriminative and generative backbones, improving both in-distribution (ID) generalization and OOD robustness.

* Accepted to ICLR 2025
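
To make "directional information from gradients" concrete, here is a generic conflict-aware projection step of the kind used in multi-objective optimization (PCGrad-style). It is not the DiGraP update, which the abstract does not spell out; the function names and the choice of which gradient serves as the reference are assumptions.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_if_conflicting(grad, reference):
    """Remove from `grad` the component that opposes `reference`.

    If the two gradients agree in direction (non-negative inner product),
    `grad` is returned unchanged. Otherwise `grad` is projected onto the
    plane orthogonal to `reference`.
    """
    inner = dot(grad, reference)
    ref_sq = dot(reference, reference)
    if inner >= 0 or ref_sq == 0:
        return list(grad)
    scale = inner / ref_sq
    return [g - scale * r for g, r in zip(grad, reference)]


if __name__ == "__main__":
    task_grad = [1.0, -1.0]  # gradient of the fine-tuning loss (toy values)
    reg_grad = [0.0, 1.0]    # gradient of a regularization objective (toy values)
    print(project_if_conflicting(task_grad, reg_grad))
```

After projection the two gradients are orthogonal, so a step along the projected direction no longer increases the reference objective to first order.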

Refining Text-to-Image Generation: Towards Accurate Training-Free Glyph-Enhanced Image Generation

Mar 25, 2024

Sanyam Lakhanpal, Shivang Chopra, Vinija Jain, Aman Chadha, Man Luo

Abstract:Over the past few years, Text-to-Image (T2I) generation approaches based on diffusion models have gained significant attention. However, vanilla diffusion models often suffer from spelling inaccuracies in the text displayed within the generated images. The capability to generate visual text is crucial, offering both academic interest and a wide range of practical applications. To produce accurate visual text images, state-of-the-art techniques adopt a glyph-controlled image generation approach, consisting of a text layout generator followed by an image generator that is conditioned on the generated text layout. Nevertheless, our study reveals that these models still face three primary challenges, prompting us to develop a testbed to facilitate future research. We introduce a benchmark, LenCom-Eval, specifically designed for testing models' capability in generating images with Lengthy and Complex visual text. Subsequently, we introduce a training-free framework to enhance the two-stage generation approaches. We examine the effectiveness of our approach on both LenCom-Eval and MARIO-Eval benchmarks and demonstrate notable improvements across a range of evaluation metrics, including CLIPScore, OCR precision, recall, F1 score, accuracy, and edit distance scores. For instance, our proposed framework improves the backbone model, TextDiffuser, by more than 23% and 13.5% in terms of OCR word F1 on LenCom-Eval and MARIO-Eval, respectively. Our work makes a unique contribution to the field by focusing on generating images with long and rare text sequences, a niche previously unexplored by existing literature.
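
Two of the metrics named above, OCR word-level F1 and edit distance, can be sketched on plain strings. This assumes the OCR output and the target text are already available as strings and that words are matched as a case-insensitive multiset; the benchmarks' exact tokenization and matching rules are not given in the abstract, so treat those choices as assumptions.

```python
from collections import Counter


def word_prf(predicted, target):
    """Word-level precision, recall, and F1 using multiset overlap."""
    pred = Counter(predicted.lower().split())
    gold = Counter(target.lower().split())
    overlap = sum((pred & gold).values())
    precision = overlap / sum(pred.values()) if pred else 0.0
    recall = overlap / sum(gold.values()) if gold else 0.0
    if overlap == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    ocr_text = "grand openning sale"          # text read back from a generated image
    target_text = "grand opening sale today"  # text requested in the prompt
    print(word_prf(ocr_text, target_text))
    print(edit_distance("openning", "opening"))
```

Word F1 penalizes any misspelled word fully, while edit distance gives partial credit for near misses, which is why the two are usually reported together.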

Source-Free Domain Adaptation with Diffusion-Guided Source Data Generation

Feb 07, 2024

Shivang Chopra, Suraj Kothawade, Houda Aynaou, Aman Chadha

Abstract:This paper introduces a novel approach to leverage the generalization capability of Diffusion Models for Source-Free Domain Adaptation (DM-SFDA). Our proposed DM-SFDA method involves fine-tuning a pre-trained text-to-image diffusion model to generate source domain images using features from the target images to guide the diffusion process. Specifically, the pre-trained diffusion model is fine-tuned to generate source samples that minimize entropy and maximize confidence for the pre-trained source model. We then apply established unsupervised domain adaptation techniques to align the generated source images with target domain data. We validate our approach through comprehensive experiments across a range of datasets, including Office-31, Office-Home, and VisDA. The results highlight significant improvements in SFDA performance, showcasing the potential of diffusion models in generating contextually relevant, domain-specific images.

* arXiv admin note: substantial text overlap with arXiv:2310.01701
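
The "minimize entropy and maximize confidence" criterion can be made concrete with the two quantities it refers to, computed from a classifier's logits. This is only a sketch of the scoring side, assuming the source model exposes per-class logits; the toy logits are made up, and how the signal is fed back into diffusion fine-tuning is not described in the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def confidence(probs):
    """Confidence taken as the maximum class probability."""
    return max(probs)


if __name__ == "__main__":
    # Toy source-model logits for two generated images (hypothetical values).
    for name, logits in [("ambiguous", [0.1, 0.0, -0.1]), ("source-like", [6.0, 0.0, -1.0])]:
        probs = softmax(logits)
        print(name, "entropy:", entropy(probs), "confidence:", confidence(probs))
```

A generated image that the source model classifies with low entropy and high confidence is treated as closer to the source distribution, which is the intuition the abstract relies on.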

Learning to Discern: Imitating Heterogeneous Human Demonstrations with Preference and Representation Learning

Oct 22, 2023

Sachit Kuhar, Shuo Cheng, Shivang Chopra, Matthew Bronars, Danfei Xu

Abstract:Practical Imitation Learning (IL) systems rely on large human demonstration datasets for successful policy learning. However, challenges lie in maintaining the quality of collected data and addressing the suboptimal nature of some demonstrations, which can compromise the overall dataset quality and hence the learning outcome. Furthermore, the intrinsic heterogeneity in human behavior can produce equally successful but disparate demonstrations, further exacerbating the challenge of discerning demonstration quality. To address these challenges, this paper introduces Learning to Discern (L2D), an offline imitation learning framework for learning from demonstrations with diverse quality and style. Given a small batch of demonstrations with sparse quality labels, we learn a latent representation for temporally embedded trajectory segments. Preference learning in this latent space trains a quality evaluator that generalizes to new demonstrators exhibiting different styles. Empirically, we show that L2D can effectively assess and learn from varying demonstrations, thereby leading to improved policy performance across a range of tasks in both simulations and on a physical robot.

* To appear at the 7th Annual Conference on Robot Learning (CoRL) 2023
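
Preference learning over trajectory segments is commonly set up with a Bradley-Terry model, where a scalar quality score per segment determines the probability that one segment is preferred over another. The sketch below shows that loss on toy scores. It is an assumption that L2D uses this exact form; the abstract only says that preference learning trains the quality evaluator.

```python
import math


def preference_probability(score_a, score_b):
    """Bradley-Terry probability that segment A is preferred over segment B."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def preference_loss(score_a, score_b, a_preferred):
    """Negative log-likelihood of the observed preference label."""
    p = preference_probability(score_a, score_b)
    return -math.log(p if a_preferred else 1.0 - p)


if __name__ == "__main__":
    # Toy evaluator scores for two segments (hypothetical values).
    high_quality, low_quality = 2.0, 0.0
    print("consistent label:", preference_loss(high_quality, low_quality, True))
    print("contradicted label:", preference_loss(high_quality, low_quality, False))
```

Because only score differences enter the loss, sparse pairwise quality labels are enough to train the evaluator; absolute quality values are never required.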

Transcending Domains through Text-to-Image Diffusion: A Source-Free Approach to Domain Adaptation

Oct 14, 2023

Shivang Chopra, Suraj Kothawade, Houda Aynaou, Aman Chadha

Abstract:Domain Adaptation (DA) is a method for enhancing a model's performance on a target domain with inadequate annotated data by applying the information the model has acquired from a related source domain with sufficient labeled data. The escalating enforcement of data-privacy regulations such as HIPAA, COPPA, and FERPA has sparked a heightened interest in adapting models to novel domains while circumventing the need for direct access to the source data, a problem known as Source-Free Domain Adaptation (SFDA). In this paper, we propose a novel framework for SFDA that generates source data using a text-to-image diffusion model trained on the target domain samples. Our method starts by training a text-to-image diffusion model on the labeled target domain samples, which is then fine-tuned using the pre-trained source model to generate samples close to the source data. Finally, we use Domain Adaptation techniques to align the artificially generated source data with the target domain data, resulting in significant performance improvements of the model on the target domain. Through extensive comparison against several baselines on the standard Office-31, Office-Home, and VisDA benchmarks, we demonstrate the effectiveness of our approach for the SFDA task.

* 9 pages, 6 figures, 4 tables

Active Data Discovery: Mining Unknown Data using Submodular Information Measures

Jun 17, 2022

Suraj Kothawade, Shivang Chopra, Saikat Ghosh, Rishabh Iyer

Abstract:Active Learning is a very common yet powerful framework for iteratively and adaptively sampling subsets of the unlabeled sets with a human in the loop with the goal of achieving labeling efficiency. Most real-world datasets are imbalanced in either classes or slices, and correspondingly, parts of the dataset are rare. As a result, there has been a lot of work in designing active learning approaches for mining these rare data instances. Most approaches assume access to a seed set of instances that contains these rare data instances. However, in the event of more extreme rareness, it is reasonable to assume that these rare data instances (either classes or slices) may not even be present in the seed labeled set, and a critical need for the active learning paradigm is to efficiently discover these rare data instances. In this work, we provide an active data discovery framework which can mine unknown data slices and classes efficiently using the submodular conditional gain and submodular conditional mutual information functions. We provide a general algorithmic framework which works in a number of scenarios including image classification and object detection and works with both rare classes and rare slices present in the unlabeled set. We show significant accuracy and labeling efficiency gains with our approach compared to existing state-of-the-art active learning approaches for actively discovering these rare classes and slices.
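
The submodular conditional gain of a set A given a set P is f(A | P) = f(A ∪ P) − f(P): the value A adds beyond what P already provides. The sketch below uses a toy set-cover function for f and greedy selection to show how conditioning on the already-labeled set steers selection toward unseen concepts. The set-cover choice, the item names, and the concept tags are all hypothetical; the abstract does not specify which submodular functions the paper instantiates.

```python
def coverage(items, concepts_of):
    """Set-cover function f: number of distinct concepts covered by `items`."""
    covered = set()
    for item in items:
        covered |= concepts_of[item]
    return len(covered)


def conditional_gain(candidates, known, concepts_of):
    """f(A | P) = f(A ∪ P) - f(P) for the set-cover function above."""
    union = set(candidates) | set(known)
    return coverage(union, concepts_of) - coverage(known, concepts_of)


def greedy_discover(pool, known, concepts_of, budget):
    """Greedily pick `budget` pool items that maximize the conditional gain."""
    selected = []
    for _ in range(budget):
        remaining = [x for x in pool if x not in selected]
        if not remaining:
            break
        best = max(
            remaining,
            key=lambda x: conditional_gain(selected + [x], known, concepts_of),
        )
        selected.append(best)
    return selected


if __name__ == "__main__":
    # Hypothetical concept tags per data point.
    concepts_of = {
        "seed1": {"car"},
        "seed2": {"car", "road"},
        "u1": {"car"},
        "u2": {"road", "tractor"},
        "u3": {"car", "road"},
        "u4": {"cyclist", "scooter"},
    }
    labeled = ["seed1", "seed2"]
    unlabeled = ["u1", "u2", "u3", "u4"]
    print(greedy_discover(unlabeled, labeled, concepts_of, budget=2))
```

Items that only repeat concepts already in the labeled set have zero conditional gain, so the greedy step skips them in favor of points that introduce something new.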

Open Domain Suggestion Mining Leveraging Fine-Grained Analysis

Jul 11, 2020

Shreya Singal, Tanishq Goel, Shivang Chopra, Sonika Dahiya

Abstract:Suggestion mining tasks are often semantically complex and lack sophisticated methodologies that can be applied to real-world data. The presence of suggestions across a large diversity of domains and the absence of large labelled and balanced datasets render this task particularly challenging to deal with. In an attempt to overcome these challenges, we propose a two-tier pipeline that leverages Discourse Marker based oversampling and fine-grained suggestion mining techniques to retrieve suggestions from online forums. Through extensive comparison on a real-world open-domain suggestion dataset, we demonstrate how the oversampling technique combined with transformer-based fine-grained analysis can beat the state of the art. Additionally, we perform extensive quantitative and qualitative analysis to give construct validity to our proposed pipeline. Finally, we discuss the practical, computational and reproducibility aspects of the deployment of our pipeline across the web.
