Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Olga Russakovsky

Motion Attribution for Video Generation

Jan 13, 2026

Xindi Wu, Despoina Paschalidou, Jun Gao, Antonio Torralba, Laura Leal-Taixé, Olga Russakovsky, Sanja Fidler, Jonathan Lorraine

Abstract:Despite the rapid progress of video generation models, the role of data in influencing motion is poorly understood. We present Motive (MOTIon attribution for Video gEneration), a motion-centric, gradient-based data attribution framework that scales to modern, large, high-quality video datasets and models. We use this to study which fine-tuning clips improve or degrade temporal dynamics. Motive isolates temporal dynamics from static appearance via motion-weighted loss masks, yielding efficient and scalable motion-specific influence computation. On text-to-video models, Motive identifies clips that strongly affect motion and guides data curation that improves temporal consistency and physical plausibility. With Motive-selected high-influence data, our method improves both motion smoothness and dynamic degree on VBench, achieving a 74.1% human preference win rate compared with the pretrained base model. To our knowledge, this is the first framework to attribute motion rather than visual appearance in video generative models and to use it to curate fine-tuning data.

* See the project website at https://research.nvidia.com/labs/sil/projects/MOTIVE/

Via

Access Paper or Ask Questions

Seeing Beyond the Scene: Analyzing and Mitigating Background Bias in Action Recognition

Dec 17, 2025

Ellie Zhou, Jihoon Chung, Olga Russakovsky

Figure 1 for Seeing Beyond the Scene: Analyzing and Mitigating Background Bias in Action Recognition

Figure 2 for Seeing Beyond the Scene: Analyzing and Mitigating Background Bias in Action Recognition

Figure 3 for Seeing Beyond the Scene: Analyzing and Mitigating Background Bias in Action Recognition

Figure 4 for Seeing Beyond the Scene: Analyzing and Mitigating Background Bias in Action Recognition

Abstract:Human action recognition models often rely on background cues rather than human movement and pose to make predictions, a behavior known as background bias. We present a systematic analysis of background bias across classification models, contrastive text-image pretrained models, and Video Large Language Models (VLLM) and find that all exhibit a strong tendency to default to background reasoning. Next, we propose mitigation strategies for classification models and show that incorporating segmented human input effectively decreases background bias by 3.78%. Finally, we explore manual and automated prompt tuning for VLLMs, demonstrating that prompt design can steer predictions towards human-focused reasoning by 9.85%.

* Accepted to NeurIPS 2025 Workshops: SPACE in Vision, Language, and Embodied AI; and What Makes a Good Video: Next Practices in Video Generation and Evaluation

Via

Access Paper or Ask Questions

D2D: Detector-to-Differentiable Critic for Improved Numeracy in Text-to-Image Generation

Oct 22, 2025

Nobline Yoo, Olga Russakovsky, Ye Zhu

Figure 1 for D2D: Detector-to-Differentiable Critic for Improved Numeracy in Text-to-Image Generation

Figure 2 for D2D: Detector-to-Differentiable Critic for Improved Numeracy in Text-to-Image Generation

Figure 3 for D2D: Detector-to-Differentiable Critic for Improved Numeracy in Text-to-Image Generation

Figure 4 for D2D: Detector-to-Differentiable Critic for Improved Numeracy in Text-to-Image Generation

Abstract:Text-to-image (T2I) diffusion models have achieved strong performance in semantic alignment, yet they still struggle with generating the correct number of objects specified in prompts. Existing approaches typically incorporate auxiliary counting networks as external critics to enhance numeracy. However, since these critics must provide gradient guidance during generation, they are restricted to regression-based models that are inherently differentiable, thus excluding detector-based models with superior counting ability, whose count-via-enumeration nature is non-differentiable. To overcome this limitation, we propose Detector-to-Differentiable (D2D), a novel framework that transforms non-differentiable detection models into differentiable critics, thereby leveraging their superior counting ability to guide numeracy generation. Specifically, we design custom activation functions to convert detector logits into soft binary indicators, which are then used to optimize the noise prior at inference time with pre-trained T2I models. Our extensive experiments on SDXL-Turbo, SD-Turbo, and Pixart-DMD across four benchmarks of varying complexity (low-density, high-density, and multi-object scenarios) demonstrate consistent and substantial improvements in object counting accuracy (e.g., boosting up to 13.7% on D2D-Small, a 400-prompt, low-density benchmark), with minimal degradation in overall image quality and computational overhead.

* 24 pages, 14 figures

Via

Access Paper or Ask Questions

Actions as Language: Fine-Tuning VLMs into VLAs Without Catastrophic Forgetting

Sep 26, 2025

Asher J. Hancock, Xindi Wu, Lihan Zha, Olga Russakovsky, Anirudha Majumdar

Figure 1 for Actions as Language: Fine-Tuning VLMs into VLAs Without Catastrophic Forgetting

Figure 2 for Actions as Language: Fine-Tuning VLMs into VLAs Without Catastrophic Forgetting

Figure 3 for Actions as Language: Fine-Tuning VLMs into VLAs Without Catastrophic Forgetting

Figure 4 for Actions as Language: Fine-Tuning VLMs into VLAs Without Catastrophic Forgetting

Abstract:Fine-tuning vision-language models (VLMs) on robot teleoperation data to create vision-language-action (VLA) models is a promising paradigm for training generalist policies, but it suffers from a fundamental tradeoff: learning to produce actions often diminishes the VLM's foundational reasoning and multimodal understanding, hindering generalization to novel scenarios, instruction following, and semantic understanding. We argue that this catastrophic forgetting is due to a distribution mismatch between the VLM's internet-scale pretraining corpus and the robotics fine-tuning data. Inspired by this observation, we introduce VLM2VLA: a VLA training paradigm that first resolves this mismatch at the data level by representing low-level actions with natural language. This alignment makes it possible to train VLAs solely with Low-Rank Adaptation (LoRA), thereby minimally modifying the VLM backbone and averting catastrophic forgetting. As a result, the VLM can be fine-tuned on robot teleoperation data without fundamentally altering the underlying architecture and without expensive co-training on internet-scale VLM datasets. Through extensive Visual Question Answering (VQA) studies and over 800 real-world robotics experiments, we demonstrate that VLM2VLA preserves the VLM's core capabilities, enabling zero-shot generalization to novel tasks that require open-world semantic reasoning and multilingual instruction following.

Via

Access Paper or Ask Questions

The Impact of Coreset Selection on Spurious Correlations and Group Robustness

Jul 15, 2025

Amaya Dharmasiri, William Yang, Polina Kirichenko, Lydia Liu, Olga Russakovsky

Figure 1 for The Impact of Coreset Selection on Spurious Correlations and Group Robustness

Figure 2 for The Impact of Coreset Selection on Spurious Correlations and Group Robustness

Figure 3 for The Impact of Coreset Selection on Spurious Correlations and Group Robustness

Figure 4 for The Impact of Coreset Selection on Spurious Correlations and Group Robustness

Abstract:Coreset selection methods have shown promise in reducing the training data size while maintaining model performance for data-efficient machine learning. However, as many datasets suffer from biases that cause models to learn spurious correlations instead of causal features, it is important to understand whether and how dataset reduction methods may perpetuate, amplify, or mitigate these biases. In this work, we conduct the first comprehensive analysis of the implications of data selection on the spurious bias levels of the selected coresets and the robustness of downstream models trained on them. We use an extensive experimental setting spanning ten different spurious correlations benchmarks, five score metrics to characterize sample importance/ difficulty, and five data selection policies across a broad range of coreset sizes. Thereby, we unravel a series of nontrivial nuances in interactions between sample difficulty and bias alignment, as well as dataset bias and resultant model robustness. For example, we find that selecting coresets using embedding-based sample characterization scores runs a comparatively lower risk of inadvertently exacerbating bias than selecting using characterizations based on learning dynamics. Most importantly, our analysis reveals that although some coreset selection methods could achieve lower bias levels by prioritizing difficult samples, they do not reliably guarantee downstream robustness.

* 10 pages, 9 additional pages for Appendix

Via

Access Paper or Ask Questions

Dynamic Diffusion Schrödinger Bridge in Astrophysical Observational Inversions

Jun 11, 2025

Ye Zhu, Duo Xu, Zhiwei Deng, Jonathan C. Tan, Olga Russakovsky

Abstract:We study Diffusion Schr\"odinger Bridge (DSB) models in the context of dynamical astrophysical systems, specifically tackling observational inverse prediction tasks within Giant Molecular Clouds (GMCs) for star formation. We introduce the Astro-DSB model, a variant of DSB with the pairwise domain assumption tailored for astrophysical dynamics. By investigating its learning process and prediction performance in both physically simulated data and in real observations (the Taurus B213 data), we present two main takeaways. First, from the astrophysical perspective, our proposed paired DSB method improves interpretability, learning efficiency, and prediction performance over conventional astrostatistical and other machine learning methods. Second, from the generative modeling perspective, probabilistic generative modeling reveals improvements over discriminative pixel-to-pixel modeling in Out-Of-Distribution (OOD) testing cases of physical simulations with unseen initial conditions and different dominant physical processes. Our study expands research into diffusion models beyond the traditional visual synthesis application and provides evidence of the models' learning abilities beyond pure data statistics, paving a path for future physics-aware generative models which can align dynamics between machine learning and real (astro)physical systems.

* Preprint. Code will be available at https://github.com/L-YeZhu/AstroDSB

Via

Access Paper or Ask Questions

COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning

Apr 30, 2025

Xindi Wu, Hee Seung Hwang, Polina Kirichenko, Olga Russakovsky

Abstract:Multimodal Large Language Models (MLLMs) excel at simple vision-language tasks but struggle when faced with complex tasks that require multiple capabilities, such as simultaneously recognizing objects, counting them, and understanding their spatial relationships. This might be partially the result of the fact that Visual Instruction Tuning (VIT), a critical training step for MLLMs, has traditionally focused on scaling data volume, but not the compositional complexity of training examples. We propose COMPACT (COMPositional Atomic-to-complex visual Capability Tuning), which generates a training dataset explicitly controlling for the compositional complexity of the training examples. The data from COMPACT allows MLLMs to train on combinations of atomic capabilities to learn complex capabilities more efficiently. Across all benchmarks, COMPACT achieves comparable performance to the LLaVA-665k VIT while using less than 10% of its data budget, and even outperforms it on several, especially those involving complex multi-capability tasks. For example, COMPACT achieves substantial 83.3% improvement on MMStar and 94.0% improvement on MM-Vet compared to the full-scale VIT on particularly complex questions that require four or more atomic capabilities. COMPACT offers a scalable, data-efficient, visual compositional tuning recipe to improve on complex visual-language tasks.

* 17 pages, 13 figures

Via

Access Paper or Ask Questions

Interactivity x Explainability: Toward Understanding How Interactivity Can Improve Computer Vision Explanations

Apr 14, 2025

Indu Panigrahi, Sunnie S. Y. Kim, Amna Liaqat, Rohan Jinturkar, Olga Russakovsky, Ruth Fong, Parastoo Abtahi

Abstract:Explanations for computer vision models are important tools for interpreting how the underlying models work. However, they are often presented in static formats, which pose challenges for users, including information overload, a gap between semantic and pixel-level information, and limited opportunities for exploration. We investigate interactivity as a mechanism for tackling these issues in three common explanation types: heatmap-based, concept-based, and prototype-based explanations. We conducted a study (N=24), using a bird identification task, involving participants with diverse technical and domain expertise. We found that while interactivity enhances user control, facilitates rapid convergence to relevant information, and allows users to expand their understanding of the model and explanation, it also introduces new challenges. To address these, we provide design recommendations for interactive computer vision explanations, including carefully selected default views, independent input controls, and constrained output spaces.

* To appear in Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA '25)

Via

Access Paper or Ask Questions

Attention IoU: Examining Biases in CelebA using Attention Maps

Mar 26, 2025

Aaron Serianni, Tyler Zhu, Olga Russakovsky, Vikram V. Ramaswamy

Figure 1 for Attention IoU: Examining Biases in CelebA using Attention Maps

Figure 2 for Attention IoU: Examining Biases in CelebA using Attention Maps

Figure 3 for Attention IoU: Examining Biases in CelebA using Attention Maps

Figure 4 for Attention IoU: Examining Biases in CelebA using Attention Maps

Abstract:Computer vision models have been shown to exhibit and amplify biases across a wide array of datasets and tasks. Existing methods for quantifying bias in classification models primarily focus on dataset distribution and model performance on subgroups, overlooking the internal workings of a model. We introduce the Attention-IoU (Attention Intersection over Union) metric and related scores, which use attention maps to reveal biases within a model's internal representations and identify image features potentially causing the biases. First, we validate Attention-IoU on the synthetic Waterbirds dataset, showing that the metric accurately measures model bias. We then analyze the CelebA dataset, finding that Attention-IoU uncovers correlations beyond accuracy disparities. Through an investigation of individual attributes through the protected attribute of Male, we examine the distinct ways biases are represented in CelebA. Lastly, by subsampling the training set to change attribute correlations, we demonstrate that Attention-IoU reveals potential confounding variables not present in dataset labels.

* To appear in CVPR 2025. Code and data is available at https://github.com/aaronserianni/attention-iou . 15 pages, 14 figures, including appendix

Via

Access Paper or Ask Questions

Fostering Appropriate Reliance on Large Language Models: The Role of Explanations, Sources, and Inconsistencies

Feb 12, 2025

Sunnie S. Y. Kim, Jennifer Wortman Vaughan, Q. Vera Liao, Tania Lombrozo, Olga Russakovsky

Figure 1 for Fostering Appropriate Reliance on Large Language Models: The Role of Explanations, Sources, and Inconsistencies

Figure 2 for Fostering Appropriate Reliance on Large Language Models: The Role of Explanations, Sources, and Inconsistencies

Figure 3 for Fostering Appropriate Reliance on Large Language Models: The Role of Explanations, Sources, and Inconsistencies

Figure 4 for Fostering Appropriate Reliance on Large Language Models: The Role of Explanations, Sources, and Inconsistencies

Abstract:Large language models (LLMs) can produce erroneous responses that sound fluent and convincing, raising the risk that users will rely on these responses as if they were correct. Mitigating such overreliance is a key challenge. Through a think-aloud study in which participants use an LLM-infused application to answer objective questions, we identify several features of LLM responses that shape users' reliance: explanations (supporting details for answers), inconsistencies in explanations, and sources. Through a large-scale, pre-registered, controlled experiment (N=308), we isolate and study the effects of these features on users' reliance, accuracy, and other measures. We find that the presence of explanations increases reliance on both correct and incorrect responses. However, we observe less reliance on incorrect responses when sources are provided or when explanations exhibit inconsistencies. We discuss the implications of these findings for fostering appropriate reliance on LLMs.

* CHI 2025. This version includes the appendix

Via

Access Paper or Ask Questions