Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Anelia Angelova

Unlocking Multi-Spectral Data for Multi-Modal Models with Guided Inputs and Chain-of-Thought Reasoning

Apr 22, 2026

Dahun Kim, Ganesh Satish Mallya, Anelia Angelova

Abstract:Multi-spectral imagery is a valuable input signal for Remote Sensing applications, such as land-use and land-cover classification and environmental monitoring. However, generalist Large Multi-modal Models (LMMs) are typically trained on RGB images, limiting their applicability to the RGB domain. At the same time, training multi-spectral multi-modal models is expensive and produces uniquely specialized models. To address this, we propose a novel training-free approach that introduces multi-spectral data within the inference pipeline of standard RGB-only LMMs, allowing large gains in performance. Our approach leverages the LMMs' understanding of the visual space by adapting non-RGB inputs to that space and injecting domain-specific information and Chain-of-Thought reasoning as instructions. We demonstrate this with the Gemini 2.5 model and observe strong Zero-Shot performance gains on popular Remote Sensing benchmarks. These results highlight the potential for geospatial professionals to leverage powerful generalist models for specialized sensor inputs, benefiting from rich reasoning capabilities grounded in specialized data.

* Accepted to IGARSS 2026

Via

Access Paper or Ask Questions

CURVE: A Benchmark for Cultural and Multilingual Long Video Reasoning

Jan 15, 2026

Darshan Singh, Arsha Nagrani, Kawshik Manikantan, Harman Singh, Dinesh Tewari, Tobias Weyand, Cordelia Schmid, Anelia Angelova, Shachi Dave

Abstract:Recent advancements in video models have shown tremendous progress, particularly in long video understanding. However, current benchmarks predominantly feature western-centric data and English as the dominant language, introducing significant biases in evaluation. To address this, we introduce CURVE (Cultural Understanding and Reasoning in Video Evaluation), a challenging benchmark for multicultural and multilingual video reasoning. CURVE comprises high-quality, entirely human-generated annotations from diverse, region-specific cultural videos across 18 global locales. Unlike prior work that relies on automatic translations, CURVE provides complex questions, answers, and multi-step reasoning steps, all crafted in native languages. Making progress on CURVE requires a deeply situated understanding of visual cultural context. Furthermore, we leverage CURVE's reasoning traces to construct evidence-based graphs and propose a novel iterative strategy using these graphs to identify fine-grained errors in reasoning. Our evaluations reveal that SoTA Video-LLMs struggle significantly, performing substantially below human-level accuracy, with errors primarily stemming from the visual perception of cultural elements. CURVE will be publicly available under https://github.com/google-deepmind/neptune?tab=readme-ov-file\#minerva-cultural

Via

Access Paper or Ask Questions

Time-Scaling State-Space Models for Dense Video Captioning

Sep 03, 2025

AJ Piergiovanni, Ganesh Satish Mallya, Dahun Kim, Anelia Angelova

Abstract:Dense video captioning is a challenging video understanding task which aims to simultaneously segment the video into a sequence of meaningful consecutive events and to generate detailed captions to accurately describe each event. Existing methods often encounter difficulties when working with the long videos associated with dense video captioning, due to the computational complexity and memory limitations. Furthermore, traditional approaches require the entire video as input, in order to produce an answer, which precludes online processing of the video. We address these challenges by time-scaling State-Space Models (SSMs) to even longer sequences than before. Our approach, State-Space Models with Transfer State, combines both the long-sequence and recurrent properties of SSMs and addresses the main limitation of SSMs which are otherwise not able to sustain their state for very long contexts, effectively scaling SSMs further in time. The proposed model is particularly suitable for generating captions on-the-fly, in an online or streaming manner, without having to wait for the full video to be processed, which is more beneficial in practice. When applied to dense video captioning, our approach scales well with video lengths and uses 7x fewer FLOPs.

* BMVC 2025

Via

Access Paper or Ask Questions

VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models

Apr 10, 2025

Dahun Kim, AJ Piergiovanni, Ganesh Mallya, Anelia Angelova

Figure 1 for VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models

Figure 2 for VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models

Figure 3 for VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models

Figure 4 for VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models

Abstract:We introduce VideoComp, a benchmark and learning framework for advancing video-text compositionality understanding, aimed at improving vision-language models (VLMs) in fine-grained temporal alignment. Unlike existing benchmarks focused on static image-text compositionality or isolated single-event videos, our benchmark targets alignment in continuous multi-event videos. Leveraging video-text datasets with temporally localized event captions (e.g. ActivityNet-Captions, YouCook2), we construct two compositional benchmarks, ActivityNet-Comp and YouCook2-Comp. We create challenging negative samples with subtle temporal disruptions such as reordering, action word replacement, partial captioning, and combined disruptions. These benchmarks comprehensively test models' compositional sensitivity across extended, cohesive video-text sequences. To improve model performance, we propose a hierarchical pairwise preference loss that strengthens alignment with temporally accurate pairs and gradually penalizes increasingly disrupted ones, encouraging fine-grained compositional learning. To mitigate the limited availability of densely annotated video data, we introduce a pretraining strategy that concatenates short video-caption pairs to simulate multi-event sequences. We evaluate video-text foundational models and large multimodal models (LMMs) on our benchmark, identifying both strengths and areas for improvement in compositionality. Overall, our work provides a comprehensive framework for evaluating and enhancing model capabilities in achieving fine-grained, temporally coherent video-text alignment.

* CVPR 2025, project page at https://github.com/google-deepmind/video_comp

Via

Access Paper or Ask Questions

Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning

Nov 22, 2024

AJ Piergiovanni, Dahun Kim, Michael S. Ryoo, Isaac Noble, Anelia Angelova

Figure 1 for Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning

Figure 2 for Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning

Figure 3 for Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning

Figure 4 for Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning

Abstract:Generating automatic dense captions for videos that accurately describe their contents remains a challenging area of research. Most current models require processing the entire video at once. Instead, we propose an efficient, online approach which outputs frequent, detailed and temporally aligned captions, without access to future frames. Our model uses a novel autoregressive factorized decoding architecture, which models the sequence of visual features for each time segment, outputting localized descriptions and efficiently leverages the context from the previous video segments. This allows the model to output frequent, detailed captions to more comprehensively describe the video, according to its actual local content, rather than mimic the training data. Second, we propose an optimization for efficient training and inference, which enables scaling to longer videos. Our approach shows excellent performance compared to both offline and online methods, and uses 20\% less compute. The annotations produced are much more comprehensive and frequent, and can further be utilized in automatic video tagging and in large-scale video data harvesting.

Via

Access Paper or Ask Questions

3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation

Jan 04, 2024

Zihao Xiao, Longlong Jing, Shangxuan Wu, Alex Zihao Zhu, Jingwei Ji, Chiyu Max Jiang, Wei-Chih Hung, Thomas Funkhouser, Weicheng Kuo, Anelia Angelova(+2 more)

Figure 1 for 3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation

Figure 2 for 3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation

Figure 3 for 3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation

Figure 4 for 3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation

Abstract:3D panoptic segmentation is a challenging perception task, which aims to predict both semantic and instance annotations for 3D points in a scene. Although prior 3D panoptic segmentation approaches have achieved great performance on closed-set benchmarks, generalizing to novel categories remains an open problem. For unseen object categories, 2D open-vocabulary segmentation has achieved promising results that solely rely on frozen CLIP backbones and ensembling multiple classification outputs. However, we find that simply extending these 2D models to 3D does not achieve good performance due to poor per-mask classification quality on novel categories. In this paper, we propose the first method to tackle 3D open-vocabulary panoptic segmentation. Our model takes advantage of the fusion between learnable LiDAR features and dense frozen vision CLIP features, using a single classification head to make predictions for both base and novel classes. To further improve the classification performance on novel classes and leverage the CLIP model, we propose two novel loss functions: object-level distillation loss and voxel-level distillation loss. Our experiments on the nuScenes and SemanticKITTI datasets show that our method outperforms strong baselines by a large margin.

Via

Access Paper or Ask Questions

Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities

Nov 13, 2023

AJ Piergiovanni, Isaac Noble, Dahun Kim, Michael S. Ryoo, Victor Gomes, Anelia Angelova

Figure 1 for Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities

Figure 2 for Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities

Figure 3 for Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities

Figure 4 for Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities

Abstract:One of the main challenges of multimodal learning is the need to combine heterogeneous modalities (e.g., video, audio, text). For example, video and audio are obtained at much higher rates than text and are roughly aligned in time. They are often not synchronized with text, which comes as a global context, e.g., a title, or a description. Furthermore, video and audio inputs are of much larger volumes, and grow as the video length increases, which naturally requires more compute dedicated to these modalities and makes modeling of long-range dependencies harder. We here decouple the multimodal modeling, dividing it into separate, focused autoregressive models, processing the inputs according to the characteristics of the modalities. We propose a multimodal model, called Mirasol3B, consisting of an autoregressive component for the time-synchronized modalities (audio and video), and an autoregressive component for the context modalities which are not necessarily aligned in time but are still sequential. To address the long-sequences of the video-audio inputs, we propose to further partition the video and audio sequences in consecutive snippets and autoregressively process their representations. To that end, we propose a Combiner mechanism, which models the audio-video information jointly within a timeframe. The Combiner learns to extract audio and video features from raw spatio-temporal signals, and then learns to fuse these features producing compact but expressive representations per snippet. Our approach achieves the state-of-the-art on well established multimodal benchmarks, outperforming much larger models. It effectively addresses the high computational demand of media inputs by both learning compact representations, controlling the sequence length of the audio-video feature representations, and modeling their dependencies in time.

Via

Access Paper or Ask Questions

Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection

Sep 29, 2023

Dahun Kim, Anelia Angelova, Weicheng Kuo

Figure 1 for Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection

Figure 2 for Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection

Figure 3 for Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection

Figure 4 for Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection

Abstract:We present a new open-vocabulary detection approach based on detection-oriented image-text pretraining to bridge the gap between image-level pretraining and open-vocabulary object detection. At the pretraining phase, we replace the commonly used classification architecture with the detector architecture, which better serves the region-level recognition needs of detection by enabling the detector heads to learn from noisy image-text pairs. Using only standard contrastive loss and no pseudo-labeling, our approach is a simple yet effective extension of the contrastive learning method to learn emergent object-semantic cues. In addition, we propose a shifted-window learning approach upon window attention to make the backbone representation more robust, translation-invariant, and less biased by the window pattern. On the popular LVIS open-vocabulary detection benchmark, our approach sets a new state of the art of 40.4 mask AP$_r$ using the common ViT-L backbone, significantly outperforming the best existing approach by +6.5 mask AP$_r$ at system level. On the COCO benchmark, we achieve very competitive 40.8 novel AP without pseudo labeling or weak supervision. In addition, we evaluate our approach on the transfer detection setup, where ours outperforms the baseline significantly. Visualization reveals emerging object locality from the pretraining recipes compared to the baseline. Code and models will be publicly released.

* Tech report

Via

Access Paper or Ask Questions

Contrastive Feature Masking Open-Vocabulary Vision Transformer

Sep 02, 2023

Dahun Kim, Anelia Angelova, Weicheng Kuo

Figure 1 for Contrastive Feature Masking Open-Vocabulary Vision Transformer

Figure 2 for Contrastive Feature Masking Open-Vocabulary Vision Transformer

Figure 3 for Contrastive Feature Masking Open-Vocabulary Vision Transformer

Figure 4 for Contrastive Feature Masking Open-Vocabulary Vision Transformer

Abstract:We present Contrastive Feature Masking Vision Transformer (CFM-ViT) - an image-text pretraining methodology that achieves simultaneous learning of image- and region-level representation for open-vocabulary object detection (OVD). Our approach combines the masked autoencoder (MAE) objective into the contrastive learning objective to improve the representation for localization tasks. Unlike standard MAE, we perform reconstruction in the joint image-text embedding space, rather than the pixel space as is customary with the classical MAE method, which causes the model to better learn region-level semantics. Moreover, we introduce Positional Embedding Dropout (PED) to address scale variation between image-text pretraining and detection finetuning by randomly dropping out the positional embeddings during pretraining. PED improves detection performance and enables the use of a frozen ViT backbone as a region classifier, preventing the forgetting of open-vocabulary knowledge during detection finetuning. On LVIS open-vocabulary detection benchmark, CFM-ViT achieves a state-of-the-art 33.9 AP$r$, surpassing the best approach by 7.6 points and achieves better zero-shot detection transfer. Finally, CFM-ViT acquires strong image-level representation, outperforming the state of the art on 8 out of 12 metrics on zero-shot image-text retrieval benchmarks.

* Accepted to ICCV 2023

Via

Access Paper or Ask Questions

Diversifying Joint Vision-Language Tokenization Learning

Jun 15, 2023

Vardaan Pahuja, AJ Piergiovanni, Anelia Angelova

Abstract:Building joint representations across images and text is an essential step for tasks such as Visual Question Answering and Video Question Answering. In this work, we find that the representations must not only jointly capture features from both modalities but should also be diverse for better generalization performance. To this end, we propose joint vision-language representation learning by diversifying the tokenization learning process, enabling tokens that are sufficiently disentangled from each other to be learned from both modalities. We observe that our approach outperforms the baseline models in a majority of settings and is competitive with state-of-the-art methods.

* Accepted to Transformers for Vision (T4V) workshop, CVPR 2023; 7 pages, 5 figures

Via

Access Paper or Ask Questions