Title: OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction

Mar 05, 2025

Huang Huang, Fangchen Liu, Letian Fu, Tingfan Wu, Mustafa Mukadam, Jitendra Malik, Ken Goldberg, Pieter Abbeel

Figures 1-4 for OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction

Abstract: Vision-Language-Action (VLA) models aim to predict robotic actions based on visual observations and language instructions. Existing approaches require fine-tuning pre-trained vision-language models (VLMs) as visual and language features are independently fed into downstream policies, degrading the pre-trained semantic alignments. We propose OTTER, a novel VLA architecture that leverages these existing alignments through explicit, text-aware visual feature extraction. Instead of processing all visual features, OTTER selectively extracts and passes only task-relevant visual features that are semantically aligned with the language instruction to the policy transformer. This allows OTTER to keep the pre-trained vision-language encoders frozen. Thereby, OTTER preserves and utilizes the rich semantic understanding learned from large-scale pre-training, enabling strong zero-shot generalization capabilities. In simulation and real-world experiments, OTTER significantly outperforms existing VLA models, demonstrating strong zero-shot generalization to novel objects and environments. Video, code, checkpoints, and dataset: https://ottervla.github.io/.
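
The idea of passing only instruction-aligned visual features to the policy can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cosine`, `softmax`, `text_aware_pool`), the cosine-similarity scoring, the softmax weighting, and the `temperature` value are all assumptions chosen for illustration, and the toy 2-D vectors stand in for embeddings that a frozen vision-language encoder would produce.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def text_aware_pool(text_tokens, patch_feats, temperature=0.1):
    """Hypothetical helper: for each text token, return a similarity-weighted
    average of the visual patch features (one pooled vector per text token)."""
    dim = len(patch_feats[0])
    pooled = []
    for t in text_tokens:
        # Patches that align with this text token receive most of the weight.
        weights = softmax([cosine(t, p) for p in patch_feats], temperature)
        pooled.append(
            [sum(w * p[d] for w, p in zip(weights, patch_feats)) for d in range(dim)]
        )
    return pooled

# Toy stand-ins for frozen-encoder outputs (assumed values, not real embeddings).
text_tokens = [[1.0, 0.0]]                            # one instruction token
patch_feats = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # three image patches
pooled = text_aware_pool(text_tokens, patch_feats)
```

In this toy case the pooled vector is dominated by the first patch, the only one aligned with the text token, so a downstream policy would see mostly task-relevant visual content. Because the encoders are only read from, nothing in this step requires updating their weights.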
