Abstract:Vision-Language Models (VLMs) lag behind Large Language Models due to the scarcity of annotated datasets, as creating paired visual-textual annotations is labor-intensive and expensive. To address this bottleneck, we introduce SAM2Auto, the first fully automated annotation pipeline for video datasets requiring no human intervention or dataset-specific training. Our approach consists of two key components: SMART-OD, a robust object detection system that combines automatic mask generation with open-world object detection capabilities, and FLASH (Frame-Level Annotation and Segmentation Handler), a multi-object real-time video instance segmentation (VIS) that maintains consistent object identification across video frames even with intermittent detection gaps. Unlike existing open-world detection methods that require frame-specific hyperparameter tuning and suffer from numerous false positives, our system employs statistical approaches to minimize detection errors while ensuring consistent object tracking throughout entire video sequences. Extensive experimental validation demonstrates that SAM2Auto achieves comparable accuracy to manual annotation while dramatically reducing annotation time and eliminating labor costs. The system successfully handles diverse datasets without requiring retraining or extensive parameter adjustments, making it a practical solution for large-scale dataset creation. Our work establishes a new baseline for automated video annotation and provides a pathway for accelerating VLM development by addressing the fundamental dataset bottleneck that has constrained progress in vision-language understanding.