Jitesh Jain

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

Jan 15, 2026
Via arXiv

SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning

Dec 15, 2025
Via arXiv

Person Recognition at Altitude and Range: Fusion of Face, Body Shape and Gait

May 07, 2025
Via arXiv

Slow-Fast Architecture for Video Multi-Modal Large Language Models

Apr 02, 2025
Via arXiv

OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation

Dec 12, 2024
Via arXiv

CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts

May 09, 2024
Via arXiv

Benchmarking Object Detectors with COCO: A New Path Forward

Mar 27, 2024
Via arXiv

VCoder: Versatile Vision Encoders for Multimodal Large Language Models

Dec 21, 2023
Via arXiv

Matting Anything

Jun 08, 2023
Via arXiv

OneFormer: One Transformer to Rule Universal Image Segmentation

Nov 10, 2022
Via arXiv