Image Captioning


Image captioning is the task of generating a textual description of an image. It combines Computer Vision (CV), which interprets the visual content, with Natural Language Processing (NLP), which produces the caption text.
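
As a rough illustration of the two-part structure in this definition, the sketch below separates a "vision" stage from a "language" stage. It is a toy under stated assumptions, not how any of the listed papers work: the palette, the function names, and the sentence template are all hypothetical, and a real system would use a learned image encoder and a language model in place of the color count and the template.

```python
from collections import Counter

# Hypothetical palette for this toy example: color name -> RGB value.
PALETTE = {
    "red": (255, 0, 0),
    "green": (0, 255, 0),
    "blue": (0, 0, 255),
    "white": (255, 255, 255),
    "black": (0, 0, 0),
}


def nearest_color(pixel):
    """Return the palette name closest to an RGB pixel (squared distance)."""
    return min(
        PALETTE,
        key=lambda name: sum((a - b) ** 2 for a, b in zip(pixel, PALETTE[name])),
    )


def extract_features(image):
    """Toy 'vision' stage: reduce an image (rows of RGB tuples) to its dominant color."""
    counts = Counter(nearest_color(p) for row in image for p in row)
    if not counts:
        raise ValueError("image must contain at least one pixel")
    color, count = counts.most_common(1)[0]
    return {"dominant_color": color, "coverage": count / sum(counts.values())}


def generate_caption(features):
    """Toy 'language' stage: turn the extracted features into a sentence."""
    amount = "entirely" if features["coverage"] == 1.0 else "mostly"
    return f"An image that is {amount} {features['dominant_color']}."


def caption(image):
    """Run both stages: image -> features -> text."""
    return generate_caption(extract_features(image))


if __name__ == "__main__":
    # A 2x2 image: three reddish pixels and one bluish pixel.
    image = [
        [(250, 10, 5), (240, 0, 0)],
        [(255, 20, 20), (10, 10, 240)],
    ]
    print(caption(image))  # An image that is mostly red.
```

The point of the split is only that the caption text is generated from an intermediate representation of the image rather than from the pixels directly.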

Describe Anything: Detailed Localized Image and Video Captioning

Apr 22, 2025
Via arXiv

Vision language models are unreliable at trivial spatial cognition

Apr 22, 2025
Via arXiv

Zero-Shot, But at What Cost? Unveiling the Hidden Overhead of MILS's LLM-CLIP Framework for Image Captioning

Apr 21, 2025
Via arXiv

InstaRevive: One-Step Image Enhancement via Dynamic Score Matching

Apr 22, 2025
Via arXiv

Detecting and Understanding Hateful Contents in Memes Through Captioning and Visual Question-Answering

Apr 23, 2025
Via arXiv

Decoupled Global-Local Alignment for Improving Compositional Understanding

Apr 23, 2025
Via arXiv

Advanced Chest X-Ray Analysis via Transformer-Based Image Descriptors and Cross-Model Attention Mechanism

Apr 23, 2025
Via arXiv

Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs

Apr 24, 2025
Via arXiv

Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding

Apr 20, 2025
Via arXiv

Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation

Apr 22, 2025
Via arXiv