Bin Wen

VLM as Policy: Common-Law Content Moderation Framework for Short Video Platform

Apr 21, 2025
Via arXiv

InstructEngine: Instruction-driven Text-to-Image Alignment

Apr 14, 2025
Via arXiv

Decoupling Contrastive Decoding: Robust Hallucination Mitigation in Multimodal Large Language Models

Apr 09, 2025
Via arXiv

Wan: Open and Advanced Large-Scale Video Generative Models

Mar 26, 2025
Via arXiv

TIME: Temporal-sensitive Multi-dimensional Instruction Tuning and Benchmarking for Video-LLMs

Mar 13, 2025
Via arXiv

Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding

Mar 12, 2025
Via arXiv

RecipeGen: A Benchmark for Real-World Recipe Image Generation

Mar 07, 2025
Via arXiv

What Is a Good Caption? A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Coverage of MLLMs

Feb 19, 2025
Via arXiv

Kwai-STaR: Transform LLMs into State-Transition Reasoners

Nov 07, 2024
Via arXiv

EVLM: An Efficient Vision-Language Model for Visual Understanding

Jul 19, 2024
Via arXiv