Longyin Wen

Vidi: Large Multimodal Models for Video Understanding and Editing

Apr 22, 2025

Where do Large Vision-Language Models Look at when Answering Questions?

Mar 18, 2025

Rethinking Homogeneity of Vision and Text Tokens in Large Vision-and-Language Models

Feb 04, 2025

Multi-Reward as Condition for Instruction-based Image Editing

Nov 06, 2024

DiffLM: Controllable Synthetic Data Generation via Diffusion Language Models

Nov 05, 2024

AIPO: Improving Training Objective for Iterative Preference Optimization

Sep 13, 2024

Beyond Raw Videos: Understanding Edited Videos with Large Multimodal Model

Jun 15, 2024

CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts

May 09, 2024

Edit3K: Universal Representation Learning for Video Editing Components

Mar 24, 2024

Accurate and Fast Compressed Video Captioning

Sep 22, 2023