Shengqiong Wu

On Path to Multimodal Generalist: General-Level and General-Bench

May 07, 2025

VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models

Apr 17, 2025

Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation

Mar 31, 2025

JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization

Mar 30, 2025

Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene

Mar 19, 2025

Universal Scene Graph Generation

Mar 19, 2025

Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

Mar 16, 2025

Combating Multimodal LLM Hallucination via Bottom-up Holistic Reasoning

Dec 15, 2024

PanoSent: A Panoptic Sextuple Extraction Benchmark for Multimodal Conversational Aspect-based Sentiment Analysis

Aug 18, 2024

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

Jun 27, 2024