Zuxuan Wu

Fudan University

VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks

Dec 24, 2024

Comprehensive Multi-Modal Prototypes are Simple and Effective Classifiers for Vast-Vocabulary Object Detection

Dec 23, 2024

CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation

Dec 05, 2024

Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning

Dec 04, 2024

ForgerySleuth: Empowering Multimodal Large Language Models for Image Manipulation Detection

Nov 29, 2024

StableAnimator: High-Quality Identity-Preserving Human Image Animation

Nov 26, 2024

Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision

Nov 25, 2024

REDUCIO! Generating 1024×1024 Video within 16 Seconds using Extremely Compressed Motion Latents

Nov 20, 2024

Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders

Oct 27, 2024

DreamMesh: Jointly Manipulating and Texturing Triangle Meshes for Text-to-3D Generation

Sep 11, 2024