Zhenheng Yang

UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning

May 29, 2025

DiCo: Revitalizing ConvNets for Scalable and Efficient Diffusion Modeling

May 16, 2025

Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model

Apr 11, 2025

Cream of the Crop: Harvesting Rich, Scalable and Transferable Multi-Modal Data for Instruction Fine-Tuning

Mar 17, 2025

Long Context Tuning for Video Generation

Mar 13, 2025

UniMoD: Efficient Unified Multimodal Transformers with Mixture-of-Depths

Feb 10, 2025

STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution

Jan 06, 2025

Parallelized Autoregressive Visual Generation

Dec 19, 2024

InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption

Dec 12, 2024

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Aug 22, 2024