Rongyao Fang

FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark

Sep 11, 2025
Via arXiv

T2I-ReasonBench: Benchmarking Reasoning-Informed Text-to-Image Generation

Aug 24, 2025
Via arXiv

CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms

May 22, 2025
Via arXiv

GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning

May 22, 2025
Via arXiv

SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving

May 22, 2025
Via arXiv

GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing

Mar 13, 2025
Via arXiv

StreamChat: Chatting with Streaming Video

Dec 11, 2024
Via arXiv

PUMA: Empowering Unified MLLM with Multi-granular Visual Generation

Oct 17, 2024
Via arXiv

FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis

Mar 19, 2024
Via arXiv

InstructSeq: Unifying Vision Tasks with Instruction-conditioned Multi-modal Sequence Generation

Nov 30, 2023
Via arXiv