Tong Wu

HiFlow: Training-free High-Resolution Image Generation with Flow-Aligned Guidance

Apr 08, 2025

Jiazi Bu, Pengyang Ling, Yujie Zhou, Pan Zhang, Tong Wu, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang

Abstract:Text-to-image (T2I) diffusion/flow models have drawn considerable attention recently due to their remarkable ability to deliver flexible visual creations. Still, high-resolution image synthesis presents formidable challenges due to the scarcity and complexity of high-resolution content. To this end, we present HiFlow, a training-free and model-agnostic framework to unlock the resolution potential of pre-trained flow models. Specifically, HiFlow establishes a virtual reference flow within the high-resolution space that effectively captures the characteristics of low-resolution flow information, offering guidance for high-resolution generation through three key aspects: initialization alignment for low-frequency consistency, direction alignment for structure preservation, and acceleration alignment for detail fidelity. By leveraging this flow-aligned guidance, HiFlow substantially elevates the quality of high-resolution image synthesis of T2I models and demonstrates versatility across their personalized variants. Extensive experiments validate that HiFlow achieves superior high-resolution image quality over current state-of-the-art methods.

Effectively Controlling Reasoning Models through Thinking Intervention

Mar 31, 2025

Tong Wu, Chong Xiang, Jiachen T. Wang, Prateek Mittal

Abstract:Reasoning-enhanced large language models (LLMs) explicitly generate intermediate reasoning steps prior to generating final answers, helping the model excel in complex problem-solving. In this paper, we demonstrate that this emerging generation framework offers a unique opportunity for more fine-grained control over model behavior. We propose Thinking Intervention, a novel paradigm designed to explicitly guide the internal reasoning processes of LLMs by strategically inserting or revising specific thinking tokens. We conduct comprehensive evaluations across multiple tasks, including instruction following on IFEval, instruction hierarchy on SEP, and safety alignment on XSTest and SORRY-Bench. Our results demonstrate that Thinking Intervention significantly outperforms baseline prompting approaches, achieving up to 6.7% accuracy gains in instruction-following scenarios, 15.4% improvements in reasoning about instruction hierarchies, and a 40.0% increase in refusal rates for unsafe prompts using open-source DeepSeek R1 models. Overall, our work opens a promising new research avenue for controlling reasoning LLMs.
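
The idea of inserting guidance into a model's reasoning can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the model opens its reasoning with a `<think>` delimiter, and the function name `apply_thinking_intervention` and its parameters are hypothetical.

```python
def apply_thinking_intervention(prompt: str, intervention: str,
                                think_open: str = "<think>") -> str:
    """Build text for the model to continue, with guidance at the start of its reasoning.

    Assumption: the model marks the start of its reasoning with `think_open`.
    The reasoning block is deliberately left unclosed, so the model keeps
    thinking after the inserted sentence instead of answering right away.
    """
    return f"{prompt}\n{think_open}\n{intervention.strip()}\n"
```

For example, passing an instruction-following prompt together with an intervention such as "I must follow the formatting constraint in the user's request." yields text whose reasoning already begins with that sentence.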

OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts

Mar 29, 2025

Yuxuan Wang, Yueqian Wang, Bo Chen, Tong Wu, Dongyan Zhao, Zilong Zheng

Abstract:The rapid advancement of multi-modal language models (MLLMs) like GPT-4o has propelled the development of Omni language models, designed to process and proactively respond to continuous streams of multi-modal data. Despite their potential, evaluating their real-world interactive capabilities in streaming video contexts remains a formidable challenge. In this work, we introduce OmniMMI, a comprehensive multi-modal interaction benchmark tailored for OmniLLMs in streaming video contexts. OmniMMI encompasses over 1,121 videos and 2,290 questions, addressing two critical yet underexplored challenges in existing video benchmarks: streaming video understanding and proactive reasoning, across six distinct subtasks. Moreover, we propose a novel framework, Multi-modal Multiplexing Modeling (M4), designed to enable an inference-efficient streaming model that can see and listen while generating.

* To appear at CVPR 2025

3DGen-Bench: Comprehensive Benchmark Suite for 3D Generative Models

Mar 27, 2025

Yuhan Zhang, Mengchen Zhang, Tong Wu, Tengfei Wang, Gordon Wetzstein, Dahua Lin, Ziwei Liu

Abstract:3D generation is experiencing rapid advancements, while the development of 3D evaluation has not kept pace. How to keep automatic evaluation equitably aligned with human perception has become a well-recognized challenge. Recent advances in the field of language and image generation have explored human preferences and showcased respectable fitting ability. However, the 3D domain still lacks such a comprehensive preference dataset over generative models. To mitigate this absence, we develop 3DGen-Arena, an integrated platform in a battle manner. Then, we carefully design diverse text and image prompts and leverage the arena platform to gather human preferences from both public users and expert annotators, resulting in a large-scale multi-dimension human preference dataset 3DGen-Bench. Using this dataset, we further train a CLIP-based scoring model, 3DGen-Score, and an MLLM-based automatic evaluator, 3DGen-Eval. These two models innovatively unify the quality evaluation of text-to-3D and image-to-3D generation, and jointly form our automated evaluation system with their respective strengths. Extensive experiments demonstrate the efficacy of our scoring model in predicting human preferences, exhibiting a superior correlation with human ranks compared to existing metrics. We believe that our 3DGen-Bench dataset and automated evaluation system will foster a more equitable evaluation in the field of 3D generation, further promoting the development of 3D generative models and their downstream applications.

UoR-NCL at SemEval-2025 Task 1: Using Generative LLMs and CLIP Models for Multilingual Multimodal Idiomaticity Representation

Feb 28, 2025

Thanet Markchom, Tong Wu, Liting Huang, Huizhi Liang

Abstract:SemEval-2025 Task 1 focuses on ranking images based on their alignment with a given nominal compound that may carry idiomatic meaning in both English and Brazilian Portuguese. To address this challenge, this work uses generative large language models (LLMs) and multilingual CLIP models to enhance idiomatic compound representations. LLMs generate idiomatic meanings for potentially idiomatic compounds, enriching their semantic interpretation. These meanings are then encoded using multilingual CLIP models, serving as representations for image ranking. Contrastive learning and data augmentation techniques are applied to fine-tune these embeddings for improved performance. Experimental results show that multimodal representations extracted through this method outperformed those based solely on the original nominal compounds. The fine-tuning approach shows promising outcomes but is less effective than using embeddings without fine-tuning. The source code used in this paper is available at https://github.com/tongwu17/SemEval-2025-Task1-UoR-NCL.
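
The ranking step can be sketched with toy vectors. This is an illustration only: it assumes images are ranked by cosine similarity between a text embedding and each image embedding (a common choice with CLIP-style models, not confirmed by the abstract), and the function names are hypothetical. The authors' actual code is in the repository linked above.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_images(text_embedding: Sequence[float],
                image_embeddings: Sequence[Sequence[float]]) -> List[int]:
    """Return image indices ordered from most to least similar to the text."""
    scores = [cosine_similarity(text_embedding, e) for e in image_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

In this sketch the text embedding would come from encoding the generated idiomatic meaning rather than the bare compound.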

From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens

Feb 26, 2025

Tong Wu, Junzhe Shen, Zixia Jia, Yuxuan Wang, Zilong Zheng

Abstract:Generating ultra-long sequences with large language models (LLMs) has become increasingly crucial but remains a highly time-intensive task, particularly for sequences up to 100K tokens. While traditional speculative decoding methods exist, simply extending their generation limits fails to accelerate the process and can be detrimental. Through an in-depth analysis, we identify three major challenges hindering efficient generation: frequent model reloading, dynamic key-value (KV) management and repetitive generation. To address these issues, we introduce TOKENSWIFT, a novel framework designed to substantially accelerate the generation process of ultra-long sequences while maintaining the target model's inherent quality. Experimental results demonstrate that TOKENSWIFT achieves over 3 times speedup across models of varying scales (1.5B, 7B, 8B, 14B) and architectures (MHA, GQA). This acceleration translates to hours of time savings for ultra-long sequence generation, establishing TOKENSWIFT as a scalable and effective solution at unprecedented lengths. Code can be found at https://github.com/bigai-nlco/TokenSwift.

Light-A-Video: Training-free Video Relighting via Progressive Light Fusion

Feb 12, 2025

Yujie Zhou, Jiazi Bu, Pengyang Ling, Pan Zhang, Tong Wu, Qidong Huang, Jinsong Li, Xiaoyi Dong, Yuhang Zang, Yuhang Cao (+3 more)

Abstract:Recent advancements in image relighting models, driven by large-scale datasets and pre-trained diffusion models, have enabled the imposition of consistent lighting. However, video relighting still lags, primarily due to the excessive training costs and the scarcity of diverse, high-quality video relighting datasets. A simple application of image relighting models on a frame-by-frame basis leads to several issues: lighting source inconsistency and relighted appearance inconsistency, resulting in flickers in the generated videos. In this work, we propose Light-A-Video, a training-free approach to achieve temporally smooth video relighting. Adapted from image relighting models, Light-A-Video introduces two key techniques to enhance lighting consistency. First, we design a Consistent Light Attention (CLA) module, which enhances cross-frame interactions within the self-attention layers to stabilize the generation of the background lighting source. Second, leveraging the physical principle of light transport independence, we apply linear blending between the source video's appearance and the relighted appearance, using a Progressive Light Fusion (PLF) strategy to ensure smooth temporal transitions in illumination. Experiments show that Light-A-Video improves the temporal consistency of relighted video while maintaining the image quality, ensuring coherent lighting transitions across frames. Project page: https://bujiazi.github.io/light-a-video.github.io/.
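
The linear blending mentioned in the abstract can be sketched on flat lists of pixel values. This is a simplified illustration, not the authors' method: the linearly increasing weight schedule and the function names are assumptions made for the example.

```python
from typing import List, Sequence


def linear_blend(source: Sequence[float], relit: Sequence[float],
                 weight: float) -> List[float]:
    """Per-pixel blend: (1 - weight) * source + weight * relit."""
    return [(1.0 - weight) * s + weight * r for s, r in zip(source, relit)]


def progressive_weights(num_steps: int) -> List[float]:
    """Hypothetical schedule: the relit weight rises linearly from 0 to 1."""
    if num_steps < 2:
        return [1.0]
    return [step / (num_steps - 1) for step in range(num_steps)]
```

With this schedule, early steps stay close to the source appearance and later steps move gradually toward the relighted appearance.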

RelightVid: Temporal-Consistent Diffusion Model for Video Relighting

Jan 27, 2025

Ye Fang, Zeyi Sun, Shangzhan Zhang, Tong Wu, Yinghao Xu, Pan Zhang, Jiaqi Wang, Gordon Wetzstein, Dahua Lin

Abstract:Diffusion models have demonstrated remarkable success in image generation and editing, with recent advancements enabling albedo-preserving image relighting. However, applying these models to video relighting remains challenging due to the lack of paired video relighting datasets and the high demands for output fidelity and temporal consistency, further complicated by the inherent randomness of diffusion models. To address these challenges, we introduce RelightVid, a flexible framework for video relighting that can accept background video, text prompts, or environment maps as relighting conditions. Trained on in-the-wild videos with carefully designed illumination augmentations and rendered videos under extreme dynamic lighting, RelightVid achieves arbitrary video relighting with high temporal consistency without intrinsic decomposition while preserving the illumination priors of its image backbone.

Geometric Mean Improves Loss For Few-Shot Learning

Jan 24, 2025

Tong Wu, Takumi Kobayashi

Abstract:Few-shot learning (FSL) is a challenging task in machine learning, requiring a model to perform discriminative classification using only a few labeled samples. In the FSL literature, deep models are trained by metric learning to provide a metric in a feature space that generalizes well to classifying samples of novel classes; in that space, even a small number of labeled training examples can construct an effective classifier. In this paper, we propose a novel FSL loss based on the geometric mean to embed a discriminative metric into deep features. In contrast to other losses, such as those using the arithmetic mean in a softmax-based formulation, the proposed method leverages the geometric mean to aggregate pair-wise relationships among samples, enhancing the discriminative metric across class categories. The proposed loss is not only simple in form but is also thoroughly analyzed theoretically to reveal characteristics that are favorable for learning a feature metric in FSL. In experiments on few-shot image classification tasks, the method produces competitive performance in comparison to other losses.
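
The difference between the two ways of aggregating can be seen in a small sketch. It only contrasts the arithmetic and geometric means of a list of positive scores; it is not the loss proposed in the paper, and the function names are chosen for illustration.

```python
import math
from typing import Sequence


def arithmetic_mean(values: Sequence[float]) -> float:
    """Sum of the values divided by their count."""
    return sum(values) / len(values)


def geometric_mean(values: Sequence[float]) -> float:
    """exp of the mean log; all values must be strictly positive."""
    return math.exp(sum(math.log(v) for v in values) / len(values))
```

For the scores 0.9, 0.9 and 0.1, the arithmetic mean is about 0.63 while the geometric mean is about 0.43, so a single weak pair-wise score lowers the geometric mean much more. The geometric mean never exceeds the arithmetic mean for positive values.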

Diffusion Model-Based Data Synthesis Aided Federated Semi-Supervised Learning

Jan 04, 2025

Zhongwei Wang, Tong Wu, Zhiyong Chen, Liang Qian, Yin Xu, Meixia Tao

Abstract:Federated semi-supervised learning (FSSL) is primarily challenged by two factors: the scarcity of labeled data across clients and the non-independent and identically distributed (non-IID) nature of data among clients. In this paper, we propose a novel approach, diffusion model-based data synthesis aided FSSL (DDSA-FSSL), which utilizes a diffusion model (DM) to generate synthetic data, bridging the gap between heterogeneous local data distributions and the global data distribution. In DDSA-FSSL, clients address the challenge of the scarcity of labeled data by employing a federated learning-trained classifier to perform pseudo labeling for unlabeled data. The DM is then collaboratively trained using both labeled and precision-optimized pseudo-labeled data, enabling clients to generate synthetic samples for classes that are absent in their labeled datasets. This process allows clients to generate more comprehensive synthetic datasets aligned with the global distribution. Extensive experiments conducted on multiple datasets and varying non-IID distributions demonstrate the effectiveness of DDSA-FSSL, e.g., it improves accuracy from 38.46% to 52.14% on CIFAR-10 datasets with 10% labeled data.
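
The pseudo-labeling step can be sketched as follows. The abstract does not say how the pseudo-labels are made precise, so the confidence threshold used here is an assumption (a common filtering technique), and the function name and default value are hypothetical.

```python
from typing import List, Sequence, Tuple


def pseudo_label(probabilities: Sequence[Sequence[float]],
                 threshold: float = 0.9) -> List[Tuple[int, int]]:
    """Return (sample_index, class_index) pairs for confident predictions.

    Assumption: a prediction is kept only when the classifier's top class
    probability is at least `threshold`; other samples stay unlabeled.
    """
    labeled = []
    for i, probs in enumerate(probabilities):
        best = max(range(len(probs)), key=lambda c: probs[c])
        if probs[best] >= threshold:
            labeled.append((i, best))
    return labeled
```

Raising the threshold keeps fewer but more reliable pseudo-labels, which trades coverage for precision.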

* accepted by IEEE WCNC 2025
