Leyang Wang

Zero-Flow Encoders

Jan 31, 2026

Yakun Wang, Leyang Wang, Song Liu, Taiji Suzuki

Abstract: Flow-based methods have achieved significant success in various generative modeling tasks, capturing nuanced details within complex data distributions. However, few existing works have exploited this unique capability to resolve fine-grained structural details beyond generation tasks. This paper presents a flow-inspired framework for representation learning. First, we demonstrate that a rectified flow trained using independent coupling is zero everywhere at $t=0.5$ if and only if the source and target distributions are identical. We term this property the zero-flow criterion. Second, we show that this criterion can certify conditional independence, thereby extracting sufficient information from the data. Third, we translate this criterion into a tractable, simulation-free loss function that enables learning amortized Markov blankets in graphical models and latent representations in self-supervised learning tasks. Experiments on both simulated and real-world datasets demonstrate the effectiveness of our approach. The code reproducing our experiments can be found at: https://github.com/probabilityFLOW/zfe.

* Yakun Wang and Leyang Wang contributed equally to this work
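
The zero-flow criterion stated in the abstract can be probed numerically. The following is a minimal sketch, not the authors' code (their implementation is in the repository linked in the abstract). It assumes 1-D Gaussian source and target distributions, and the function name `midpoint_velocity`, the binning scheme, and all parameter values are hypothetical choices made for illustration. It relies on the standard fact that the optimal rectified-flow velocity at a point is the conditional mean of $x_1 - x_0$ given the interpolant $x_t = (1-t)x_0 + t x_1$, and estimates that mean at $t=0.5$ by averaging within bins.

```python
import random
from statistics import fmean


def midpoint_velocity(sample_source, sample_target, n=200_000,
                      bins=6, lo=-1.5, hi=1.5, seed=0):
    """Binned Monte Carlo estimate of E[x1 - x0 | x_t] at t = 0.5 (1-D data).

    sample_source / sample_target are callables taking a random.Random
    instance and returning one float. All names here are illustrative.
    """
    rng = random.Random(seed)
    buckets = [[] for _ in range(bins)]
    width = (hi - lo) / bins
    for _ in range(n):
        # Independent coupling: source and target are drawn separately.
        x0 = sample_source(rng)
        x1 = sample_target(rng)
        xt = 0.5 * x0 + 0.5 * x1  # linear interpolant at t = 0.5
        if lo <= xt < hi:
            idx = min(int((xt - lo) / width), bins - 1)
            buckets[idx].append(x1 - x0)  # regression target of the flow
    # Average target per bin approximates the velocity field on that bin.
    return [fmean(b) for b in buckets if b]


def standard_normal(rng):
    return rng.gauss(0.0, 1.0)


def shifted_normal(rng):
    return rng.gauss(1.0, 1.0)


# Identical source and target versus a target with a shifted mean.
same = midpoint_velocity(standard_normal, standard_normal)
different = midpoint_velocity(standard_normal, shifted_normal)
print("identical :", [round(v, 3) for v in same])
print("different :", [round(v, 3) for v in different])
```

With identical distributions the binned estimates stay close to zero, while shifting the target mean to 1 gives estimates close to 1 in every bin. This checks only one direction of the criterion on one example and is no substitute for the paper's proof.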

VarDiU: A Variational Diffusive Upper Bound for One-Step Diffusion Distillation

Aug 28, 2025

Leyang Wang, Mingtian Zhang, Zijing Ou, David Barber

Abstract: Recently, diffusion distillation methods have compressed thousand-step teacher diffusion models into one-step student generators while preserving sample quality. Most existing approaches train the student model using a diffusive divergence whose gradient is approximated via the student's score function, learned through denoising score matching (DSM). Since DSM training is imperfect, the resulting gradient estimate is inevitably biased, leading to sub-optimal performance. In this paper, we propose VarDiU (pronounced /va:rdju:/), a Variational Diffusive Upper Bound that admits an unbiased gradient estimator and can be directly applied to diffusion distillation. Using this objective, we compare our method with Diff-Instruct and demonstrate that it achieves higher generation quality and enables a more efficient and stable training procedure for one-step diffusion distillation.

* Leyang Wang and Mingtian Zhang contributed equally to this work

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

Jun 16, 2025

MiniMax, Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu (+118 more)

Abstract: We introduce MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model is developed based on our previous MiniMax-Text-01 model, which contains a total of 456 billion parameters with 45.9 billion parameters activated per token. The M1 model natively supports a context length of 1 million tokens, 8x the context size of DeepSeek R1. Furthermore, the lightning attention mechanism in MiniMax-M1 enables efficient scaling of test-time compute. These properties make M1 particularly suitable for complex tasks that require processing long inputs and thinking extensively. MiniMax-M1 is trained using large-scale reinforcement learning (RL) on diverse problems including sandbox-based, real-world software engineering environments. In addition to M1's inherent efficiency advantage for RL training, we propose CISPO, a novel RL algorithm to further enhance RL efficiency. CISPO clips importance sampling weights rather than token updates, outperforming other competitive RL variants. Combining hybrid-attention and CISPO enables MiniMax-M1's full RL training on 512 H800 GPUs to complete in only three weeks, with a rental cost of just $534,700. We release two versions of MiniMax-M1 models with 40K and 80K thinking budgets respectively, where the 40K model represents an intermediate phase of the 80K training. Experiments on standard benchmarks show that our models are comparable or superior to strong open-weight models such as the original DeepSeek-R1 and Qwen3-235B, with particular strengths in complex software engineering, tool utilization, and long-context tasks. We publicly release MiniMax-M1 at https://github.com/MiniMax-AI/MiniMax-M1.

* A technical report from MiniMax. The authors are listed in alphabetical order. We open-source our MiniMax-M1 at https://github.com/MiniMax-AI/MiniMax-M1
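
The figures quoted in the MiniMax-M1 abstract imply a few derived quantities. The sketch below works them out. It assumes that "three weeks" means exactly 21 days with all 512 GPUs in continuous use, which the abstract does not state, so the per-GPU-hour figure is only indicative. The variable names are illustrative.

```python
# Figures quoted in the MiniMax-M1 abstract.
NUM_GPUS = 512
RENTAL_COST_USD = 534_700
TOTAL_PARAMS_B = 456.0      # billions of parameters in total
ACTIVE_PARAMS_B = 45.9      # billions of parameters activated per token
M1_CONTEXT_TOKENS = 1_000_000
CONTEXT_RATIO_VS_R1 = 8     # "8x the context size of DeepSeek R1"

# Assumption (not in the abstract): exactly 21 days, all GPUs busy throughout.
ASSUMED_DAYS = 21

gpu_hours = NUM_GPUS * ASSUMED_DAYS * 24
cost_per_gpu_hour = RENTAL_COST_USD / gpu_hours
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
implied_r1_context = M1_CONTEXT_TOKENS / CONTEXT_RATIO_VS_R1

print(f"GPU-hours (assumed 21 days): {gpu_hours:,}")
print(f"Implied cost per GPU-hour:   ${cost_per_gpu_hour:.2f}")
print(f"Active parameter fraction:   {active_fraction:.1%}")
print(f"Implied R1 context length:   {implied_r1_context:,.0f} tokens")
```

Under that assumption the run amounts to 258,048 GPU-hours, or roughly $2.07 per GPU-hour, and about 10% of the parameters are active for each token.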

Missing Data Imputation by Reducing Mutual Information with Rectified Flows

May 16, 2025

Jiahao Yu, Qizhen Ying, Leyang Wang, Ziyue Jiang, Song Liu

Abstract: This paper introduces a novel iterative method for missing data imputation that sequentially reduces the mutual information between data and their corresponding missing mask. Inspired by GAN-based approaches, which train generators to decrease the predictability of missingness patterns, our method explicitly targets the reduction of mutual information. Specifically, our algorithm iteratively minimizes the KL divergence between the joint distribution of the imputed data and missing mask, and the product of their marginals from the previous iteration. We show that the optimal imputation under this framework corresponds to solving an ODE, whose velocity field minimizes a rectified flow training objective. We further illustrate that some existing imputation techniques can be interpreted as approximate special cases of our mutual-information-reducing framework. Comprehensive experiments on synthetic and real-world datasets validate the efficacy of our proposed approach, demonstrating superior imputation performance.

LaPIG: Cross-Modal Generation of Paired Thermal and Visible Facial Images

Mar 20, 2025

Leyang Wang, Joice Lin

Abstract: The success of modern machine learning, particularly in facial translation networks, is highly dependent on the availability of high-quality, paired, large-scale datasets. However, acquiring sufficient data is often challenging and costly. Inspired by the recent success of diffusion models in high-quality image synthesis and advancements in Large Language Models (LLMs), we propose a novel framework called LLM-assisted Paired Image Generation (LaPIG). This framework enables the construction of comprehensive, high-quality paired visible and thermal images using captions generated by LLMs. Our method encompasses three parts: visible image synthesis with ArcFace embedding, thermal image translation using Latent Diffusion Models (LDMs), and caption generation with LLMs. Our approach not only generates multi-view paired visible and thermal images to increase data diversity but also produces high-quality paired data while maintaining their identity information. We evaluate our method on public datasets by comparing it with existing methods, demonstrating the superiority of LaPIG.

Guiding Time-Varying Generative Models with Natural Gradients on Exponential Family Manifold

Feb 11, 2025

Song Liu, Leyang Wang, Yakun Wang

Abstract: Optimising probabilistic models is a well-studied field in statistics. However, its connection with the training of generative models remains largely under-explored. In this paper, we show that the evolution of time-varying generative models can be projected onto an exponential family manifold, naturally creating a link between the parameters of a generative model and those of a probabilistic model. We then train the generative model by moving its projection on the manifold according to the natural gradient descent scheme. This approach also allows us to approximate the natural gradient of the KL divergence efficiently without relying on MCMC for intractable models. Furthermore, we propose particle versions of the algorithm, which feature closed-form update rules for any parametric model within the exponential family. Through toy and real-world experiments, we validate the effectiveness of the proposed algorithms.

MiniMax-01: Scaling Foundation Models with Lightning Attention

Jan 14, 2025

MiniMax, Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen (+80 more)

Abstract: We introduce the MiniMax-01 series, including MiniMax-Text-01 and MiniMax-VL-01, which are comparable to top-tier models while offering superior capabilities in processing longer contexts. The core lies in lightning attention and its efficient scaling. To maximize computational capacity, we integrate it with Mixture of Experts (MoE), creating a model with 32 experts and 456 billion total parameters, of which 45.9 billion are activated for each token. We develop an optimized parallel strategy and highly efficient computation-communication overlap techniques for MoE and lightning attention. This approach enables us to conduct efficient training and inference on models with hundreds of billions of parameters across contexts spanning millions of tokens. The context window of MiniMax-Text-01 can reach up to 1 million tokens during training and extrapolate to 4 million tokens during inference at an affordable cost. Our vision-language model, MiniMax-VL-01, is built through continued training with 512 billion vision-language tokens. Experiments on both standard and in-house benchmarks show that our models match the performance of state-of-the-art models like GPT-4o and Claude-3.5-Sonnet while offering a 20-32 times longer context window. We publicly release MiniMax-01 at https://github.com/MiniMax-AI.

* A technical report from MiniMax. The authors are listed in alphabetical order. We open-sourced our MiniMax-01 at https://github.com/MiniMax-AI

High-Dimensional Differential Parameter Inference in Exponential Family using Time Score Matching

Oct 14, 2024

Daniel J. Williams, Leyang Wang, Qizhen Ying, Song Liu, Mladen Kolar

Abstract: This paper addresses differential inference in time-varying parametric probabilistic models, like graphical models with changing structures. Instead of estimating a high-dimensional model at each time and inferring changes later, we directly learn the differential parameter, i.e., the time derivative of the parameter. The main idea is treating the time score function of an exponential family model as a linear model of the differential parameter for direct estimation. We use time score matching to estimate parameter derivatives. We prove the consistency of a regularized score matching objective and demonstrate the finite-sample normality of a debiased estimator in high-dimensional settings. Our methodology effectively infers differential structures in high-dimensional graphical models, verified on simulated and real-world datasets.

* Daniel J. Williams and Leyang Wang contributed equally to this work
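
The abstract's main idea, that the time score of an exponential family model is linear in the differential parameter, can be made concrete with a short derivation. This is a sketch under an assumed notation that does not come from the paper: a natural-parameter exponential family with sufficient statistic $f(x)$, log-partition function $A$, and a base measure $h(x)$ that does not depend on $t$.

```latex
\begin{align*}
\log p(x;\theta(t)) &= \langle \theta(t), f(x) \rangle - A(\theta(t)) + \log h(x), \\
\partial_t \log p(x;\theta(t)) &= \langle \dot\theta(t), f(x) \rangle
    - \langle \nabla A(\theta(t)), \dot\theta(t) \rangle \\
&= \big\langle \dot\theta(t),\; f(x) - \mathbb{E}_{p(\cdot;\theta(t))}[f(x)] \big\rangle .
\end{align*}
```

The last step uses the standard identity $\nabla A(\theta) = \mathbb{E}_{p(\cdot;\theta)}[f(x)]$. Under these assumptions the time score is a linear function of $\dot\theta(t)$ whose features are the centered sufficient statistics, which is consistent with the abstract's description of estimating the differential parameter directly.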
