
Mingze Wang

On the Learning Dynamics of Two-layer Linear Networks with Label Noise SGD

Mar 11, 2026

Accelerating LLM Pre-Training through Flat-Direction Dynamics Enhancement

Feb 26, 2026

Fast Catch-Up, Late Switching: Optimal Batch Size Scheduling via Functional Scaling Laws

Feb 15, 2026

A Single Merging Suffices: Recovering Server-based Learning Performance in Decentralized Learning

Jul 09, 2025

GradPower: Powering Gradients for Faster Language Model Pre-Training

May 30, 2025

On the Expressive Power of Mixture-of-Experts for Structured Complex Tasks

May 30, 2025

The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training

Feb 26, 2025

CCExpert: Advancing MLLM Capability in Remote Sensing Change Captioning with Difference-Aware Integration and a Foundational Dataset

Nov 18, 2024

How Transformers Implement Induction Heads: Approximation and Optimization Analysis

Oct 15, 2024

Sharpness-Aware Minimization Efficiently Selects Flatter Minima Late in Training

Oct 14, 2024