Yida Wang

Capital Normal University, Infinigence-AI

TASP: Topology-aware Sequence Parallelism

Sep 30, 2025

Yida Wang, Ke Hong, Xiuhong Li, Yuanchao Xu, Wenxun Wang, Guohao Dai, Yu Wang

Abstract: Long-context large language models (LLMs) face constraints due to the quadratic complexity of the self-attention mechanism. The mainstream sequence parallelism (SP) method, Ring Attention, attempts to solve this by distributing the query into multiple query chunks across accelerators and enabling each Q tensor to access all KV tensors from other accelerators via the Ring AllGather communication primitive. However, it exhibits low communication efficiency, restricting its practical applicability. This inefficiency stems from the mismatch between the Ring AllGather communication primitive it adopts and the AlltoAll topology of modern accelerators. A Ring AllGather primitive is composed of iterations of ring-styled data transfer, which can only utilize a very limited fraction of an AlltoAll topology. Inspired by the Hamiltonian decomposition of complete directed graphs, we identify that modern accelerator topology can be decomposed into multiple orthogonal ring datapaths which can concurrently transfer data without interference. Based on this, we further observe that the Ring AllGather primitive can also be decomposed into the same number of concurrent ring-styled data transfers at every iteration. Based on these insights, we propose TASP, a topology-aware SP method for long-context LLMs that fully utilizes the communication capacity of modern accelerators via topology decomposition and primitive decomposition. Experimental results on both single-node and multi-node NVIDIA H100 systems and a single-node AMD MI300X system demonstrate that TASP achieves higher communication efficiency than Ring Attention on these modern accelerator topologies and achieves up to 3.58x speedup over Ring Attention and its variant Zigzag-Ring Attention. The code is available at https://github.com/infinigence/HamiltonAttention.
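The idea of splitting an all-to-all topology into rings that do not share links can be illustrated with a small sketch. This is not the TASP implementation: it uses the simple stride construction (node i sends to node i + s mod n), which yields a full-length ring only when the stride is coprime with n, so it covers every link only when n is prime. The function names `ring_decomposition` and `ring_edges` are hypothetical.

```python
from math import gcd


def ring_decomposition(n):
    """Return link-disjoint rings over n fully connected nodes.

    Illustrative assumption: ring s sends from node i to (i + s) % n.
    Only strides coprime with n visit all n nodes in one cycle, so the
    complete directed graph is fully covered only when n is prime.
    """
    rings = []
    for stride in range(1, n):
        if gcd(stride, n) != 1:
            continue  # this stride would split into several shorter cycles
        rings.append([(step * stride) % n for step in range(n)])
    return rings


def ring_edges(ring):
    """Directed links (sender, receiver) used by one ring."""
    n = len(ring)
    return {(ring[k], ring[(k + 1) % n]) for k in range(n)}


rings = ring_decomposition(5)
all_edges = [edge for ring in rings for edge in ring_edges(ring)]
for ring in rings:
    print(ring)
print(len(all_edges), "links used,", len(set(all_edges)), "distinct")
```

With 5 nodes the four rings together use each of the 5 x 4 directed links exactly once, so all rings can transfer data in the same iteration without contending for a link. For node counts such as 8, a more general Hamiltonian decomposition than this stride construction is needed.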

A Physics-Driven Neural Network with Parameter Embedding for Generating Quantitative MR Maps from Weighted Images

Aug 11, 2025

Lingjing Chen, Chengxiu Zhang, Yinqiao Yi, Yida Wang, Yang Song, Xu Yan, Shengfang Xu, Dalin Zhu, Mengqiu Cao, Yan Zhou(+2 more)

Abstract: We propose a deep learning-based approach that integrates MRI sequence parameters to improve the accuracy and generalizability of quantitative image synthesis from clinical weighted MRI. Our physics-driven neural network embeds MRI sequence parameters -- repetition time (TR), echo time (TE), and inversion time (TI) -- directly into the model via parameter embedding, enabling the network to learn the underlying physical principles of MRI signal formation. The model takes conventional T1-weighted, T2-weighted, and T2-FLAIR images as input and synthesizes T1, T2, and proton density (PD) quantitative maps. Trained on healthy brain MR images, it was evaluated on both internal and external test datasets. The proposed method achieved high performance with PSNR values exceeding 34 dB and SSIM values above 0.92 for all synthesized parameter maps. It outperformed conventional deep learning models in accuracy and robustness, including on data with previously unseen brain structures and lesions. Notably, our model accurately synthesized quantitative maps for these unseen pathological regions, highlighting its superior generalization capability. Incorporating MRI sequence parameters via parameter embedding allows the neural network to better learn the physical characteristics of MR signals, significantly enhancing the performance and reliability of quantitative MRI synthesis. This method shows great potential for accelerating qMRI and improving its clinical utility.
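To make the link between sequence parameters and weighted images concrete, the sketch below evaluates the textbook idealized signal equations for spin-echo and inversion-recovery sequences. These formulas are a common simplification and are an assumption here; the abstract does not state which signal model, if any, the network uses. The function names are hypothetical, and all times are in milliseconds.

```python
import math


def spin_echo_signal(pd, t1, t2, tr, te):
    """Idealized spin-echo magnitude: PD * (1 - exp(-TR/T1)) * exp(-TE/T2)."""
    return pd * (1.0 - math.exp(-tr / t1)) * math.exp(-te / t2)


def inversion_recovery_signal(pd, t1, t2, tr, te, ti):
    """Idealized inversion-recovery magnitude (the basis of FLAIR-type contrast)."""
    longitudinal = 1.0 - 2.0 * math.exp(-ti / t1) + math.exp(-tr / t1)
    return pd * abs(longitudinal) * math.exp(-te / t2)


# Same tissue (illustrative values), two different weightings.
tissue = {"pd": 0.8, "t1": 900.0, "t2": 90.0}
t1_weighted = spin_echo_signal(tr=500.0, te=15.0, **tissue)
t2_weighted = spin_echo_signal(tr=4000.0, te=100.0, **tissue)
print(t1_weighted, t2_weighted)
```

The same PD, T1, and T2 values produce different intensities as TR, TE, and TI change, which is why supplying those parameters to a network gives it the information needed to invert the relationship.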

TTrace: Lightweight Error Checking and Diagnosis for Distributed Training

Jun 10, 2025

Haitian Jiang, Shaowei Zhu, Zhen Zhang, Zhenyu Song, Xinwei Fu, Zhen Jia, Yida Wang, Jinyang Li

Abstract: Distributed training is essential for scaling the training of large neural network models, such as large language models (LLMs), across thousands of GPUs. However, the complexity of distributed training programs makes them particularly prone to silent bugs, which do not produce explicit error signals but lead to incorrect training outcomes. Effectively detecting and localizing such silent bugs in distributed training is challenging. Common debugging practice using metrics like training loss or gradient norm curves can be inefficient and ineffective. Additionally, obtaining intermediate tensor values and determining whether they are correct during silent bug localization is difficult, particularly in the context of low-precision training. To address these challenges, we design and implement TTrace, the first system capable of detecting and localizing silent bugs in distributed training. TTrace collects intermediate tensors from distributed training in a fine-grained manner and compares them against those from a trusted single-device reference implementation. To properly compare the floating-point values in the tensors, we propose a novel mathematical analysis that provides a guideline for setting thresholds, enabling TTrace to distinguish bug-induced errors from floating-point round-off errors. Experimental results demonstrate that TTrace effectively detects 11 existing bugs and 3 new bugs in the widely used Megatron-LM framework, while requiring fewer than 10 lines of code change. TTrace is effective in various training recipes, including low-precision recipes involving BF16 and FP8.
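The comparison step described above can be sketched as a relative-error check against a threshold tied to the precision in use. This is only an illustration of the general idea: the threshold rule (a fixed safety factor times machine epsilon) and the names `relative_error` and `looks_like_bug` are assumptions, not TTrace's analysis or API.

```python
import math

# Machine epsilon for BF16, which carries 8 bits of significand precision.
BF16_EPS = 2.0 ** -7


def relative_error(candidate, reference):
    """Relative L2 error between two flat lists of floats."""
    diff = math.sqrt(sum((c - r) ** 2 for c, r in zip(candidate, reference)))
    norm = math.sqrt(sum(r ** 2 for r in reference))
    return diff / norm if norm > 0 else diff


def looks_like_bug(candidate, reference, eps=BF16_EPS, safety_factor=4.0):
    """Flag a tensor whose error exceeds what round-off alone would explain.

    The threshold safety_factor * eps is a placeholder assumption; a real
    tool would derive it from an error analysis of the computation.
    """
    return relative_error(candidate, reference) > safety_factor * eps


reference = [1.0, 2.0, 3.0]
rounded = [1.0 + 2.0 ** -9, 2.0, 3.0 - 2.0 ** -9]  # small round-off-like noise
buggy = [1.0, 2.0, -3.0]                           # sign flip, as from a real bug
print(looks_like_bug(rounded, reference), looks_like_bug(buggy, reference))
```

The difficulty the abstract points to is choosing the threshold: too tight and ordinary low-precision round-off is reported as a bug, too loose and real bugs pass unnoticed.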

GeoDrive: 3D Geometry-Informed Driving World Model with Precise Action Control

May 29, 2025

Anthony Chen, Wenzhao Zheng, Yida Wang, Xueyang Zhang, Kun Zhan, Peng Jia, Kurt Keutzer, Shanghang Zhang

Abstract: Recent advancements in world models have revolutionized dynamic environment simulation, allowing systems to foresee future states and assess potential actions. In autonomous driving, these capabilities help vehicles anticipate the behavior of other road users, perform risk-aware planning, accelerate training in simulation, and adapt to novel scenarios, thereby enhancing safety and reliability. Current approaches struggle to maintain robust 3D geometric consistency and tend to accumulate artifacts during occlusion handling, both of which matter for reliable safety assessment in autonomous navigation tasks. To address this, we introduce GeoDrive, which explicitly integrates robust 3D geometry conditions into driving world models to enhance spatial understanding and action controllability. Specifically, we first extract a 3D representation from the input frame and then obtain its 2D rendering based on the user-specified ego-car trajectory. To enable dynamic modeling, we propose a dynamic editing module during training to enhance the renderings by editing the positions of the vehicles. Extensive experiments demonstrate that our method significantly outperforms existing models in both action accuracy and 3D spatial awareness, leading to more realistic, adaptable, and reliable scene modeling for safer autonomous driving. Additionally, our model can generalize to novel trajectories and offers interactive scene editing capabilities, such as object editing and object trajectory control.

* code will be released at https://github.com/antonioo-c/GeoDrive

Tilus: A Virtual Machine for Arbitrary Low-Precision GPGPU Computation in LLM Serving

Apr 25, 2025

Yaoyao Ding, Bohan Hou, Xiao Zhang, Allan Lin, Tianqi Chen, Cody Yu Hao, Yida Wang, Gennady Pekhimenko

Abstract: Serving Large Language Models (LLMs) is critical for AI-powered applications but demands substantial computational resources, particularly in memory bandwidth and computational throughput. Low-precision computation has emerged as a key technique to improve efficiency while reducing resource consumption. Existing approaches for generating low-precision kernels are limited to weight bit widths that are powers of two and suffer from suboptimal performance due to high-level GPU programming abstractions. These abstractions restrict critical optimizations, such as fine-grained register management and optimized memory access patterns, which are essential for efficient low-precision computations. In this paper, we introduce a virtual machine (VM) designed for General-Purpose GPU (GPGPU) computing, enabling support for low-precision data types with arbitrary bit widths while maintaining GPU programmability. The proposed VM features a thread-block-level programming model, a hierarchical memory space, a novel algebraic layout system, and extensive support for diverse low-precision data types. VM programs are compiled into highly efficient GPU programs with automatic vectorization and instruction selection. Extensive experiments demonstrate that our VM efficiently supports a full spectrum of low-precision data types, and outperforms state-of-the-art low-precision kernels on their supported types. Compared to existing compilers like Triton and Ladder, as well as hand-optimized kernels such as QuantLLM and Marlin, our VM achieves performance improvements of 1.75x, 2.61x, 1.29x and 1.03x, respectively.

* 18 pages, 15 figures
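Bit widths that are not powers of two are awkward because values straddle byte boundaries. The sketch below packs and unpacks unsigned 3-bit values on the CPU to show the storage layout problem only. It is not Tilus code and says nothing about how the VM lays out registers or memory; `pack` and `unpack` are hypothetical helper names.

```python
def pack(values, bits):
    """Pack unsigned integers of width `bits` into bytes (little-endian bit order)."""
    buffer = 0
    for index, value in enumerate(values):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{value} does not fit in {bits} bits")
        buffer |= value << (index * bits)
    n_bytes = (len(values) * bits + 7) // 8  # round up to whole bytes
    return buffer.to_bytes(n_bytes, "little")


def unpack(data, bits, count):
    """Recover `count` unsigned integers of width `bits` from packed bytes."""
    buffer = int.from_bytes(data, "little")
    mask = (1 << bits) - 1
    return [(buffer >> (index * bits)) & mask for index in range(count)]


weights = [5, 0, 7, 3, 1, 6, 2, 4]   # eight 3-bit values
packed = pack(weights, bits=3)       # 24 bits stored in 3 bytes
print(len(packed), unpack(packed, bits=3, count=len(weights)))
```

Eight 3-bit weights fit in 3 bytes instead of 8, but the third and sixth values cross byte boundaries, which is the kind of irregular access an efficient GPU kernel has to handle.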

A Virtual Machine for Arbitrary Low-Precision GPGPU Computation in LLM Serving

Apr 17, 2025

Yaoyao Ding, Bohan Hou, Xiao Zhang, Allan Lin, Tianqi Chen, Cody Yu Hao, Yida Wang, Gennady Pekhimenko

* 18 pages, 15 figures

StyledStreets: Multi-style Street Simulator with Spatial and Temporal Consistency

Mar 27, 2025

Yuyin Chen, Yida Wang, Xueyang Zhang, Kun Zhan, Peng Jia, Yifei Zhan, Xianpeng Lang

Abstract: Urban scene reconstruction requires modeling both static infrastructure and dynamic elements while supporting diverse environmental conditions. We present StyledStreets, a multi-style street simulator that achieves instruction-driven scene editing with guaranteed spatial and temporal consistency. Building on a state-of-the-art Gaussian Splatting framework for street scenarios enhanced by our proposed pose optimization and multi-view training, our method enables photorealistic style transfers across seasons, weather conditions, and camera setups through three key innovations: First, a hybrid embedding scheme disentangles persistent scene geometry from transient style attributes, allowing realistic environmental edits while preserving structural integrity. Second, uncertainty-aware rendering mitigates supervision noise from diffusion priors, enabling robust training across extreme style variations. Third, a unified parametric model prevents geometric drift through regularized updates, maintaining multi-view consistency across seven vehicle-mounted cameras. Our framework preserves the original scene's motion patterns and geometric relationships. Qualitative results demonstrate plausible transitions between diverse conditions (snow, sandstorm, night), while quantitative evaluations show state-of-the-art geometric accuracy under style transfers. The approach establishes new capabilities for urban simulation, with applications in autonomous vehicle testing and augmented reality systems requiring reliable environmental consistency. Code will be made publicly available upon publication.

* 14 pages

Movable Antenna Array Aided Ultra Reliable Covert Communications

Dec 29, 2024

Yida Wang, Guojie Hu, Xiaoling Hu, Xingbo Lu, Yuzhen Huang

Abstract: In this paper, we construct, for the first time, a framework for movable antenna (MA) aided covert communication shielded by general noise uncertainty. Based on a performance analysis of the derived closed-form expressions for the sum of the detection error probabilities and the communication outage probability, we show that perfect covertness and ultra reliability can be achieved by adjusting the antenna positions in the MA array. Then, we formulate the communication covertness maximization problem under the constraints of ultra reliability and independent discrete movable positions to optimize the transmitter's parameters. With the maximal ratio transmitting (MRT) design for the beamforming, we derive the closed-form optimal information transmit power and design a lightweight discrete projected gradient descent (DPGD) algorithm to determine the optimal antenna position. The numerical results show that the optimal achievable covertness and the feasible region of the steering angle with the MA array are significantly larger than those with the fixed-position antenna (FPA) array.

* Presented at IEEE GLOBECOM 2024

StreetCrafter: Street View Synthesis with Controllable Video Diffusion Models

Dec 17, 2024

Yunzhi Yan, Zhen Xu, Haotong Lin, Haian Jin, Haoyu Guo, Yida Wang, Kun Zhan, Xianpeng Lang, Hujun Bao, Xiaowei Zhou(+1 more)

Abstract: This paper aims to tackle the problem of photorealistic view synthesis from vehicle sensor data. Recent advancements in neural scene representation have achieved notable success in rendering high-quality autonomous driving scenes, but the performance significantly degrades as the viewpoint deviates from the training trajectory. To mitigate this problem, we introduce StreetCrafter, a novel controllable video diffusion model that utilizes LiDAR point cloud renderings as pixel-level conditions, which fully exploits the generative prior for novel view synthesis while preserving precise camera control. Moreover, the utilization of pixel-level LiDAR conditions allows us to make accurate pixel-level edits to target scenes. In addition, the generative prior of StreetCrafter can be effectively incorporated into dynamic scene representations to achieve real-time rendering. Experiments on Waymo Open Dataset and PandaSet demonstrate that our model enables flexible control over viewpoint changes, enlarges the region over which view synthesis renders satisfactorily, and outperforms existing methods.

* Project page: https://zju3dv.github.io/street_crafter

ReconDreamer: Crafting World Models for Driving Scene Reconstruction via Online Restoration

Nov 29, 2024

Chaojun Ni, Guosheng Zhao, Xiaofeng Wang, Zheng Zhu, Wenkang Qin, Guan Huang, Chen Liu, Yuyin Chen, Yida Wang, Xueyang Zhang(+6 more)

Abstract: Closed-loop simulation is crucial for end-to-end autonomous driving. Existing sensor simulation methods (e.g., NeRF and 3DGS) reconstruct driving scenes based on conditions that closely mirror training data distributions. However, these methods struggle with rendering novel trajectories, such as lane changes. Recent works have demonstrated that integrating world model knowledge alleviates these issues. Despite their efficiency, these approaches still encounter difficulties in the accurate representation of more complex maneuvers, with multi-lane shifts being a notable example. Therefore, we introduce ReconDreamer, which enhances driving scene reconstruction through incremental integration of world model knowledge. Specifically, DriveRestorer is proposed to mitigate artifacts via online restoration. This is complemented by a progressive data update strategy designed to ensure high-quality rendering for more complex maneuvers. To the best of our knowledge, ReconDreamer is the first method to render large maneuvers effectively. Experimental results demonstrate that ReconDreamer outperforms Street Gaussians in NTA-IoU, NTL-IoU, and FID, with relative improvements of 24.87%, 6.72%, and 29.97%, respectively. Furthermore, ReconDreamer surpasses DriveDreamer4D with PVG during large maneuver rendering, as verified by a relative improvement of 195.87% in the NTA-IoU metric and a comprehensive user study.

* Project Page: https://recondreamer.github.io
