Sequence Parallelism
Sequence parallelism is a memory-efficient parallelism technique that helps overcome input sequence length limits and enables efficient training with longer sequences on GPUs. It extends tensor-level model parallelism by distributing compute and activation memory across multiple GPUs along the sequence dimension of transformer layers. This is particularly useful for the portions of a transformer layer that tensor parallelism previously left unparallelized, such as LayerNorm and dropout, reducing per-GPU activation memory and improving overall training efficiency.
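
As a minimal single-process sketch of the idea (not the NeMo/Megatron implementation), the snippet below splits an activation tensor laid out as [sequence, batch, hidden] into shards along the sequence dimension, applies an otherwise-replicated op (LayerNorm) to each shard independently, and then gathers the results. In real training each shard would live on a different GPU and the gather would be an all-gather collective; the tensor shapes and the name `tp_size` are illustrative assumptions.

```python
import torch
import torch.nn as nn

tp_size = 4                                # number of GPUs in the tensor-parallel group (assumed)
seq_len, batch, hidden = 2048, 8, 1024
x = torch.randn(seq_len, batch, hidden)    # activations in [s, b, h] layout

layer_norm = nn.LayerNorm(hidden)

# Scatter: each rank keeps seq_len / tp_size tokens, so the activation
# memory for this op shrinks by a factor of tp_size per GPU.
shards = torch.chunk(x, tp_size, dim=0)

# Each rank applies the (previously replicated) op only to its own shard.
local_outputs = [layer_norm(shard) for shard in shards]

# All-gather along the sequence dimension before the next region that
# needs the full sequence (simulated here with a concatenation).
y = torch.cat(local_outputs, dim=0)

assert y.shape == x.shape
```

In Megatron-style sequence parallelism, these scatter/gather points replace the existing tensor-parallel all-reduce with a reduce-scatter and an all-gather, so the total communication volume stays comparable to tensor parallelism alone.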

A Decomposition-based State Space Model for Multivariate Time-Series Forecasting

Feb 05, 2026

AgenticTagger: Structured Item Representation for Recommendation with LLM Agents

Feb 05, 2026

Beyond Tokens: Semantic-Aware Speculative Decoding for Efficient Inference by Probing Internal States

Feb 03, 2026

Sequential Group Composition: A Window into the Mechanics of Deep Learning

Feb 03, 2026

P-EAGLE: Parallel-Drafting EAGLE with Scalable Training

Feb 01, 2026

Prism: Efficient Test-Time Scaling via Hierarchical Search and Self-Verification for Discrete Diffusion Language Models

Feb 02, 2026

A Multi-scale Linear-time Encoder for Whole-Slide Image Analysis

Feb 02, 2026

CoMeT: Collaborative Memory Transformer for Efficient Long Context Modeling

Feb 02, 2026

Parallel Training in Spiking Neural Networks

Feb 01, 2026

Scalable Generative Game Engine: Breaking the Resolution Wall via Hardware-Algorithm Co-Design

Jan 31, 2026