Kaiyue Wen

PaTH Attention: Position Encoding via Accumulating Householder Transformations

May 22, 2025
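
The title points at position encoding built from accumulated Householder transformations. As a purely illustrative sketch (not the PaTH implementation), a Householder reflection H = I - 2 v v^T / ||v||^2 is orthogonal, and a running product of such reflections gives each position its own orthogonal transform; all shapes and the per-position vectors below are assumptions for illustration.

# Illustrative sketch only: Householder reflections composed per position.
# This is NOT the PaTH implementation, just the underlying linear-algebra idea.
import torch

def householder(v: torch.Tensor) -> torch.Tensor:
    """Return H = I - 2 v v^T / ||v||^2, an orthogonal reflection."""
    v = v / v.norm()
    return torch.eye(v.numel()) - 2.0 * torch.outer(v, v)

d, seq_len = 8, 4
torch.manual_seed(0)
vs = torch.randn(seq_len, d)  # one (hypothetical) vector per position

# Accumulate reflections: position t gets the product H_1 H_2 ... H_t.
transforms, acc = [], torch.eye(d)
for t in range(seq_len):
    acc = acc @ householder(vs[t])
    transforms.append(acc)

# Each accumulated transform stays orthogonal up to numerical precision.
for T in transforms:
    assert torch.allclose(T @ T.T, torch.eye(d), atol=1e-5)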

Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

May 10, 2025
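
A common form of gated attention applies an elementwise sigmoid gate, computed from the layer input, to the attention output. The sketch below shows that generic pattern only; the gate placement and form here are assumptions, not necessarily the variant studied in the paper.

# Illustrative sketch of elementwise output gating on attention.
import torch
import torch.nn as nn

class GatedSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(d_model, d_model)  # per-channel gate values

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x, need_weights=False)
        return out * torch.sigmoid(self.gate(x))  # sigmoid gate on attention output

x = torch.randn(2, 16, 64)
y = GatedSelfAttention(64, 8)(x)  # shape (2, 16, 64)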

Weight Ensembling Improves Reasoning in Language Models

Apr 15, 2025
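
Weight ensembling commonly means interpolating or averaging the parameters of multiple checkpoints rather than ensembling their outputs. The sketch below shows plain linear interpolation of two state dicts; the interpolation scheme and coefficient are assumptions, not the paper's recipe.

# Illustrative sketch: linear interpolation of two checkpoints' weights.
import torch

def interpolate_state_dicts(sd_a, sd_b, alpha: float = 0.5):
    """Return a state dict with parameters alpha * a + (1 - alpha) * b."""
    return {k: alpha * sd_a[k] + (1.0 - alpha) * sd_b[k] for k in sd_a}

# Usage with two identically shaped (hypothetical) checkpoints:
model_a, model_b, merged = (torch.nn.Linear(4, 4) for _ in range(3))
merged.load_state_dict(
    interpolate_state_dicts(model_a.state_dict(), model_b.state_dict())
)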

Overtrained Language Models Are Harder to Fine-Tune

Mar 24, 2025

Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images

Feb 19, 2025

Task Generalization With AutoRegressive Compositional Structure: Can Learning From $d$ Tasks Generalize to $d^{T}$ Tasks?

Feb 13, 2025

Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models

Jan 21, 2025
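
For context on the quantity being implemented: the widely used auxiliary load-balancing loss for mixture-of-experts routing multiplies, per expert, the fraction of tokens dispatched to it by its mean router probability. The sketch below is that standard formulation with top-1 routing assumed; it is not necessarily the exact variant analyzed in the paper.

# Illustrative sketch of the common auxiliary load-balancing loss for MoE routing.
import torch

def load_balancing_loss(router_logits: torch.Tensor, num_experts: int) -> torch.Tensor:
    """router_logits: (num_tokens, num_experts); top-1 routing assumed."""
    probs = torch.softmax(router_logits, dim=-1)      # (T, E)
    assigned = probs.argmax(dim=-1)                   # top-1 expert per token
    # f_i: fraction of tokens dispatched to expert i
    f = torch.bincount(assigned, minlength=num_experts).float() / router_logits.size(0)
    # p_i: mean router probability assigned to expert i
    p = probs.mean(dim=0)
    return num_experts * torch.sum(f * p)             # minimized by a uniform load

loss = load_balancing_loss(torch.randn(1024, 8), num_experts=8)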

Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective

Oct 07, 2024
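
A warmup-stable-decay (WSD) schedule ramps the learning rate up, holds it constant for most of training, then decays it at the end. The sketch below is a generic version of that shape; the warmup/decay fractions and the linear decay form are assumptions, not the paper's exact setup.

# Illustrative sketch of a generic warmup-stable-decay (WSD) schedule.
def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_frac: float = 0.01, decay_frac: float = 0.1) -> float:
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:                       # linear warmup
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:                        # stable (constant) phase
        return peak_lr
    # linear decay to zero over the final fraction of training
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)

lrs = [wsd_lr(s, total_steps=10_000, peak_lr=3e-4) for s in range(10_000)]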

From Sparse Dependence to Sparse Attention: Unveiling How Chain-of-Thought Enhances Transformer Sample Efficiency

Oct 07, 2024

RNNs are not Transformers (Yet): The Key Bottleneck on In-context Retrieval

Feb 29, 2024
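
In-context retrieval is often probed with synthetic associative-recall data: a sequence of key-value pairs followed by a query key whose paired value must be recalled. The generator below is only a sketch of that generic task format; the vocabulary, sequence layout, and sizes are assumptions, not the paper's benchmark.

# Illustrative sketch of a synthetic associative-recall (in-context retrieval) task.
import random

def make_recall_example(num_pairs: int = 8, vocab: int = 64):
    keys = random.sample(range(vocab), num_pairs)            # distinct keys
    values = [random.randrange(vocab) for _ in range(num_pairs)]
    query_idx = random.randrange(num_pairs)
    # Sequence: k1 v1 k2 v2 ... kN vN [query key]; target is its paired value.
    seq = [tok for kv in zip(keys, values) for tok in kv] + [keys[query_idx]]
    return seq, values[query_idx]

seq, target = make_recall_example()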