Rameswar Panda

Richard

Variable-Width Transformers

Jun 16, 2026
Via arXiv

CodeAlchemy: Synthetic Code Rewriting at Scale

Jun 08, 2026
Via arXiv

Dynamic Short Convolutions Improve Transformers

Jun 02, 2026
Via arXiv

PRISM: Demystifying Retention and Interaction in Mid-Training

Mar 17, 2026
Via arXiv

Distilling to Hybrid Attention Models via KL-Guided Layer Selection

Dec 23, 2025
Via arXiv

FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference

May 28, 2025
Via arXiv

PaTH Attention: Position Encoding via Accumulating Householder Transformations

May 22, 2025
Via arXiv

Do Larger Language Models Imply Better Reasoning? A Pretraining Scaling Law for Reasoning

Apr 04, 2025
Via arXiv

Stick-breaking Attention

Oct 23, 2024
Via arXiv

Calibrating Expressions of Certainty

Oct 06, 2024
Via arXiv