
Rameswar Panda


PRISM: Demystifying Retention and Interaction in Mid-Training

Mar 17, 2026

Distilling to Hybrid Attention Models via KL-Guided Layer Selection

Dec 23, 2025

FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference

May 28, 2025

PaTH Attention: Position Encoding via Accumulating Householder Transformations

May 22, 2025

Do Larger Language Models Imply Better Reasoning? A Pretraining Scaling Law for Reasoning

Apr 04, 2025

Stick-breaking Attention

Oct 23, 2024

Calibrating Expressions of Certainty

Oct 06, 2024

SITAR: Semi-supervised Image Transformer for Action Recognition

Sep 04, 2024

Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler

Aug 23, 2024

Scaling Granite Code Models to 128K Context

Jul 18, 2024