Joel Hestness

Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training

May 19, 2025

Don't be lazy: CompleteP enables compute-efficient deep transformers

May 02, 2025

Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs

Feb 21, 2025

Crystal: Illuminating LLM Abilities on Language and Code

Nov 06, 2024

Normalization Layer Per-Example Gradients are Sufficient to Predict Gradient Noise Scale in Transformers

Nov 01, 2024

Bilingual Adaptation of Monolingual Foundation Models

Jul 13, 2024

Sparse maximal update parameterization: A holistic approach to sparse training dynamics

May 24, 2024

MediSwift: Efficient Sparse Pre-trained Biomedical Language Models

Mar 01, 2024

Position Interpolation Improves ALiBi Extrapolation

Oct 18, 2023

BTLM-3B-8K: 7B Parameter Performance in a 3B Parameter Model

Sep 20, 2023