Zac Hatfield-Dodds

Toy Models of Superposition

Sep 21, 2022

Scaling Laws and Interpretability of Learning from Repeated Data

May 21, 2022

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Apr 12, 2022

A General Language Assistant as a Laboratory for Alignment

Dec 09, 2021