
Mark C. Jeffrey

CPR: Understanding and Improving Failure Tolerant Training for Deep Learning Recommendation with Partial Recovery

Nov 05, 2020
Kiwan Maeng, Shivam Bharuka, Isabel Gao, Mark C. Jeffrey, Vikram Saraph, Bor-Yiing Su, Caroline Trippel, Jiyan Yang, Mike Rabbat, Brandon Lucia, Carole-Jean Wu
