Bor-Yiing Su

CPR: Understanding and Improving Failure Tolerant Training for Deep Learning Recommendation with Partial Recovery

Nov 05, 2020
Kiwan Maeng, Shivam Bharuka, Isabel Gao, Mark C. Jeffrey, Vikram Saraph, Bor-Yiing Su, Caroline Trippel, Jiyan Yang, Mike Rabbat, Brandon Lucia, Carole-Jean Wu

ShadowSync: Performing Synchronization in the Background for Highly Scalable Distributed Training

Mar 07, 2020
Qinqing Zheng, Bor-Yiing Su, Jiyan Yang, Alisson Azzolini, Qiang Wu, Ou Jin, Shri Karandikar, Hagay Lupesko, Liang Xiong, Eric Zhou
