
Fei Dai

WRHT: Efficient All-reduce for Distributed DNN Training in Optical Interconnect System

Jul 22, 2022
Fei Dai, Yawen Chen, Zhiyi Huang, Haibo Zhang, Fangfang Zhang

Communication efficiency plays an important role in accelerating the distributed training of Deep Neural Networks (DNNs). All-reduce is the key communication primitive used to reduce model parameters in distributed DNN training. Most existing all-reduce algorithms are designed for traditional electrical interconnect systems, which cannot meet the communication requirements of distributed training for large DNNs. One promising alternative to electrical interconnect is optical interconnect, which can provide high bandwidth, low transmission delay, and low power cost. We propose an efficient scheme called WRHT (Wavelength Reused Hierarchical Tree) for implementing the all-reduce operation in optical interconnect systems. WRHT takes advantage of WDM (Wavelength Division Multiplexing) to reduce the communication time of distributed data-parallel DNN training. We further derive the minimum number of communication steps and the communication time needed to realize all-reduce using WRHT. Simulation results show that WRHT reduces communication time by 75.59%, 49.25%, and 70.1%, respectively, compared with three traditional all-reduce algorithms simulated in an optical interconnect system. Simulation results also show that WRHT reduces the communication time of the all-reduce operation by 86.69% and 84.71% compared with two existing all-reduce algorithms in an electrical interconnect system.

* This paper has been submitted to GLOBECOM 2022

Accelerating Fully Connected Neural Network on Optical Network-on-Chip (ONoC)

Sep 30, 2021
Fei Dai, Yawen Chen, Haibo Zhang, Zhiyi Huang

The Fully Connected Neural Network (FCNN) is a class of Artificial Neural Networks widely used in computer science and engineering, but its training process can take a long time with large datasets on existing many-core systems. Optical Network-on-Chip (ONoC), an emerging chip-scale optical interconnection technology, has great potential to accelerate FCNN training through low transmission delay, low power consumption, and high throughput. However, existing methods based on Electrical Network-on-Chip (ENoC) are not suited to ONoC because of ONoC's unique properties. In this paper, we propose a fine-grained parallel computing model for accelerating FCNN training on ONoC and derive the optimal number of cores for each execution stage, with the objective of minimizing the total time to complete one epoch of FCNN training. To allocate the optimal number of cores for each execution stage, we present three mapping strategies and compare their advantages and disadvantages in terms of hotspot level, memory requirement, and state transitions. Simulation results show that the average prediction error for the optimal number of cores in NN benchmarks is within 2.3%. Further extensive simulations demonstrate that our proposed scheme reduces FCNN training time by 22.28% and 4.91% on average compared with traditional parallel computing methods that allocate a fixed number of cores or as many cores as possible, respectively. Compared with ENoC, simulation results show that under batch sizes of 64 and 128, ONoC reduces training time by 21.02% and 12.95% on average and saves 47.85% and 39.27% of energy, respectively.

* 14 pages, 10 figures. This paper is in its second round of review at IEEE Transactions on Computers