There is a trend towards using very large deep neural networks (DNNs) to improve the accuracy of complex machine learning tasks. However, the size of DNN models that can be explored today is limited by the amount of GPU device memory. This paper presents Tofu, a system for partitioning very large DNN models across multiple GPU devices. Tofu is designed for a tensor-based dataflow system: for each operator in the dataflow graph, it partitions the operator's input and output tensors and parallelizes its execution across workers. Tofu automatically discovers how each operator can be partitioned by analyzing its semantics, expressed in a simple specification language. Tofu then uses a search algorithm based on dynamic programming to determine the best partition strategy for each operator in the entire dataflow graph. Our experiments on an 8-GPU machine show that Tofu enables the training of very large CNN and RNN models. It also achieves better performance than alternative approaches for training very large models on multiple GPUs.
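To make the idea of a dynamic-programming search over partition strategies concrete, the following is a minimal sketch, not Tofu's actual algorithm. It assumes a linear chain of operators rather than a general dataflow graph, and the function name `best_partition_plan`, the strategy labels `"row"` and `"col"`, and all cost values are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Tuple


def best_partition_plan(
    op_costs: List[Dict[str, float]],
    convert_cost: Dict[Tuple[str, str], float],
) -> Tuple[float, List[str]]:
    """Choose one partition strategy per operator in a linear chain.

    op_costs[i][s]       -- assumed cost of running operator i under strategy s
    convert_cost[(a, b)] -- assumed cost of re-partitioning a tensor from
                            layout a to layout b between adjacent operators
    Returns the minimum total cost and the strategy chosen for each operator.
    """
    # best[s] = (cheapest cost of the chain so far if it ends in strategy s, plan)
    best = {s: (c, [s]) for s, c in op_costs[0].items()}
    for costs in op_costs[1:]:
        nxt = {}
        for s, c in costs.items():
            # Cheapest way to arrive at strategy s from any previous strategy.
            prev_cost, prev_plan = min(
                ((pc + convert_cost[(p, s)], plan) for p, (pc, plan) in best.items()),
                key=lambda t: t[0],
            )
            nxt[s] = (prev_cost + c, prev_plan + [s])
        best = nxt
    return min(best.values(), key=lambda t: t[0])


# Hypothetical three-operator chain with two candidate strategies each.
op_costs = [
    {"row": 0.0, "col": 4.0},
    {"row": 6.0, "col": 1.0},
    {"row": 2.0, "col": 2.0},
]
convert_cost = {
    ("row", "row"): 0.0, ("row", "col"): 3.0,
    ("col", "row"): 3.0, ("col", "col"): 0.0,
}
total_cost, plan = best_partition_plan(op_costs, convert_cost)
print(total_cost, plan)
```

The sketch keeps only the cheapest plan per ending strategy at each step, so its running time grows linearly with the number of operators instead of exponentially with the number of strategy combinations.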
Deep learning systems have become vital tools across many fields, but increasing model sizes mean that training must be accelerated for such systems to remain useful. Current systems like TensorFlow and MXNet focus on one specific parallelization strategy, data parallelism, which requires large training batch sizes in order to scale. We cast the problem of finding the best parallelization strategy as the problem of finding the tiling that partitions tensors with the least overall communication, and we propose an algorithm that finds the optimal tiling. The resulting parallelization solution is a hybrid of data parallelism and model parallelism. We build SoyBean, a system that performs this automatic parallelization and can act as a backend for TensorFlow, MXNet, and other frameworks. SoyBean automatically transforms a serial dataflow graph captured by an existing deep learning system frontend into a parallel dataflow graph based on the optimal tiling it has found. Our evaluations show that SoyBean is 1.5x-4x faster than pure data parallelism for AlexNet and VGG.
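As a rough illustration of why the tiling with the least communication can differ from layer to layer, the sketch below compares two tilings of a single fully connected layer under a deliberately simplified cost model. The model is an assumption made for this example and is not SoyBean's cost function: it counts only the tensor elements that must be exchanged per training step and ignores the number of workers and the collective algorithm used. The function names are hypothetical.

```python
def data_parallel_comm(batch: int, d_in: int, d_out: int) -> int:
    """Batch dimension is split across workers; weights are replicated.

    Assumed cost: the weight gradient of shape [d_in, d_out] must be
    summed across workers after the backward pass.
    """
    return d_in * d_out


def model_parallel_comm(batch: int, d_in: int, d_out: int) -> int:
    """Weight matrix is split along d_out; the input is replicated.

    Assumed cost: every worker needs the full input [batch, d_in] in the
    forward pass, and the partial input gradients [batch, d_in] must be
    summed in the backward pass. d_out does not appear in this toy model;
    it is kept only so both functions share one signature.
    """
    return 2 * batch * d_in


def cheaper_tiling(batch: int, d_in: int, d_out: int) -> str:
    """Return which of the two tilings exchanges fewer elements."""
    dp = data_parallel_comm(batch, d_in, d_out)
    mp = model_parallel_comm(batch, d_in, d_out)
    return "data" if dp <= mp else "model"


# A wide layer with a modest batch favors splitting the weights,
# while a narrow layer with the same batch favors splitting the batch.
print(cheaper_tiling(batch=256, d_in=4096, d_out=4096))
print(cheaper_tiling(batch=256, d_in=64, d_out=64))
```

Under these assumptions, neither tiling wins for every layer shape, which is the intuition behind choosing a tiling per tensor and ending up with a hybrid of data and model parallelism.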