“Crossbow: Scaling Deep Learning with Small Batch Sizes on Multi-GPU Servers” Summary

Ehsan Yousefzadeh-Asl-Miandoab
9 min read · Nov 11, 2021


Introduction

The current age could be called the age of A.I.: Deep Learning (DL) has become ubiquitous and can be seen in many applications around us. DL models are usually trained on servers with several Graphics Processing Units (GPUs). Current frameworks such as TensorFlow train models with parallel synchronous stochastic gradient descent (S-SGD): they process one batch of training data at a time, partitioned across the GPUs, and average the resulting gradients to obtain an updated global model. The common way to increase GPU utilization is to increase the batch size, but this lowers statistical efficiency. The usual remedy is to tune hyperparameters such as the learning rate, which is complex and model-specific. The mechanism proposed by Alexandros Koliousis et al. [1], named Crossbow, tries to increase hardware utilization regardless of the batch size. The main idea of Crossbow is independent learners, each of which trains on its own batch. The paper [1] introduces a new algorithm, synchronous model averaging (SMA), that adjusts the learners’ models based on the trajectory of a globally-consistent average model. The authors’ evaluation shows higher hardware utilization with Crossbow, as well as a 1.3–4X lower training time for deep models on an 8-GPU server compared to TensorFlow.

Background

Statistical Efficiency

It is a measure of the quality of an estimator, an experimental design, or a hypothesis testing procedure. In particular, a more efficient estimator, experiment, or test needs fewer observations than a less efficient one to achieve a given performance [2].

Hyperparameters and Parameters in Machine Learning (ML)

Hyperparameters are parameters whose values control the learning process (its speed and quality). Choosing them appropriately influences how well the learning algorithm adjusts the model parameters to correctly map inputs (features) to outputs (labels, targets) and thereby resemble intelligence. Note that they are not part of the final trained model. In the ML and Deep Learning (DL) realm, anything whose value is set before the training process and remains fixed until the end of training is a hyperparameter. The topology and size of a Neural Network (NN) are model hyperparameters, while the learning rate and the mini-batch size (next subsection) are algorithm hyperparameters.

On the other hand, parameters, or more precisely model parameters, are learned from the data during the training process as the algorithm tries to map inputs to outputs. Examples of parameters are the weights of linear or logistic regression models and the weights and biases of an NN.

Gradient Descent

Gradient means an inclined part of a road or railway; a slope

Descent means decreasing or falling.

Gradient Descent is an optimization algorithm used for finding the weights of ML models such as Neural Networks (NNs). It uses the errors that the model makes in its predictions on the training data to update the model so as to reduce those errors. Its goal is to find the model parameters (weights) that minimize the error of the model on the training data. It operates by changing the model, moving it along the gradient (slope) of the errors down toward a minimum error value [3].

The number of training examples used to calculate the error for a single model update can differ, and this is what distinguishes the variants. The three main types of gradient descent are stochastic, batch, and mini-batch.
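For illustration, here is a minimal sketch (hypothetical NumPy code, not taken from the paper) of the core update rule shared by all the variants: compute the gradient of the error on some training examples and move the weights a small step against it.

```python
import numpy as np

def gradient_step(w, X, y, lr=0.01):
    """One gradient-descent step for linear regression with an MSE loss."""
    preds = X @ w                   # model predictions on these examples
    error = preds - y               # prediction error
    grad = X.T @ error / len(y)     # gradient of 0.5 * mean squared error
    return w - lr * grad            # move against the gradient

# Toy data: 100 examples, 3 features, known true weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=100)

w = np.zeros(3)
for _ in range(500):
    w = gradient_step(w, X, y)
print(w)  # approaches true_w
```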

Stochastic Gradient Descent (SGD): a variant of gradient descent that calculates the error and updates the model for each example in the training dataset. Its advantages are that it is simple to understand, that it learns faster on some problems, and that the noisy updates may help the model avoid local minima. Its downsides are that it is computationally expensive, that the noisy updates make the model parameters jump around, and that settling on a final solution can be harder with such a noisy learning process [3].

Batch Gradient Descent (BGD): a variant of gradient descent that calculates the error for each example in the training dataset but only updates the model after all training examples have been processed. It is usually said that batch gradient descent performs model updates at the end of each training epoch (one pass through all training data is called a training epoch). Its advantages are that it is more computationally efficient than SGD and that the separation between error calculation and model updates lets the algorithm be implemented in a parallel way. Its downsides are that: (1) it requires the entire dataset to reside in memory and be available to the algorithm; (2) the training process becomes very slow for large datasets; (3) updating only at the end of each epoch adds the complexity of accumulating prediction errors across all training examples [3].

Mini-Batch Gradient Descent (MBGD): a variant of gradient descent that splits the training dataset into small batches, which are used to calculate the model error and update the model parameters. This variant strikes a balance between the stochastic and batch ones and is the most common implementation of gradient descent used in deep learning. Its advantages are that it is more computationally efficient than the stochastic variant and that it does not require the whole training dataset to be kept in memory [3]. The best batch size differs between applications, but a power of two (32, 64, 128, 256, …) that fits the memory of the GPU or CPU hardware is recommended. A batch size of 32 is recommended in several academic papers, although batch sizes as large as 64,000 are not uncommon [3].
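The three variants differ only in how many examples contribute to each update. The hypothetical sketch below (plain NumPy, not from the paper) makes that explicit: a batch size of 1 gives SGD, a batch size equal to the dataset size gives BGD, and anything in between is MBGD.

```python
import numpy as np

def train(X, y, batch_size, lr=0.01, epochs=20):
    """Gradient descent where batch_size selects the variant:
    1 -> stochastic, len(X) -> batch, in between -> mini-batch."""
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)                 # shuffle once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            error = X[idx] @ w - y[idx]
            grad = X[idx].T @ error / len(idx)     # gradient on this batch only
            w -= lr * grad                         # one update per batch
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(256, 3))
y = X @ np.array([2.0, -1.0, 0.5])
# Smaller batches perform more updates per epoch, so they need fewer epochs here.
print(train(X, y, batch_size=1, epochs=5))        # SGD: 256 updates per epoch
print(train(X, y, batch_size=32, epochs=50))      # MBGD: 8 updates per epoch
print(train(X, y, batch_size=256, epochs=400))    # BGD: 1 update per epoch
```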

Parallel Synchronous Gradient Descent (S-SGD)

Parallel training approaches distribute the gradient computation across multiple GPUs but differ in how they synchronize the gradients. The most widespread training algorithm is parallel synchronous SGD (S-SGD) [4]. It requires all GPUs to have a consistent view of the n-th version of the model before the (n+1)-th iteration starts: (1) at each iteration, S-SGD partitions a batch equally across GPUs; (2) each GPU computes a partial gradient from its batch partition and the latest model version; (3) the GPUs then coordinate to merge the partial gradients into an aggregate gradient before the next iteration. The following figure shows the execution of S-SGD on a server with two GPUs [1].

[Figure from [1]]
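To make the three steps concrete, here is a hypothetical single-process sketch of S-SGD (plain NumPy, with list entries standing in for GPUs): the batch is partitioned, each “GPU” computes a partial gradient from its partition and the same model version, and the partial gradients are averaged into one aggregate update that every replica applies, so all replicas stay identical.

```python
import numpy as np

def partial_gradient(w, X_part, y_part):
    """Partial gradient of the MSE loss on one partition of the batch."""
    return X_part.T @ (X_part @ w - y_part) / len(y_part)

def ssgd_iteration(w, X_batch, y_batch, num_gpus=2, lr=0.01):
    """One S-SGD iteration: partition, compute partial gradients, average."""
    X_parts = np.array_split(X_batch, num_gpus)    # (1) partition the batch
    y_parts = np.array_split(y_batch, num_gpus)
    grads = [partial_gradient(w, Xp, yp)           # (2) one partial gradient per GPU,
             for Xp, yp in zip(X_parts, y_parts)]  #     all from the same model version
    agg = np.mean(grads, axis=0)                   # (3) merge into an aggregate gradient
    return w - lr * agg                            # every replica applies the same update

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
y = X @ np.array([2.0, -1.0, 0.5])
w = np.zeros(3)
for _ in range(1000):
    w = ssgd_iteration(w, X, y)
print(w)
```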

Challenges in Scaling Training

The batch size is a critical parameter for training with parallel S-SGD. If the batch is too small, the GPUs are not fully utilized because the communication overhead dominates. In parallel training, the aggregate batch size must therefore increase linearly with the number of GPUs. The following figure gives an insight into how hardware efficiency (measured as epoch time) changes as the batch size increases, and how statistical efficiency changes along with it [1].

[Figure from [1]]

The reasons why TensorFlow needs more epochs to converge when the batch size increases are:

  1. With large and redundant training datasets, small batches ensure faster training because only a few batches are sufficient to capture the dimensionality of the problem space and converge quickly to good solutions.
  2. A small batch size leads to noisier gradient updates, which widen the exploration of the loss landscape, making it more likely to find better solutions with a higher test accuracy.

A typical solution to mitigate this issue is hyperparameter tuning. However, this requires a time-consuming, model-specific methodology, which is often beyond the reach of non-experts and cannot easily be applied to new models or hardware architectures. Note that in some cases, even with hyperparameter tuning, it is hard to scale training across multiple GPUs. A recent study [5] from Google Brain shows that convolutional networks exhibit only limited scaling with batch sizes larger than 64, and that for recurrent neural networks, e.g., Long Short-Term Memory (LSTM), the threshold seems to be even lower (16). A general approach for scaling S-SGD on multiple GPUs therefore remains an open challenge due to the conflicting impact of large batch sizes on hardware and statistical efficiency.

The Proposed Mechanism

Alexandros Koliousis et al.’s approach [1] relies on the concept of a learner, which independently trains a model replica for a given input batch. To reach high statistical efficiency, they introduce a new algorithm named Synchronous Model Averaging (SMA) that consolidates the model updates computed by many learners.

Synchronous Model Averaging with Learners

Independent Learners Concept

Parallel S-SGD imposes tight synchronization when processing partitioned batches. The gradients computed from all model replicas are aggregated, and the result is incorporated by all replicas. After each iteration, before the gradient computation for the next batch begins, all replicas are therefore identical [1].

A learner is an entity that trains a single model replica independently with a given batch size. The point of this abstraction is that it decouples the batch size from the degree of parallelism in the learning process. Each learner processes a batch of small size to achieve high statistical efficiency, while the number of learners offers a way to achieve high hardware efficiency. The following figure shows two learners, each processing its assigned batch. (1) A learner computes a gradient and immediately updates its replica based on that gradient. (2) It then continues with the gradient computation for the next batch. (3) To prevent learners from diverging, each learner also applies a correction to its model, which is incorporated synchronously as part of the next update of the replica.

[Figure from [1]]
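A minimal sketch of the learner abstraction (hypothetical code, not the authors’ implementation, using a linear model with an MSE loss) could look as follows; the correction term is assumed to be filled in by the synchronization mechanism described next.

```python
import numpy as np

class Learner:
    """A learner trains one model replica independently on its own small batches."""

    def __init__(self, w_init):
        self.w = w_init.copy()                    # private model replica
        self.correction = np.zeros_like(w_init)   # set by the synchronization step

    def step(self, X_batch, y_batch, lr=0.01):
        # (1) compute the gradient on this learner's batch and update the replica
        #     immediately, folding in the correction from the last synchronization
        #     point so the replica does not diverge (3)
        grad = X_batch.T @ (X_batch @ self.w - y_batch) / len(y_batch)
        self.w -= lr * grad + self.correction
        self.correction = np.zeros_like(self.w)   # consumed until the next sync
        # (2) the learner is now free to fetch and process its next batch

# Two learners sharing an initial model but training on independent batches:
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))
y = X @ np.array([2.0, -1.0, 0.5])
learners = [Learner(np.zeros(3)), Learner(np.zeros(3))]
learners[0].step(X[:16], y[:16])   # each learner advances on its own batch
learners[1].step(X[16:], y[16:])
```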

SMA Algorithm

SMA synchronizes the learners’ local models and is based on model averaging [6, 7, 8]. It consolidates the model updates of the learners by maintaining a central average model. The following figure shows how replicas w1 and w2 are trained by learners with independent batches. Once the learners have computed their gradients and updated their local replicas, the updates are applied to a central average model [1].

[Figure from [1]]
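The sketch below illustrates the idea in the style of elastic model averaging: each learner’s replica receives a correction proportional to its difference from the central average model, and the central model absorbs those differences. The full SMA algorithm in the paper [1] differs in its details; this is only a simplified, hypothetical illustration on a toy linear model.

```python
import numpy as np

def sma_round(replicas, z, grads, lr=0.01, alpha=0.1):
    """One synchronization round in the spirit of SMA / model averaging."""
    new_replicas = []
    for w, g in zip(replicas, grads):
        correction = alpha * (w - z)                  # difference to the central model
        new_replicas.append(w - lr * g - correction)  # local update + correction
        z = z + correction                            # central average model absorbs it
    return new_replicas, z

# Toy driver: two learners, each drawing its own independent small batch.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
y = X @ np.array([2.0, -1.0, 0.5])
z = np.zeros(3)                          # central average model
replicas = [np.zeros(3), np.zeros(3)]    # one replica per learner
for _ in range(2000):
    batches = [rng.choice(len(X), 16) for _ in replicas]
    grads = [X[b].T @ (X[b] @ w - y[b]) / 16
             for b, w in zip(batches, replicas)]
    replicas, z = sma_round(replicas, z, grads)
print(z)  # the central average model approaches the true weights
```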

Because of the small batch size used, the hardware would remain underutilized with a single learner, so several learners per GPU can be used. This is possible because learners decouple the processing of a batch from the available hardware, permitting the execution of multiple learners per GPU. When many learners reside on a GPU, the on-chip shared memory can provide a significant speedup for synchronizing them. The following figure shows the synchronization hierarchy that the authors of [1] propose. To synchronize the learners executing on a single GPU, one learner is chosen to manage a reference model. Each learner then computes the difference between its model replica and the local reference model, and the computed difference is applied to the respective models. At the global level, the SMA algorithm is executed: it uses one of the local reference models as the central average model, and all other reference models (one per GPU) are the replicas incorporated into the model averaging process.

[Figure from [1]]
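A hypothetical sketch of this two-level hierarchy (again with plain NumPy arrays standing in for model replicas, and without modelling the on-chip shared memory the authors exploit) could be structured like this:

```python
import numpy as np

def sync_group(models, reference, alpha=0.1):
    """Shared helper: each model moves towards the reference and the
    reference absorbs the differences. Used at both levels of the hierarchy."""
    for i, m in enumerate(models):
        diff = alpha * (m - reference)
        models[i] = m - diff
        reference = reference + diff
    return models, reference

def hierarchical_sync(per_gpu_replicas, per_gpu_reference):
    # Level 1: on every GPU, synchronize its learners against the GPU's
    # local reference model (fast, intra-GPU).
    for g in range(len(per_gpu_replicas)):
        per_gpu_replicas[g], per_gpu_reference[g] = sync_group(
            per_gpu_replicas[g], per_gpu_reference[g])
    # Level 2: run the global SMA step across GPUs. One reference model acts
    # as the central average model; the remaining references are its replicas.
    central = per_gpu_reference[0]
    others, central = sync_group(per_gpu_reference[1:], central)
    return per_gpu_replicas, [central] + others
```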

The authors [1] suggest tuning the number of learners per GPU at runtime based on the training throughput. By observing the number of batches processed per second, they detect under- and over-utilization of a GPU.
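The sketch below is a hypothetical illustration of such an online policy rather than the authors’ actual tuner: throughput is sampled over a time window, another learner is added while throughput keeps improving, and the last learner is removed when throughput drops.

```python
class LearnerAutotuner:
    """Hypothetical sketch: pick the number of learners per GPU from throughput."""

    def __init__(self, initial_learners=1):
        self.num_learners = initial_learners
        self.best_throughput = 0.0

    def on_window(self, batches_done, window_seconds):
        """Called periodically with the number of batches trained in the window."""
        throughput = batches_done / window_seconds       # batches per second
        if throughput > 1.05 * self.best_throughput:
            # GPU still under-utilized: the current setting improved throughput,
            # so record it and try one more learner.
            self.best_throughput = throughput
            self.num_learners += 1
        elif throughput < 0.95 * self.best_throughput and self.num_learners > 1:
            # GPU over-utilized: adding the last learner hurt throughput.
            self.num_learners -= 1
        return self.num_learners
```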

Crossbow System Design

Crossbow’s design exploits GPU capabilities to reach higher performance and utilization. All kernel calls to the GPU are non-blocking, so the issuing thread returns immediately to schedule the next task, which improves efficiency. The authors also use streams to obtain more concurrency, launching multiple kernels at a time for learning and synchronization, as the following figure illustrates.

[Figure from [1]]
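As a rough Python analogue of this idea (the paper’s implementation uses CUDA directly, not CuPy), the hypothetical sketch below enqueues work on two CUDA streams: each launch returns immediately, so the host can keep scheduling tasks while the GPU may overlap independent work. It assumes a CUDA-capable GPU with CuPy installed.

```python
import cupy as cp

# Two independent pieces of work, e.g. one learner's computation and a
# synchronization task, each placed on its own CUDA stream.
a = cp.ones((2048, 2048), dtype=cp.float32)
b = cp.ones((2048, 2048), dtype=cp.float32)

streams = [cp.cuda.Stream(non_blocking=True) for _ in range(2)]
results = [None, None]

for i, stream in enumerate(streams):
    with stream:            # kernels launched inside are queued on this stream
        results[i] = a @ b  # the launch is non-blocking: control returns
                            # immediately so the next task can be scheduled

for stream in streams:
    stream.synchronize()    # wait until the queued kernels have finished
print(results[0].sum(), results[1].sum())
```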

Benchmarks Used to Evaluate the Crossbow Mechanism

They use TensorFlow’s suite of benchmarks, as the following table shows.

[Table from [1]]

Conclusion

Future Reading

  • H. Zhu et al., “Benchmarking and Analyzing Deep Neural Network Training,” 2018 IEEE International Symposium on Workload Characterization (IISWC), 2018, pp. 88–100, DOI: 10.1109/IISWC.2018.8573476.

References

[1] A. Koliousis, P. Watcharapichat, M. Weidlich, L. Mai, P. Costa, and P. Pietzuch. “Crossbow: Scaling Deep Learning with Small Batch Sizes on Multi-GPU Servers.” Proceedings of the VLDB Endowment 12: 1399–1412, 2019. DOI: 10.14778/3342263.3342276.

[2] Wikipedia contributors. “Efficiency (statistics).” Wikipedia, The Free Encyclopedia, 17 Jul. 2021. Web. 10 Nov. 2021.

[3] Machine Learning Mastery, accessed 10 November 2021, https://bit.ly/3bW5ZBt.

[4] A. Krizhevsky. “One Weird Trick for Parallelizing Convolutional Neural Networks.” arXiv:1404.5997, 2014.

[5] C. J. Shallue et al. “Measuring the Effects of Data Parallelism on Neural Network Training.” arXiv:1811.03600, 2019.

[6] B. Polyak. “New Stochastic Approximation Type Procedures.” Avtomatika i Telemekhanika 7: 98–107, 1990.

[7] B. Polyak and A. Juditsky. “Acceleration of Stochastic Approximation by Averaging.” SIAM Journal on Control and Optimization 30: 838–855, 1992. DOI: 10.1137/0330046.

[8] D. Ruppert. “Efficient Estimations from a Slowly Convergent Robbins-Monro Process.” 1988.
