“Gandiva: Introspective Cluster Scheduling for Deep Learning” Summary

Ehsan Yousefzadeh-Asl-Miandoab
4 min read · Nov 22, 2021

Introduction

Nowadays, there is a significantly growing trend toward Artificial Intelligence (AI), especially Machine Learning (ML) and Deep Learning (DL). DL applications (e.g., voice and image recognition) appear in the services offered by technology leaders like Google, and they have a remarkable influence on businesses. Hence, DL has become a vital workload in cloud data centers. At the same time, DL is compute-hungry and therefore relies on powerful GPUs; renting a GPU-powered virtual machine (VM) costs roughly 10X as much as a regular one. Cloud companies manage their clusters with cluster schedulers to ensure efficient utilization of these GPUs. Unfortunately, the common practice at the time of publication (2018) was to use traditional cluster schedulers (Apache YARN, Kubernetes), which were designed for big-data jobs. The downside of these schedulers for deep learning training (DLT) is that they assign GPUs to a job exclusively at startup and hold them until the job completes, queuing all other jobs in the meantime. They also treat jobs as black boxes and do not consider their behavior or characteristics. Gandiva [1] proposes a cluster scheduler that is aware of DLT job characteristics in order to achieve lower latency and higher cluster efficiency (utilization).

Deep Learning Training (DLT) Challenges

DLT jobs are feedback-driven exploration: hyperparameter search, whether manual or automated, is inherently a trial-and-error process. Users typically launch several configurations of a job and use early feedback to decide which subset to prioritize or kill. Traditional schedulers cause head-of-line blocking, because their fixed, exclusive scheduling policies let long-running jobs (training jobs can take hours to months) hold GPUs until completion while the many jobs waiting for early feedback sit in the queue (HIGH LATENCY). As a result, users resort to reserved or over-provisioned GPUs, which reduces cluster efficiency. Moreover, DLT jobs differ in memory usage, GPU core utilization, sensitivity to interconnect bandwidth, and interference with other jobs: some perform much better when their GPUs are grouped on tightly coupled hardware, while others (e.g., data-parallel training jobs) may be far less sensitive to such grouping. A traditional scheduler that treats every job as a black box and assigns GPUs with fixed, exclusive policies therefore achieves sub-optimal efficiency (LOW EFFICIENCY). The toy calculation below illustrates the latency effect of exclusive GPU assignment.
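To make the head-of-line-blocking argument concrete, here is a toy back-of-the-envelope sketch (my own numbers, not from the paper) comparing how long four identical jobs on one GPU must wait for early feedback under exclusive FIFO scheduling versus simple round-robin time-slicing:

```python
# Toy illustration (my own numbers, not from the paper): time until each of
# four identical jobs produces early feedback on a single GPU, under exclusive
# FIFO scheduling versus 1-minute round-robin time-slicing.

JOB_LENGTH = 100      # minutes of GPU time each job needs in total
FEEDBACK_AT = 5       # minutes of GPU time needed to produce early feedback
NUM_JOBS = 4

# Exclusive FIFO: job i must wait for all earlier jobs to run to completion.
fifo = [i * JOB_LENGTH + FEEDBACK_AT for i in range(NUM_JOBS)]

# Round-robin with 1-minute slices: job i receives its k-th minute of GPU time
# at wall-clock minute (k - 1) * NUM_JOBS + i + 1.
round_robin = [(FEEDBACK_AT - 1) * NUM_JOBS + i + 1 for i in range(NUM_JOBS)]

print("FIFO feedback times (min):        ", fifo)          # [5, 105, 205, 305]
print("Time-sliced feedback times (min): ", round_robin)   # [17, 18, 19, 20]
```

With exclusive assignment, the last job waits over five hours for its first useful signal; with time-slicing, every job reports back within about twenty minutes, at the cost of each job finishing later.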

What Does Gandiva Propose?

Gandiva, the mechanism proposed by Wencong Xiao et al. [1], addresses these problems by exploiting the “intra-job predictability” of DL jobs. A DLT job is composed of many similar, clearly separated mini-batch iterations. For instance, the GPU memory usage of a DLT job follows a cyclic pattern aligned with mini-batch boundaries, often with more than a 10X difference in GPU memory usage within a single mini-batch, as shown in the following figure. Gandiva uses this predictability to reduce the amount of data that must be copied during its suspend-resume mechanism.

(Figure from [1]: GPU memory usage of a DLT job over time, showing the cyclic pattern aligned with mini-batch boundaries.)
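To make the suspend-resume idea concrete, here is a minimal sketch assuming a PyTorch-style training loop; the flag name, helper function, and overall structure are my own illustration, not Gandiva's actual implementation. The point is simply that deferring suspension to a mini-batch boundary means only the small persistent state needs to be copied off the GPU:

```python
import threading

# Hypothetical signal set by the scheduler when it wants to suspend this job.
suspend_requested = threading.Event()

def snapshot_to_cpu(model):
    """Copy the persistent state (model weights; optimizer buffers would be
    handled the same way) to host memory. At a mini-batch boundary the large,
    short-lived activation tensors are already gone, so this copy is cheap."""
    return {name: t.detach().cpu() for name, t in model.state_dict().items()}

def train(model, optimizer, loss_fn, data_loader):
    for inputs, targets in data_loader:
        # Forward and backward passes allocate large temporary activation
        # tensors, which is why GPU memory peaks inside a mini-batch.
        optimizer.zero_grad()
        loss = loss_fn(model(inputs.cuda()), targets.cuda())
        loss.backward()
        optimizer.step()

        # Check the suspend flag only *between* mini-batches: memory usage is
        # at its cyclic minimum, so far less state has to leave the GPU.
        if suspend_requested.is_set():
            return snapshot_to_cpu(model)
```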

Gandiva uses this cyclic predictability to implement efficient, application-aware time-slicing. It redefines the unit of scheduling from a whole job to automatically partitioned micro-tasks (about one minute each), which lets the cluster over-subscribe GPUs with DLT jobs and give all of them early feedback through time-slicing. Gandiva also uses the predictability for profile-driven self-examination (introspection): it monitors the mini-batch progress rate and uses it to improve cluster efficiency. It packs multiple jobs onto the same GPU only when they have low memory and compute utilization, dynamically migrates communication-intensive jobs onto more tightly coupled GPUs, and grows the degree of parallelism of a job to use spare resources, shrinking it again when those spare resources disappear. The introspection policy can rely on trial-and-error because of this predictability and the limited state space of options the authors consider. In addition, Gandiva exposes these mechanisms as APIs that can benefit any DLT scheduling policy. It is implemented by modifying TensorFlow and PyTorch to provide the necessary new primitives to the scheduler and by building an initial scheduling policy manager on top of Kubernetes and Docker containers. A rough sketch of the packing introspection follows below.
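As an illustration of the introspective packing decision, here is a minimal sketch; the thresholds, data structures, and the exact comparison against time-slicing are my own assumptions rather than Gandiva's policy code. The idea is to pack two jobs only when both are light, then check the observed mini-batch rates and back off if interference made packing worse than time-slicing:

```python
from dataclasses import dataclass

@dataclass
class JobProfile:
    name: str
    gpu_mem_gb: float        # steady-state GPU memory demand
    gpu_util: float          # fraction of GPU compute used when running alone
    minibatch_rate: float    # mini-batches/sec when running alone

GPU_MEM_GB = 16.0
UTIL_THRESHOLD = 0.5         # "low utilization" cut-off (illustrative value)

def try_pack(a: JobProfile, b: JobProfile) -> bool:
    """Pack two jobs on one GPU only if they fit in memory and both are light."""
    fits = a.gpu_mem_gb + b.gpu_mem_gb <= GPU_MEM_GB
    both_light = a.gpu_util < UTIL_THRESHOLD and b.gpu_util < UTIL_THRESHOLD
    return fits and both_light

def keep_packed(a: JobProfile, b: JobProfile,
                packed_rate_a: float, packed_rate_b: float) -> bool:
    """Introspection step: keep the packing only if total mini-batch progress
    beats what time-slicing the two jobs would deliver (half rate each)."""
    timesliced = 0.5 * a.minibatch_rate + 0.5 * b.minibatch_rate
    packed = packed_rate_a + packed_rate_b
    return packed > timesliced

job1 = JobProfile("small-rnn", gpu_mem_gb=3.0, gpu_util=0.3, minibatch_rate=40.0)
job2 = JobProfile("small-cnn", gpu_mem_gb=4.0, gpu_util=0.4, minibatch_rate=25.0)

if try_pack(job1, job2):
    # ...run both on the same GPU for a while, measure their actual rates...
    if keep_packed(job1, job2, packed_rate_a=30.0, packed_rate_b=18.0):
        print("keep jobs packed on one GPU")
    else:
        print("unpack: interference made packing slower than time-slicing")
```

The trial-and-error flavor is visible here: the scheduler commits to a packing tentatively, observes real mini-batch progress, and reverts if the measurement contradicts the prediction.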


The evaluation of Gandiva on a cluster of 180 heterogeneous GPUs shows a 26% improvement in efficiency and a 77% improvement in latency.

Conclusion

Cloud companies use cluster schedulers to increase the utilization of their GPUs. This paper [1] presents a cluster scheduling framework for deep learning that provides a set of efficient, low-level system primitives such as time-slicing, migration, intra-job elasticity, and dynamic priority. The framework addresses the problems discussed above (high latency, low utilization) in scheduling DLT jobs.

Future Reading

  • V. Sze, Y. Chen, T. Yang, and J. S. Emer, “Efficient Processing of Deep Neural Networks: A Tutorial and Survey,” in Proceedings of the IEEE, vol. 105, no. 12, pp. 2295–2329, Dec. 2017, DOI: 10.1109/JPROC.2017.2761740.

References

[1] Wencong Xiao et al. “Gandiva: introspective cluster scheduling for deep learning.” In Proceedings of the 13th USENIX conference on Operating Systems Design and Implementation (OSDI’18), 595–610 (2018).
