“Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications” Summary

Ehsan Yousefzadeh-Asl-Miandoab
5 min read · Feb 17, 2022

Introduction

GPU computing has been popular since GPUs proved their significant performance gains over traditional CPUs for Deep Learning (DL) workloads. However, GPUs suffer from underutilization on these workloads, because of the nature of DL jobs and GPUs' lack of support for fine-grained sharing primitives. The mechanism proposed by Yu et al. [1] enables fast job switching (and thereby time-sharing and preemption) and memory sharing, achieving fine-grained GPU sharing among multiple DL applications. The key idea is to schedule at iteration granularity and to keep the persistent memory of DL models resident on the GPU. The results show significant improvements in completion time and GPU utilization.

Observations and Research Incentives

  1. The scheduling granularity of GPUs is one whole GPU per application.
  2. This coarse granularity leads to the Head-Of-Line (HOL) blocking problem, since DL applications, especially training jobs, are usually long-running.
  3. DL jobs usually do not fully utilize GPUs. The authors' observations show that GPU memory utilization is often less than 50%.
  4. Automatic hyperparameter tuning of DL models generates many training jobs in parallel, many of which are killed early because of their poor quality.
  5. Model and framework-internal memory allocations are significantly smaller than the space used as scratch-pad (ephemeral) memory; a back-of-the-envelope sketch follows this list.
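To make observation (5) concrete, here is a back-of-the-envelope sketch in Python. The numbers are illustrative assumptions of mine, not measurements from the paper: the persistent memory of a ResNet-50-sized model is a few hundred megabytes, while an iteration's ephemeral scratch space can run into gigabytes.

```python
# Illustrative (made-up) numbers: persistent memory (weights + optimizer
# state) is small compared to the ephemeral scratch space an iteration
# needs, which is what makes co-locating several jobs feasible.

BYTES_PER_PARAM = 4  # fp32

def persistent_mb(num_params, optimizer_copies=2):
    # Weights plus optimizer state (e.g., two extra copies for Adam).
    return num_params * BYTES_PER_PARAM * (1 + optimizer_copies) / 2**20

persistent = persistent_mb(25.6e6)   # a ResNet-50-sized model: ~293 MB
peak_ephemeral_mb = 6_000            # assumed peak scratch usage

print(f"persistent ~= {persistent:.0f} MB, "
      f"ephemeral ~= {peak_ephemeral_mb} MB "
      f"({peak_ephemeral_mb / persistent:.0f}x larger)")
```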

Proposed Mechanism

The following figure shows the architectural overview of Salus. When a DL job is created by a user, the Salus adaptor creates a corresponding session (1a), and the computation graph of the DL job is transferred to Salus. The session then proceeds to request a lane from the memory manager (1b); depending on the jobs currently in the system, this request can block and the session will be queued. During the job's runtime, whether training or inference, iterations are generated by the user script and forwarded to the corresponding session in Salus (2a). They are scheduled onto their associated GPU lanes by the iteration scheduler (2b) and sent to the GPU for execution.

Note that the Salus execution service achieves GPU sharing via iteration-granularity scheduling of DL jobs.

[Figure: architectural overview of Salus (source: [1])]
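A minimal sketch of this control flow, assuming simplified class and method names of my own (Session, MemoryManager, request_lane) rather than the paper's actual implementation:

```python
import queue
import threading

class Lane:
    def __init__(self, size_mb):
        self.size_mb = size_mb
        self.pending = queue.Queue()       # iterations wait here (2b)

class MemoryManager:
    def __init__(self, ephemeral_capacity_mb):
        self.free_mb = ephemeral_capacity_mb
        self.cond = threading.Condition()

    def request_lane(self, size_mb):
        # Step 1b: block (queue the session) until a lane fits.
        with self.cond:
            while self.free_mb < size_mb:
                self.cond.wait()
            self.free_mb -= size_mb
            return Lane(size_mb)

    def release_lane(self, lane):
        with self.cond:
            self.free_mb += lane.size_mb
            self.cond.notify_all()

class Session:
    # Step 1a: one session per DL job, holding its computation graph.
    def __init__(self, graph, manager, lane_size_mb):
        self.graph = graph
        self.lane = manager.request_lane(lane_size_mb)

    def submit_iteration(self, iteration):
        # Step 2a: the user script forwards one iteration; the
        # iteration scheduler (2b) drains the lane's queue and sends
        # the work to the GPU.
        self.lane.pending.put(iteration)
```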

Based on observation (5), it is possible to keep more than one job's persistent memory in the GPU while still leaving enough space for either one's ephemeral memory.

Note that Salus is designed to enable significantly faster suspend/resume operations by keeping persistent memory in place on the GPU; an iteration-granularity job scheduler (e.g., time-sharing or preemption-based) then decides which job's iteration runs next.

Keep in mind that finer-grained scheduling also adds more overhead to the execution service. The finest granularity would be kernel-level scheduling, which can deadlock as each kernel's ephemeral memory grows, and which also breaks common efficiency optimizations in DL frameworks such as kernel batching and pipelining. In contrast, iteration granularity sidesteps the problem of progressively growing memory: all ephemeral allocations are released by the framework after each iteration, while model and framework-internal allocations remain constant across iterations.
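A sketch of what iteration-granularity scheduling buys: decisions happen only at iteration boundaries, where ephemeral memory has just been released, so switching jobs moves no memory. The Job class and the pick_next policy hook below are simplifications of mine, not the paper's API.

```python
class Job:
    def __init__(self, iterations_left):
        self.iterations_left = iterations_left

    @property
    def done(self):
        return self.iterations_left == 0

    def run_one_iteration(self):
        # Ephemeral memory would be allocated here and fully released
        # by the framework when the iteration finishes, so preempting
        # at this boundary is deadlock-free and transfers no memory.
        self.iterations_left -= 1

def run_scheduler(jobs, pick_next):
    # `pick_next` is any policy (FIFO, SRTF, FAIR, ...); jobs keep
    # their persistent memory resident on the GPU the whole time.
    while any(not j.done for j in jobs):
        job = pick_next([j for j in jobs if not j.done])
        job.run_one_iteration()

# Example: shortest-remaining-time-first at iteration granularity.
run_scheduler([Job(3), Job(1)],
              pick_next=lambda js: min(js, key=lambda j: j.iterations_left))
```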

GPU Lane

The authors use the term lane to refer to chunks of each GPU's ephemeral memory. Persistent memory is set aside, and the remaining ephemeral region is divided into lanes: contiguous memory spaces that hold the ephemeral allocations of iterations.

Lanes are not only about memory. Iteration execution is serialized within a lane, while parallelism is achieved across lanes; this is implemented using GPU streams (concurrency). Defragmentation happens at the end of each iteration, a direct consequence of choosing iteration granularity. The following figure shows that when the small job stops, its lane is quickly reclaimed at the iteration boundary by the job that was allocated below it.

[Figure: a small job's lane reclaimed at an iteration boundary (source: [1])]
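A sketch of the lane abstraction under these rules (the names are mine): one stream per lane serializes that lane's iterations, while separate streams let different lanes overlap on the GPU.

```python
class GPULane:
    # One GPU stream per lane: iterations in the same lane serialize,
    # iterations in different lanes may overlap on the GPU.
    def __init__(self, stream_id, size_mb):
        self.stream_id = stream_id
        self.size_mb = size_mb
        self.pending = []            # iterations execute in FIFO order

    def enqueue(self, iteration):
        self.pending.append(iteration)

def dispatch_round(lanes):
    # At most one iteration per lane is in flight at a time; mapping
    # each lane to its own stream is what gives the hardware a chance
    # to run different lanes concurrently.
    return {lane.stream_id: lane.pending.pop(0)
            for lane in lanes if lane.pending}
```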

Salus uses a heuristic algorithm to determine the size and number of lanes in the GPU, as well as how lanes are assigned to jobs. At the highest level, the algorithm tries to open a new lane, use an existing lane, or reorganize the lane assignments of existing jobs to reduce the size of the ephemeral region. You can check the full algorithm in the paper [1].
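A simplified sketch of that decision order; the helper functions and conditions below are placeholders of mine, and the paper's actual algorithm has more cases:

```python
def open_new_lane(size_mb):
    # Placeholder: carve a new lane out of the ephemeral region.
    return {"size_mb": size_mb, "jobs": []}

def reorganize(lanes):
    # Placeholder: re-pack jobs onto lanes to shrink the ephemeral
    # region; the real algorithm is given in the paper [1].
    return None

def assign_lane(job_ephemeral_mb, lanes, free_ephemeral_mb):
    # 1. Open a new lane if the ephemeral region has room for it.
    if free_ephemeral_mb >= job_ephemeral_mb:
        return open_new_lane(job_ephemeral_mb)
    # 2. Otherwise reuse an existing lane that is large enough.
    for lane in lanes:
        if lane["size_mb"] >= job_ephemeral_mb:
            return lane
    # 3. Otherwise try to reorganize assignments to free enough space.
    return reorganize(lanes)
```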

Scheduling Policies in Salus

Salus uses PACK to improve resource utilization, SRTF to prevent the HOL blocking issue, and FAIR to equalize the resource shares of concurrent jobs. A safety condition is enforced in the PACK policy to prevent crashes, ensuring that persistent and ephemeral regions never collide. The authors assume that execution times are known, which makes SRTF implementable; other approaches try to estimate the execution time [2].
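The safety condition can be sketched as a simple accounting check (the field names are mine): admit a job only if every job's persistent region plus every lane's ephemeral region still fits in GPU memory.

```python
def is_safe_to_admit(new_job, resident_jobs, lanes, gpu_capacity_mb):
    # Persistent regions of all jobs plus all lanes' ephemeral regions
    # must fit in GPU memory, so the two region types never collide.
    persistent = (sum(j["persistent_mb"] for j in resident_jobs)
                  + new_job["persistent_mb"])
    ephemeral = sum(lane["size_mb"] for lane in lanes)
    return persistent + ephemeral <= gpu_capacity_mb

# Example: admit a second job on a 16 GB GPU.
jobs = [{"persistent_mb": 300}]
lanes = [{"size_mb": 6000}]
print(is_safe_to_admit({"persistent_mb": 250}, jobs, lanes, 16_384))  # True
```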

Evaluation

Salus is integrated with TensorFlow and evaluated using a collection of training, hyperparameter tuning, and inference workloads.

Baselines: FIFO scheduling, as used in clusters, and NVIDIA MPS.

  • Using the SRTF policy outperforms FIFO by 3.18X in completion time.
  • Running several DL jobs concurrently improves GPU utilization by 2.38X during hyperparameter tuning.
  • GPU utilization for inference improves by 42X over not sharing the GPU and 7X over MPS.

The experiments run on Intel CPUs and two NVIDIA P100 GPUs.

Overhead of Salus

The authors note one source of overhead: some DL models, such as Auto Encoder and Super Resolution, perform large portions of their processing on the CPU in addition to heavy GPU computation. Since Salus implements its own execution engine, this CPU computation is also redirected to Salus for execution, where it is not as well optimized.

Conclusion

Although GPUs are ubiquitous in DL applications, they suffer from low utilization because of the nature of DL applications and GPUs' lack of fine-grained sharing primitives. Salus proposes a new mechanism to alleviate these issues by scheduling at iteration granularity and keeping each model's persistent data on the GPU to enable fast job switching. It achieves considerable improvements, with some overhead for applications demanding high CPU computation alongside GPU computation.

References

[1] Yu, Peifeng, and Mosharaf Chowdhury. “Salus: Fine-grained GPU sharing primitives for deep learning applications.” arXiv preprint arXiv:1902.04610 (2019).

[2] Peng, Yanghua, et al. “Optimus: An efficient dynamic resource scheduler for deep learning clusters.” Proceedings of the Thirteenth EuroSys Conference. 2018.
