Multi-Process Service (MPS) of NVIDIA GPUs

Ehsan Yousefzadeh-Asl-Miandoab
Jul 27, 2022

This article gathers the information needed to understand the Multi-Process Service (MPS) capability of NVIDIA GPUs. It is organized as a timeline, building up the concepts required to understand how MPS works: it starts with the differences between CPUs and GPUs, then elaborates on the GPU under-utilization issue, delves into CUDA streams, discusses Hyper-Q, and finally gets to MPS. Readers who want to go directly to MPS can scroll down to the MPS section or use the NVIDIA documentation listed in the references.

Introduction

Graphics processing units (GPUs) offer significantly higher performance than central processing units (CPUs) for applications whose computations are parallel in nature. However, developing a sequential program is much easier than developing a parallel one, since it does not require parallel thinking and problem-solving skills; the design and debugging of parallel programs usually demand more time and mental effort. That is the cost of getting a speedup at the software level. In addition, GPUs have many more cores than CPUs. Consider the following comparison: an NVIDIA A100 GPU has roughly 7,000 simple cores, while an AMD EPYC 7742 CPU has only 64 complex cores.

CPU and GPU specification difference [NVIDIA A100][AMD EPYC 7742]

The following figure shows how they differ in terms of architecture: CPUs employ large cache hierarchies, which makes them latency-oriented processors, while GPUs are throughput-oriented.

Image credit [CUDA C Programming Guide]

Thus, a GPU packs a large amount of hardware resources that a program must utilize. However, most of the time it is not possible to write, or find, programs that keep all of those resources busy. This issue is well known as the "under-utilization" problem. A solution is to run several applications on the same GPU simultaneously to increase utilization. Nevertheless, GPUs, unlike CPUs, lack fine-grained sharing mechanisms and do not provide the same virtual-memory-based isolation between tenants. Furthermore, context switching on a GPU imposes a very expensive overhead, since the amount of data to be moved is usually enormous. Therefore, NVIDIA introduced the Hyper-Q technology, which enables several CPU threads to launch work on a single GPU.

NVIDIA Hyper-Q Technology

Hyper-Q enables several CPU threads to launch kernels on a single GPU, which increases GPU utilization and decreases CPU idle time. Hyper-Q also eliminates false dependencies between streams, further increasing GPU utilization. To understand how Hyper-Q works, we first need to review CUDA streams, which were the primary way of submitting work to GPUs. NVIDIA introduced Hyper-Q with the Kepler architecture (2012), the successor to Fermi (2010).

CUDA Streams

To make this easier to learn, remember that we usually use different streams to overlap data transfers (between CPU memory and GPU memory) with computations, or to overlap independent computations with each other. A CUDA stream is a queue of operations (kernels and memory copies) that execute on the GPU in the order in which they are issued by the host code; operations from different streams, however, can be interleaved.

Note that a streaming multiprocessor (SM) is a different concept from a stream.

All GPU operations (kernels and data transfers) in CUDA run in a stream. The default stream (or null stream) is used when no stream is specified. The following sentence, copied from an NVIDIA blog post [3], describes the default stream:

No operation in the default stream will begin until all previously issued operations in any stream on the device have completed, and an operation in the default stream must complete before any other operation (in any stream on the device) will begin.

In CUDA 7 and newer versions, it is possible to have a separate default stream per host thread. Consider the following CUDA C++ example, taken from an NVIDIA blog post [3]. Instruction (1) copies data from the system's main memory to the GPU's memory. When the copy is finished, instruction (2) launches a kernel with 1 block of N threads, which execute on the GPU cores in parallel. After the kernel finishes, instruction (3) copies the data back from GPU memory.

cudaMemcpy(gpu_array, sys_array, numBytes, cudaMemcpyHostToDevice);   // (1)

increment<<<1, N>>>(gpu_array); // (2)

cudaMemcpy(sys_array, gpu_array, numBytes, cudaMemcpyDeviceToHost); // (3)

The GPU is unaware of what is happening on the CPU (host) side. In the following snippet, once instruction (2) has been launched on the device, instruction (3) starts executing on the CPU. When the kernel execution on the GPU has finished, instruction (4) carries out the data movement back to the host.

cudaMemcpy(gpu_array, sys_array, numBytes, cudaMemcpyHostToDevice);   // (1)

increment<<<1, N>>>(gpu_array); // (2)

aCpuFunction(b); // (3)

cudaMemcpy(sys_array, gpu_array, numBytes, cudaMemcpyDeviceToHost); // (4)

After creating a non-default stream, we can specify the stream on which we want our kernel to be launched, as follows:

cudaStream_t stream1;
cudaError_t result;
result = cudaStreamCreate(&stream1); // creating a non-default stream

increment<<<1, N, 0, stream1>>>(gpu_array);
result = cudaMemcpyAsync(gpu_array, sys_array, numBytes, cudaMemcpyHostToDevice, stream1);

result = cudaStreamDestroy(stream1); // destroying the previously created stream

cudaMemcpyAsync, when used on a non-default stream, allows CPU computation to overlap with GPU work: it returns control to the host thread immediately after issuing the copy operation.
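One practical detail that is easy to miss (it is covered in the NVIDIA posts in the references): for cudaMemcpyAsync to actually execute asynchronously and overlap with other work, the host buffer must be page-locked (pinned) memory. A minimal sketch, reusing the sys_array, gpu_array, numBytes, and stream1 names from the snippets above:

float *sys_array, *gpu_array;
cudaMallocHost((void **)&sys_array, numBytes);   // pinned (page-locked) host allocation
cudaMalloc((void **)&gpu_array, numBytes);       // device allocation

// With pinned host memory, this asynchronous copy can truly overlap with other work
cudaMemcpyAsync(gpu_array, sys_array, numBytes, cudaMemcpyHostToDevice, stream1);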

The following example from the NVIDIA blog [3] clearly shows how using streams wisely can result in higher performance through more parallelism.

const int block_size = 256, number_of_streams = 4;
const int n = 4 * 1024 * block_size * number_of_streams;
const int stream_size = n / number_of_streams;
const int stream_bytes = stream_size * sizeof(float);
const int bytes = n * sizeof(float);

// Version 1: issue copy, kernel, and copy-back per stream, one stream at a time
for (int i = 0; i < number_of_streams; ++i) {
    int offset = i * stream_size;
    cudaMemcpyAsync(&gpu_array[offset], &sys_array[offset], stream_bytes, cudaMemcpyHostToDevice, stream[i]);
    kernel<<<stream_size / block_size, block_size, 0, stream[i]>>>(gpu_array, offset);
    cudaMemcpyAsync(&sys_array[offset], &gpu_array[offset], stream_bytes, cudaMemcpyDeviceToHost, stream[i]);
}

// Version 2: issue all host-to-device copies, then all kernels, then all device-to-host copies
for (int i = 0; i < number_of_streams; ++i) {
    int offset = i * stream_size;
    cudaMemcpyAsync(&gpu_array[offset], &sys_array[offset], stream_bytes, cudaMemcpyHostToDevice, stream[i]);
}

for (int i = 0; i < number_of_streams; ++i) {
    int offset = i * stream_size;
    kernel<<<stream_size / block_size, block_size, 0, stream[i]>>>(gpu_array, offset);
}

for (int i = 0; i < number_of_streams; ++i) {
    int offset = i * stream_size;
    cudaMemcpyAsync(&sys_array[offset], &gpu_array[offset], stream_bytes, cudaMemcpyDeviceToHost, stream[i]);
}

The result of execution on a GPU with a single copy engine and a single kernel engine is shown below. The sequential version corresponds to using blocking cudaMemcpy calls instead of cudaMemcpyAsync.

image credit [NVIDIA blog post: How to Overlap Data Transfers in CUDA C/C++]

The result of execution on a GPU with two copy engines and one kernel engine is shown below.

image credit [NVIDIA blog post: How to Overlap Data Transfers in CUDA C/C++]

When multiple kernels are launched in different non-default streams, the GPU scheduler tries to execute them concurrently. To do so, it delays the completion signal that normally occurs after each kernel finishes and that serves as the start signal for dependent memory copies. As a result, while host-to-device transfers overlap with kernel execution in the second version of our asynchronous code, there is no overlap between kernel execution and the device-to-host transfers.

Remember that asynchronous operations return control to the host thread before the device has finished the requested work, so the CPU can keep issuing work while the GPU engines queue and execute it. The operations that are asynchronous with respect to the host are (a short sketch follows the list):

  • Kernel launches
  • Memory copies between two addresses within the same device memory
  • Memory copies from host to device of a memory block of 64 KB or less
  • Memory copies performed by functions with the Async suffix
  • Memory set function calls
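
A short sketch of what this means in practice, reusing the increment kernel, stream1, and the array names from the snippets above: the host must synchronize explicitly before it can safely read the results.

increment<<<1, N, 0, stream1>>>(gpu_array);                                         // returns to the host immediately
cudaMemcpyAsync(sys_array, gpu_array, numBytes, cudaMemcpyDeviceToHost, stream1);   // also returns immediately
aCpuFunction(b);                    // CPU work overlaps with the GPU work queued above
cudaStreamSynchronize(stream1);     // block until everything queued in stream1 has finished
// sys_array is now safe to read on the host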

To enable per-thread default streams in CUDA 7 and later, you can either compile with the nvcc command-line option --default-stream per-thread, or #define the CUDA_API_PER_THREAD_DEFAULT_STREAM preprocessor macro before including CUDA headers (cuda.h or cuda_runtime.h).

It is important to note that you cannot use #define CUDA_API_PER_THREAD_DEFAULT_STREAM to enable this behaviour in a .cu file compiled by nvcc, because nvcc implicitly includes cuda_runtime.h at the top of the translation unit.
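
For example, in a host-only .cpp file (a minimal sketch; the file name is only illustrative):

// stream_app_host.cpp (compiled by the host C++ compiler, not by nvcc)
#define CUDA_API_PER_THREAD_DEFAULT_STREAM   // must appear before any CUDA header
#include <cuda_runtime.h>

// ... host code using the CUDA runtime API; each host thread now gets its own default stream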

A Multi-Stream Example

const int N = 1 << 20;

__global__ void kernel(float *x, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
        x[i] = sqrt(pow(3.14159, i));
    }
}

int main()
{
    const int num_streams = 8;

    cudaStream_t streams[num_streams];
    float *data[num_streams];

    for (int i = 0; i < num_streams; i++) {
        cudaStreamCreate(&streams[i]);

        cudaMalloc(&data[i], N * sizeof(float));

        // launch one worker kernel per stream
        kernel<<<1, 64, 0, streams[i]>>>(data[i], N);

        // launch a dummy kernel on the default stream
        kernel<<<1, 1>>>(0, 0);
    }

    cudaDeviceReset();

    return 0;
}

If the code is compiled with the standard nvcc command as follows:

$ nvcc ./stream_test.cu -o stream_legacy

Because per-thread default streams are not enabled, the dummy kernel launched on the default stream serializes all the streams, so the device executes the kernels one after another. The profiling results show:

image credit [NVIDIA blog, “GPU Pro Tip: CUDA 7 Streams Simplify Concurrency”]

However, with the following command:

$ nvcc --default-stream per-thread ./stream_test.cu -o stream_per-thread

The same program executes as follows:

image credit [NVIDIA blog, “GPU Pro Tip: CUDA 7 Streams Simplify Concurrency”]

Multiple host threads can launch kernels that run concurrently on the device with the help of per-thread default streams, introduced in CUDA 7; the full multi-threaded example is linked from the NVIDIA blog post [4]. A minimal sketch of the idea is given below.
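
The sketch below is not the exact code linked from the blog post; it only illustrates the idea: every host thread launches the same kernel into its own per-thread default stream, so compiling with --default-stream per-thread lets the eight kernels run concurrently.

#include <thread>

const int N = 1 << 20;

__global__ void kernel(float *x, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
        x[i] = sqrt(pow(3.14159, i));
    }
}

void launch_work()
{
    float *data;
    cudaMalloc(&data, N * sizeof(float));

    // this launch goes into the calling host thread's own default stream
    kernel<<<1, 64>>>(data, N);

    cudaDeviceSynchronize();   // coarse-grained: waits for all outstanding GPU work
    cudaFree(data);
}

int main()
{
    const int num_threads = 8;
    std::thread threads[num_threads];

    for (int i = 0; i < num_threads; i++)
        threads[i] = std::thread(launch_work);

    for (int i = 0; i < num_threads; i++)
        threads[i].join();

    cudaDeviceReset();
    return 0;
}

Compiled with something like nvcc --default-stream per-thread stream_threads.cu -o stream_threads (the file name is hypothetical), the profiler should show the kernels from different host threads overlapping instead of serializing on a single default stream.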

Hyper-Q Technology

Even before Hyper-Q was introduced with Kepler (2012), different host threads could already submit work to different streams. The work distributor took work from the head of the pipeline and dispatched it to the available SMs after checking that all dependencies were satisfied.

image credit [Hyper-Q Example by NVIDIA]

With the Fermi architecture's single pipeline (one hardware work queue; recall the single-engine example in the streams section), this depth-first launch sequence results in false dependencies. As a result, the hardware can only determine that it can execute the shaded pairs concurrently.

image credit [Hyper-Q Example by NVIDIA]

With the Kepler architecture and Hyper-Q, a grid management unit (GMU) was introduced. The GMU provides multiple hardware work queues (32 of them), which reduces or eliminates false dependencies.

SMX stands for Streaming Multiprocessor neXt generation!

image credit [Hyper-Q Example by NVIDIA]

The following example shows how the execution of the same snippet differs on devices without and with Hyper-Q.

for (int i = 0; i < number_of_streams; i++) {
    kernel_A<<<1, 1, 0, streams[i]>>>(&gpu_array[2 * i], time_clocks);
    kernel_B<<<1, 1, 0, streams[i]>>>(&gpu_array[2 * i + 1], time_clocks);
}

Without Hyper-Q:

image credit [Hyper-Q Example by NVIDIA]

With Hyper-Q, which eliminates the false dependencies:

image credit [Hyper-Q Example by NVIDIA]

MPS

MPS is an alternative, binary-compatible implementation of the CUDA API. It enables cooperative multi-process CUDA applications, typically MPI jobs, to utilize the Hyper-Q capabilities of NVIDIA GPUs based on Kepler or newer architectures. Hyper-Q makes it possible to process CUDA kernels concurrently on a GPU, which benefits performance when the GPU's compute capacity is underutilized by a single application process.

MPS on the Volta architecture added new capabilities compared to MPS on Pascal and earlier architectures: execution resource provisioning lets quality of service (QoS) be respected by limiting the fraction of the GPU each client may use, and each Volta MPS client gets its own GPU memory address space rather than sharing a single address space with all other clients.
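
As a small illustration of the provisioning limit (a sketch based on the MPS documentation [5]; app1 and app2 are hypothetical client applications):

export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=50  # cap each client started from this shell at ~50% of the GPU's threads
./app1 &
./app2 &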

To sum up, MPS increases GPU utilization and reduces on-GPU context storage and context-switching overhead.

When to use MPS and How

It is beneficial to use MPS when an individual application's work cannot saturate the GPU; applications with a small number of blocks per grid, for example, cannot utilize the GPU well. When using MPS, the GPU should be set to the EXCLUSIVE_PROCESS compute mode to ensure that only a single MPS server uses the GPU, which gives a single arbitration point.

To use MPS, suppose you have several different applications and want to execute them on one GPU simultaneously. First, the GPU's compute mode must be changed, and then the MPS control daemon should be started:

$ nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
$ nvidia-cuda-mps-control -d

Then the applications can be launched as follows:

$ ./app1 &
$ ./app2 &
$ ./app3 &

To shut down the MPS server:

$ echo quit | nvidia-cuda-mps-control
$ nvidia-smi -i 0 -c DEFAULT

Note that:

  • Only one user on a system may have an active MPS server.
  • Exclusive-mode restrictions are applied to the MPS server, not MPS clients.

A sample script for using MPS is given below. Note that setting a GPU to the exclusive-process compute mode requires root privileges.

# Create the pipe and log directories used by the MPS daemon
mkdir /tmp/mps_0
mkdir /tmp/mps_log_0

# Use GPU 0 and point MPS at the directories created above
export CUDA_VISIBLE_DEVICES=0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_0
export CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log_0

# Set GPU 0 to exclusive-process compute mode (requires root)
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS

# Start the MPS control daemon
nvidia-cuda-mps-control -d

# Launching two applications on the GPU index = 0
application1 &
application2 &
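
A matching tear-down for the script above (run from the same shell, so CUDA_MPS_PIPE_DIRECTORY still points at /tmp/mps_0) could look like this:

wait                                 # wait for application1 and application2 to finish
echo quit | nvidia-cuda-mps-control  # stop the MPS control daemon
nvidia-smi -i 0 -c DEFAULT           # restore the default compute mode (requires root)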

References

[1] https://developer.download.nvidia.com/compute/DevZone/C/html_x64/6_Advanced/simpleHyperQ/doc/HyperQ.pdf

[2] https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/

[3] https://developer.nvidia.com/blog/how-overlap-data-transfers-cuda-cc/

[4] https://developer.nvidia.com/blog/gpu-pro-tip-cuda-7-streams-simplify-concurrency/

[5] https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf

[6] https://stackoverflow.com/questions/34709749/how-do-i-use-nvidia-multi-process-service-mps-to-run-multiple-non-mpi-cuda-app

[7] https://www.olcf.ornl.gov/wp-content/uploads/2021/06/MPS_ORNL_20210817.pdf
