Linux Commands for Monitoring and Profiling GPU Workloads + Linux Must-Have Commands

Ehsan Yousefzadeh-Asl-Miandoab
10 min read · Dec 6, 2021


In this story, I collect the Linux terminal commands I use most while working on the DGX machine. It will grow over time, so every comment is appreciated.

1 — General Linux Commands

If you are working with Git Bash on Windows, I recommend adding the following line to the end of your bashrc file, which resides in the Git/etc folder. Also, add it to your remote machine’s .bashrc file.

export PS1=' \n\[\e[97;41m\]\h \[\e[97;104m\] \u \[\e[30;43m\] \w \[\e[97;45m\] `__git_ps1` \[\e[0m\]\n $ '

The Current Directory

The following command prints the directory the shell is currently in. This matters when you need to know where you are creating files, installing software, or downloading files. Note that you can pass either a relative or an absolute path to those commands.

$ pwd

Make an alias for it.

alias p='pwd'

Note: Please make sure the following lines are present in the ~/.bashrc file; without them, ~/.bash_aliases is an isolated island.

if [ -f ~/.bash_aliases ]; then
. ~/.bash_aliases
fi

Finally, don’t forget to run the following command, which reads and executes the .bash_aliases file.

$ source ~/.bash_aliases

Creating a File

$ touch <fileName>.<extension>

Making a Directory

$ mkdir <directoryName>

Removing a Directory

$ rm -r <directoryName>

Renaming a Directory

$ mv current_name new_name

Copying a file from one directory to another

$ cp path1/<fileName1> path2/<fileName2>

Note that fileName1 and fileName2 can be the same or different.
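For instance (the file and directory names here are only hypothetical), keeping the name versus renaming the copy:

$ cp docs/report.txt backup/report.txt      # same name in the destination directory
$ cp docs/report.txt backup/report_old.txt  # copy under a different name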

Clearing the Terminal

$ clear

It is recommended to add the following alias to your ~/.bash_aliases file.

alias c='clear'

Now you can clear your screen just by pressing c and Enter. But don’t forget to run the following command; it reads and executes the .bash_aliases file.

$ source ~/.bash_aliases

Downloading with the terminal

$ wget <URL1> <URL2> …
$ wget -O <newFileName> <URL>          # save under a different file name
$ wget -P <destinationDirectory> <URL> # save into a specific directory
$ wget -i <file_with_download_links>   # download every URL listed in a file
$ wget -c <URL>                        # resume the download from where it was left

Copying a local file to a remote machine and vice versa

$ scp <localPath>/<file> <remoteUser>@<remoteServer(name or IP)>:<RemotePathToCopy>

Copying from local to remote with scp.

$ scp -P <portNumber> <remoteUser>@<remoteServer(name or IP)>:<remotePath>/<file> <localPath>

Copying from remote to local with scp.

$ sftp <user>@<serverName or IP>
Connected to <serverName or IP>
sftp> dir
fileName1  fileName2  fileName3
sftp> pwd
Remote working directory: /home/<user>
sftp> get fileName3
Fetching /home/<user>/fileName3 to fileName3
fileName3                          100%  XXX.XKB  XXX.XKB/s  XX:XX
sftp> bye

As shown in the example, copying can be done by using sftp.

Running in background

Simply append an ampersand to the command.

$ command &

However, the command will keep writing to the terminal. To prevent this:

$ command > /dev/null 2>&1 &

The command above redirects stdout (standard output that writes to the terminal) to /dev/null and stderr (standard error) to stdout.

To see the jobs running in the background:

$ jobs -l

A job can be brought back to the foreground by its job ID.

$ fg %1

If there is just one job, there is no need to specify the job ID.

To kill it, the process ID can be used as follows:

$ kill -9 pID

Disowning a Job

When you need to close the SSH connection but want a process to keep running, do the following:

  1. Use Ctrl+z
  2. Use bg command (for sending the job to run in the background)
  3. Use the disown command to detach the job from the shell; it will keep running until it finishes. Note that it can still be killed at the will of the user.

CTRL + z
$ bg
# output will appear with a number inside brackets [pNum]; use the one you want to disown
$ disown -h %pNum
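Alternatively, if you know ahead of time that a job must outlive the SSH session, you can start it with nohup; a minimal sketch with a hypothetical script name:

$ nohup python long_running_script.py > run.log 2>&1 &   # immune to the hangup signal sent when the session closes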

Execution Time of a Command

The execution time of a command can be observed with the following command:

$ time -p python python_script.py

To run the time command together with another program and save the two outputs in different files:

$ { time python python-script.py 1>out.txt 2>error.txt; } 2> time.txt

Note 1. File descriptors 1 and 2 are standard output and standard error, respectively.
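As a quick check (the numbers below are only illustrative), time.txt then holds the timing reported by the shell’s time keyword:

$ cat time.txt
real    0m12.345s
user    0m10.221s
sys     0m1.032s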

System Info Check

$ lscpu

This command gives us information about the installed processor on our computer. Information like the processor’s architecture, number of cores, threads per core, cores per socket, number of sockets, NUMA nodes, processor model name, caches’ sizes, etc.

$ top
$ top -i -b -n 1000 > top-5iterations.txt
# -i only includes processes that use a fair amount of resources, -b runs in batch mode
# (taking snapshots of the top output), and -n specifies the number of iterations we want

This command shows the running processes with their process ID (PID), owning user (USER), priority (PR), consumed CPU time (TIME+), shared memory (SHR), CPU and memory usage (%CPU, %MEM), the total virtual memory used by the task (VIRT), and the resident set size that actually sits in DRAM (RES).

Press ‘q’ to exit the dynamic display.

$ top -u <user>

This one shows the list of processes launched by a specific user.

$ htop

Another monitoring command with more features.

File System Check

$ df -Th
$ lsblk
$ lsblk -f
$ lsblk | grep disk

By using these commands, you can check a system’s storage subsystem.

Checking Disk Usage by different Users

Go to /home, then run the following command:

$ sudo du -d 1 -h

Meaning of Filesystem names and their Types

This part will be developed very soon.

Memory on the system

$ cat /proc/meminfo
$ cat /proc/meminfo | grep "MemTotal"

These commands give you information about the main memory installed on the system. You can see more information about the processor in /proc/cpuinfo. Other interesting things live in the /proc/ folder as well, such as ‘uptime’, which contains two numbers: the first is the number of seconds the system has been up and serving users, and the second is the sum of the idle seconds of all processors.
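For example (the output values are illustrative and will differ on your machine):

$ cat /proc/uptime
350735.47 2348256.81              # seconds since boot, summed idle seconds of all CPUs
$ grep "model name" /proc/cpuinfo | head -1
model name      : <your CPU model>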

Finding which executable file a command runs from

$ whereis <command>

This command shows where the executable file resides (along with its source and man pages, if present).
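A typical run looks like the following (the exact paths depend on your distribution):

$ whereis python3
python3: /usr/bin/python3 /usr/lib/python3 /usr/share/man/man1/python3.1.gz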

Executing a command in a loop

For this purpose, we can use the watch command.

$ watch date # updates every 2 seconds

If the title line is not needed:

$ watch -t date
$ watch --no-title date

The interval can be changed as follows by using -n.

$ watch -n 1 date

With -d, the differences between successive updates are highlighted on the screen so they are easier to spot.

$ watch -t -n 1 -d date

With the following option, anything that has changed at least once stays highlighted; the highlight does not fade away.

$ watch -d=cumulative COMMAND
$ watch -t -n 1 "date | grep '23'"
$ watch -n 1 mpstat -P ALL

Running several commands simultaneously from a bash script file

The following snippet shows how to run two commands from a bash script; by pressing Ctrl+C, we terminate the script and end the execution of both commands. We run script files, which end in the .sh extension, with the following command.

$ bash script_name.sh

The contents of the script_name.sh file:

nvidia-smi dmon > nvidia-smi-log.txt &
dcgmi dmon -e 204,449,203,450,1002,1003,1004,1005,1006,1007,1008 > dcgmi-log.txt &
wait
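If you want to be explicit about cleaning up the background monitors on Ctrl+C, a minimal sketch (the same two commands, plus a trap) could look like this:

#!/bin/bash
nvidia-smi dmon > nvidia-smi-log.txt &
dcgmi dmon -e 204,449,203,450,1002,1003,1004,1005,1006,1007,1008 > dcgmi-log.txt &
trap 'kill $(jobs -p) 2> /dev/null' INT   # on Ctrl+C, kill both background monitors
wait                                      # block until the monitors exit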

Getting the execution time of a process

We use the ps command for this purpose.

First, we find the process id of that process:

$ pidof name_of_the_process
1212

Then, the following command gives us the time elapsed since it started.

$ ps -p 1212 -o etime
ELAPSED
05-11:03:02

To list all processes with various fields:

$ ps -eo pid,lstart,etime,args

CUDA Version

$ nvcc --version
$ nvidia-smi

The CUDA version can be checked with either of these commands (nvcc reports the toolkit version, while nvidia-smi reports the CUDA version supported by the driver).

Finding processes using the Nvidia GPUs and killing them

This may be helpful when you are updating the driver and need to kill the processes that keep the device files open. The message "An NVIDIA kernel module 'nvidia' appears to already be loaded in your kernel" pops up in that situation.

$ sudo lsof -n -w /dev/nvidia*
$ sudo kill -9 PID # PID is process ID

2 — Nvidia System Management Interface (SMI) Commands

Note that Nvidia’s smi tool supports any Nvidia GPU released since the year 2011. These include the Tesla, Quadro, and GeForce devices from Fermi and higher architecture families (Kepler, Maxwell, Pascal, Volta, Turing, Ampere).

$ nvidia-smi
$ watch nvidia-smi
$ nvidia-smi -i <id> # Selects just the specified GPU
$ nvidia-smi -pm 1 # Sets all GPUs to Persistence Mode*
$ nvidia-smi -L # Lists installed GPUs without details
$ nvidia-smi -q # Lists GPU information, similar to lscpu for CPUs
$ nvidia-smi -q -i <id> # Selects just one specific device
$ nvidia-smi -l -i <id> # Watches a GPU's processes in a loop (l stands for loop)
$ nvidia-smi --query-gpu=index,name,uuid,serial,driver_version,gpu_bus_id,pci.domain,mig.mode.current --format=csv # Prints the listed fields
$ timeout 3000 nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 1 > results-file.csv # Gathers the fields in a loop, logging them to a file for a specified period of time
$ nvidia-smi --format=noheader,csv --query-compute-apps=timestamp,gpu_name,pid,name,used_memory --loop=1 -f out.log # If no app is running on the GPUs, the log file stays empty
$ nvidia-smi dmon -o DT # Power, GPU and memory temperatures, and utilizations
$ nvidia-smi pmon -o DT # The above information per process, with 1-second intervals
$ nvidia-smi -q -d SUPPORTED_CLOCKS # A list of available clock speeds
$ nvidia-smi -q -d CLOCK # Current GPU clock speed, default clock speed, and maximum possible clock speed
$ nvidia-smi -q -d PERFORMANCE # The current state of each GPU and any reasons for clock slowdowns; HW Slowdown indicates a power or cooling issue, other reasons show that the card is idle or has been manually set to a slower mode by a system administrator
$ nvidia-smi -i 0 -q -d MEMORY,UTILIZATION,POWER,CLOCK,PERFORMANCE

Persistence Mode of GPUs. This mode keeps the Nvidia driver loaded even when no applications are accessing the cards. This is useful when there is a series of short jobs running. The persistence mode uses a few more watts per idle GPU, but it prevents the fairly long delays that occur each time a GPU application is started.
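For instance, checking the current state and toggling the mode (root is needed to change it):

$ nvidia-smi --query-gpu=index,persistence_mode --format=csv   # check the current state per GPU
$ sudo nvidia-smi -i 0 -pm 1                                   # enable persistence mode for GPU 0 only
$ sudo nvidia-smi -pm 0                                        # disable it again for all GPUs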

Note that for taking advantage of more advanced Nvidia GPU features like GPU Direct, it is vital that the system topology be properly configured.

$ nvidia-smi topo --matrix # Shows the connection topology matrix between the GPUs (and NICs)
$ nvidia-smi nvlink --status
$ nvidia-smi nvlink --capabilities
$ sudo nvidia-smi nvlink -e

To learn more, use the built-in help:

$ nvidia-smi -h
$ nvidia-smi nvlink -h
$ nvidia-smi topo -h
$ nvidia-smi dmon -h
$ nvidia-smi pmon -h

Note: There is a top/htop-like Linux tool for GPUs: nvtop. However, it is not installed by default.
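On many distributions it can be installed from the package manager (the package is called nvtop on Debian/Ubuntu; on other distributions the name may differ or it may need to be built from source):

$ sudo apt install nvtop
$ nvtop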

3 — Data Center GPU Manager (DCGM) Commands

In a single sentence, DCGM simplifies the administration of Nvidia GPUs in cluster and datacenter environments by (1) monitoring GPU behavior, (2) managing GPU configuration, (3) overseeing GPU policy, (4) performing diagnostics and checking GPU health, (5) providing process statistics, and finally (6) configuring and monitoring NVSwitch.

Groups in DCGM: Almost all operations take place in groups. Users can create, modify, and destroy collections of GPUs. Groups are intended to help the user manage collections of GPUs as a single abstract resource, usually correlated to the scheduler’s notion of a node-level job. The groups don’t need to be disjoint. For example, a group created to manage the GPUs associated with a single job might have the following lifecycle. During prologue operations, the group is created, configured, and used to verify the GPUs are ready for work. During epilogue operations, the groups are used to extract target information. And while the job is running, DCGM works in the background to handle the requested behaviors. Managing groups is very simple. Using the dcgmi group subcommand, the following example shows how to create, list, and delete a group.

[from Nvidia Documents]
$ dcgmi group -l # Listing all groups
$ dcgmi group -c GPU_group1 # Creating a new group
$ dcgmi group -d <group_id> # Removing a group
$ dcgmi discovery -l # Listing all GPUs that can be added to a group
$ dcgmi group -g 1 -a 0,1 # Adding GPUs 0 and 1 to group 1
$ dcgmi config -g 21 --get --verbose

A complete review of DCGM is beyond the scope of this article. Read Nvidia’s documentation if you need to configure multiple GPUs in a group. However, the following commands are helpful when monitoring the GPUs. Look here to see what the field IDs mean, especially for the second command.

$ dcgmi dmon -e 204,449,203,450,1002,1003,1004,1005,1006,1007,1008
$ dcgmi dmon -e 203,204,449,450,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1012

The following command lists the metrics that can be monitored with DCGM; the field IDs used above can be looked up with it.

$ dcgmi dmon -l

Profiling with Nsight Systems and Compute

The following commands show how to profile an executable with Nvidia Nsight Systems.

$ nsys profile <application>
$ nsys profile --trace cuda,osrt \
      --capture-range cudaProfilerApi \
      --output /somewhere/baseline \
      --force-overwrite true \
      python3 /dli/task/nsys/application/main_baseline.py
$ /home/<username>/nsight-systems-2021.5.1/bin/nsys profile --output MNIST_nsys_profile_overhead_monitoring_T5 ~/.conda/envs/ehsanenv/bin/python3 MNIST_CNN.py

The following commands show how to profile an executable with Nvidia Nsight Compute. Note that sudo is needed when profiling with Nsight Compute.

$ sudo ncu -o <outputFileName> <exeApplicationName> \
      --section ComputeWorkloadAnalysis \
      --section InstructionStats \
      --section LaunchStats \
      --section MemoryWorkloadAnalysis \
      --section MemoryWorkloadAnalysis_Chart \
      --section Nvlink \
      --section Nvlink_Tables \
      --section MemoryWorkloadAnalysis_Tables \
      --section Occupancy \
      --section SchedulerStats \
      --section SourceCounters \
      --section SpeedOfLight \
      --section SpeedOfLight_RooflineChart \
      --section WarpStateStats
$ ncu --section-folder /usr/local/cuda/nsight-compute-2021.3.1/sections/ a.out
$ time ncu -o profile_with_compute_overhead_T1 --section-folder /usr/local/cuda/nsight-compute-2021.3.1/sections/ ~/.conda/envs/ehsanenv/bin/python3 MNIST_CNN.py --replay-mode application
# Backing up device memory in system memory. Kernel replay might be slow. Consider using "--replay-mode application" to avoid memory save-and-restore.

To see the sections included in a profile:

$ ncu --list-sections

These sections and metrics can be changed.

After doing the profiling, the file can be copied to a local system for further investigation.
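For example, pulling the reports back with scp (the report names below simply follow the commands above; Nsight Systems typically writes a .nsys-rep or .qdrep file and Nsight Compute a .ncu-rep file):

$ scp <remoteUser>@<remoteServer>:<remotePath>/MNIST_nsys_profile_overhead_monitoring_T5.nsys-rep .
$ scp <remoteUser>@<remoteServer>:<remotePath>/profile_with_compute_overhead_T1.ncu-rep .

The copied reports can then be opened locally in the Nsight Systems or Nsight Compute GUI.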
