How to Monitor GPU with Nvidia-SMI Command



What is Nvidia-smi Command Utility?

The NVIDIA System Management Interface (nvidia-smi) is a command-line utility based on the NVIDIA Management Library (NVML) designed to help manage and monitor NVIDIA GPU devices. It lets you query and track the state of your GPUs from the command line, and it is installed together with the NVIDIA display driver.

nvidia-smi ships with the NVIDIA GPU display drivers on Linux and 64-bit Windows. It can report query information as XML or as human-readable plain text, written either to standard output or to a file. See the nvidia-smi documentation for more details.

How to Run Nvidia-smi Command Utility?

The nvidia-smi utility is normally installed as part of the driver installation; no other installation step provides it. If you install an NVIDIA GPU driver from a repository maintained by NVIDIA, any recent driver package includes nvidia-smi.

How to run nvidia-smi on Windows?

By default, nvidia-smi is stored in the following location: C:\Windows\System32\DriverStore\FileRepository\nvdm*\nvidia-smi.exe. On my Windows 10 machine, nvidia-smi.exe can also be found in C:\Windows\System32. Since C:\Windows\System32 is already in the Windows PATH, running nvidia-smi from the command prompt should work out of the box.

If nvidia-smi.exe is not in a directory on the PATH, running nvidia-smi from the command prompt (CMD) in Windows returns the following error:

C:\Users>nvidia-smi
'nvidia-smi' is not recognized as an internal or external command, operable program or batch file.

In that case, open File Explorer, go to the C: drive, and type nvidia-smi in the search bar. When nvidia-smi.exe appears in the results, right-click it, open Properties, and copy the Location path. Then open the command prompt, change the working directory to the copied path, type "nvidia-smi", and press Enter.

How to run nvidia-smi on Ubuntu Linux?

On Linux, nvidia-smi lets you manage and monitor the GPU entirely from the terminal. If the driver is missing or not loaded, the command may fail with “nvidia-smi: command not found” or with the following message:

$ nvidia-smi
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

In that case, check whether the NVIDIA driver is installed correctly, or update to the latest driver version.
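
The two failure cases above can be told apart with a small shell function. This is only a rough sketch: the function name check_nvidia_setup and its two parameters are hypothetical, and it assumes a Linux system where a loaded NVIDIA kernel module exposes the file /proc/driver/nvidia/version.

```shell
# Hypothetical helper (not part of nvidia-smi): report why nvidia-smi may be failing.
# $1: command to look for (default: nvidia-smi)
# $2: file that exists only while the NVIDIA kernel module is loaded (assumed path)
check_nvidia_setup() {
    cmd="${1:-nvidia-smi}"
    proc_file="${2:-/proc/driver/nvidia/version}"

    if ! command -v "$cmd" >/dev/null 2>&1; then
        echo "not-installed: $cmd is not on PATH; install the NVIDIA driver"
        return 1
    fi
    if [ ! -e "$proc_file" ]; then
        echo "not-loaded: $cmd is installed but the NVIDIA kernel module is not loaded"
        return 2
    fi
    echo "ok: driver is installed and loaded"
}

# Run the check with the defaults; "|| true" keeps a calling script going.
check_nvidia_setup || true
```

The first case corresponds to the “command not found” error, the second to the “couldn’t communicate with the NVIDIA driver” message.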

Examples of Nvidia-smi Commands

Run without arguments, nvidia-smi prints a summary of each GPU and of the processes using it. The output is explained in detail later.

$ nvidia-smi

The -a option reports the same information as the command above, but in much greater detail.

$ nvidia-smi -a

To refresh the output every second and highlight what changed between updates, run nvidia-smi under watch:

$ watch -n 1 -d nvidia-smi

To list all available NVIDIA devices, run:

$ nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 3060 Ti (UUID: GPU-fa3da260-9c42-828f-981a-f6d7b48d77b3)

To list certain details about each GPU, try:

$ nvidia-smi --query-gpu=index,name,uuid,serial --format=csv
index, name, uuid, serial
0, NVIDIA GeForce RTX 3060 Ti, GPU-fa3da260-9c42-828f-981a-f6d7b48d77b3, [N/A]
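
The same query mechanism can feed simple scripts. As a rough sketch, the function below reads lines produced by --query-gpu=index,temperature.gpu,utilization.gpu,memory.used,memory.total with --format=csv,noheader,nounits and flags any GPU above a temperature limit. The function name gpu_temp_warn and the default limit of 85 C are illustrative assumptions, not NVIDIA recommendations.

```shell
# Hypothetical helper: read "index, temp, util, mem_used, mem_total" CSV lines
# from stdin and flag GPUs hotter than the limit in $1 (default 85, an assumption).
gpu_temp_warn() {
    awk -F', *' -v limit="${1:-85}" '
        NF >= 5 {
            mem_pct = ($5 > 0) ? 100 * $4 / $5 : 0
            status = ($2 + 0 > limit + 0) ? "WARN" : "ok"
            printf "GPU %s: %s C, %s%% util, %.0f%% memory used [%s]\n", $1, $2, $3, mem_pct, status
        }'
}

# Only query the real tool when it is available on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,temperature.gpu,utilization.gpu,memory.used,memory.total \
               --format=csv,noheader,nounits | gpu_temp_warn 85
fi
```

Keeping the parsing in a function that reads standard input means it can also be run against saved output, for example a log file collected earlier.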

To monitor overall GPU usage with 1-second update intervals:

$ nvidia-smi dmon
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0    19    41     -     0     0     0     0   405   210
    0    19    41     -     0     0     0     0   405   210
    0    19    41     -     0     0     0     0   405   210
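
For a quick summary instead of a running stream, the dmon output can be averaged. The sketch below is a minimal example under some assumptions: the function name dmon_avg_sm is hypothetical, and the "sm" column is located by its header name because the set of columns can differ between driver versions.

```shell
# Hypothetical helper: average the "sm" column of `nvidia-smi dmon` output read from stdin.
dmon_avg_sm() {
    awk '
        $1 == "#" && $2 == "gpu" {
            # Header fields are shifted by one because "#" counts as a field.
            for (i = 2; i <= NF; i++) if ($i == "sm") col = i - 1
            next
        }
        /^#/ || col == 0 { next }      # skip the units line and anything before the header
        $col ~ /^[0-9]+$/ { sum += $col; n++ }
        END { if (n > 0) printf "%.1f\n", sum / n; else print "n/a" }'
}

# Only query the real tool when it is available on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi dmon -c 5 | dmon_avg_sm     # -c 5: take five samples, then exit
fi
```

On a multi-GPU machine this averages across all devices; restricting dmon to one device with -i would give a per-GPU figure.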

To monitor per-process GPU usage with 1-second update intervals:

$ nvidia-smi pmon
# gpu        pid  type    sm   mem   enc   dec   command
# Idx          #   C/G     %     %     %     %   name
    0      22010     C    98    56     -     -   python3
    0      22010     C    98    56     -     -   python3
    0      22010     C    98    55     -     -   python3
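
The pmon stream can be filtered in the same way to find the processes that keep the GPU busy. This is a rough sketch: the function name pmon_busy and the default threshold of 50% are assumptions made for illustration, and columns are again looked up by header name.

```shell
# Hypothetical helper: print "pid command" once for each process whose sm% in
# `nvidia-smi pmon` output (read from stdin) exceeds the threshold in $1 (default 50).
pmon_busy() {
    awk -v min="${1:-50}" '
        $1 == "#" && $2 == "gpu" {
            # Map column names to data positions ("#" shifts the header by one field).
            for (i = 2; i <= NF; i++) pos[$i] = i - 1
            next
        }
        /^#/ || !("sm" in pos) { next }
        $(pos["sm"]) ~ /^[0-9]+$/ && $(pos["sm"]) + 0 > min + 0 {
            key = $(pos["pid"]) " " $(pos["command"])
            if (!(key in seen)) { seen[key] = 1; print key }
        }'
}

# Only query the real tool when it is available on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi pmon -c 3 | pmon_busy 50    # -c 3: take three samples, then exit
fi
```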

Nvidia-smi command output metrics and detailed descriptions

Running nvidia-smi without arguments generates two tables as output. The first shows information about all available GPUs (one GPU in the example here), and the second lists the processes using the GPUs. Let’s dig into it more.

Temp: The core GPU temperature in degrees Celsius. On hosted servers this is managed by DBM datacentres, so you need not worry about it unless you look after your own hardware. The “44C” in the example output is normal, but get in touch if the temperature reaches 90 C or higher.

Perf: Denotes GPU’s current performance state. It ranges from P0 to P12 referring to maximum and minimum performance respectively.

Persistence-M: The value of the Persistence Mode flag. “On” means that the NVIDIA driver remains loaded (persists) even when no active client such as nvidia-smi is running. This reduces driver load latency for dependent applications such as CUDA programs.

Pwr: Usage/Cap: The GPU’s current power usage out of its total power capacity, reported in watts.

Bus-Id: The GPU’s PCI bus ID in the form “domain:bus:device.function”, in hex. It can be used to select a particular device when filtering stats.

Disp.A: Display Active is a flag that indicates whether a display is initialized on the GPU, i.e. whether memory is allocated on the device for display. Here, “Off” indicates that no display is using the GPU device.

Memory-Usage: The memory allocated on the GPU out of its total memory. By default, TensorFlow (and Keras with the TensorFlow backend) reserves nearly all GPU memory at launch, even if it does not need it.

Volatile Uncorr. ECC: ECC stands for Error Correction Code, which verifies data transmission by locating and correcting transmission errors. NVIDIA GPUs provide a count of ECC errors. The volatile counter here tracks the number of errors since the driver was last loaded.

GPU-Util: It indicates the percent of GPU utilization i.e. percent of the time when kernels were using GPU over the sample period.

Compute M.: The Compute Mode of a specific GPU is its shared access mode. The compute mode resets to “Default” after each reboot. The “Default” value allows multiple clients to access the GPU at the same time.

GPU: Indicates the GPU index, beneficial for multi-GPU setup. This determines which process is utilizing which GPU. This index represents the NVML Index of the device.

PID: The ID of the process that is using the GPU.

Type: Refers to the type of processes such as “C” (Compute), “G” (Graphics), and “C+G” (Compute and Graphics context).

Process Name: Self-explanatory

GPU Memory Usage: Memory of specific GPU utilized by each process.
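
The process table can also be queried in CSV form with --query-compute-apps, which is easier to script against. As a minimal sketch, the function below totals GPU memory per process, since a process that uses several GPUs may appear on more than one line. The function name apps_mem_by_pid is hypothetical, and the input is assumed to be "pid, process_name, used_memory" lines produced with --format=csv,noheader,nounits.

```shell
# Hypothetical helper: sum GPU memory (MiB) per process from CSV lines
# "pid, process_name, used_memory" read from stdin, largest consumer first.
apps_mem_by_pid() {
    awk -F', *' '
        NF >= 3 { mem[$1 " " $2] += $3 }
        END { for (k in mem) printf "%d MiB  %s\n", mem[k], k }' | sort -rn
}

# Only query the real tool when it is available on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-compute-apps=pid,process_name,used_memory \
               --format=csv,noheader,nounits | apps_mem_by_pid
fi
```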

Other metrics and their detailed descriptions are given on the nvidia-smi manual page.
