Nvidia-smi provides a treasure trove of information ranging from GPU specifications and usage to temperature readings and power management. Let’s explore some of its use cases and highlight its importance in the realm of GPU management.
At the forefront of its capabilities, nvidia-smi excels in real-time monitoring of GPU performance. This includes tracking GPU utilization, which tells us how much of the GPU’s computational power the system is currently using.
It also monitors memory usage, an essential metric for understanding how much of the GPU’s video RAM (VRAM) applications are occupying, which is crucial for workload management and optimization.
Moreover, nvidia-smi provides real-time temperature readings, ensuring that the GPU operates within safe thermal limits. This aspect is especially important in scenarios involving continuous, intensive GPU usage, as it helps in preventing thermal throttling and maintaining optimal performance.
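For a quick look at these metrics, we can combine them into a single query that samples utilization, memory, and temperature once per second (a minimal sketch; the individual fields are covered in more detail later in this section):
$ nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv -l 1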
Nvidia-smi isn’t just about monitoring, as it also plays a pivotal role in hardware configuration. It allows us to query various GPU attributes, such as clock speeds, power consumption, and supported features. This information is vital if we’re looking to optimize our systems for specific tasks, whether it’s for maximizing performance in computationally intensive workloads or ensuring energy efficiency in long-running tasks.
Furthermore, nvidia-smi provides the capability to adjust certain settings like power limits and fan speeds, offering a degree of control to us if we want to fine-tune our hardware for specific requirements or environmental conditions.
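For example, a simple query surfaces the clock and power attributes mentioned above (the field names come from nvidia-smi --help-query-gpu):
$ nvidia-smi --query-gpu=clocks.sm,clocks.mem,power.draw,power.limit --format=csv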
When troubleshooting GPU issues, nvidia-smi is an invaluable asset. It offers detailed insights into the GPU’s status, which is critical in diagnosing these issues.
For instance, if a GPU is underperforming, nvidia-smi can help us identify whether the issue is related to overheating, excessive memory usage, or a bottleneck in GPU utilization. This tool also helps in identifying failing hardware components by reporting errors and irregularities in GPU performance.
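As a starting point for such a diagnosis, we can restrict the detailed report to the sections that matter for these symptoms:
$ nvidia-smi -q -d TEMPERATURE,MEMORY,UTILIZATION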
The -L option lists all GPUs in the system:
$ nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 2060 SUPER (UUID: GPU-fb087aea-1cd3-0524-4f53-1e58a5da7a3c)
It’s particularly useful for quickly identifying the GPUs present, especially in systems with multiple GPUs.
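Building on that, a couple of illustrative one-liners (a small sketch, not the only approach) count the detected devices and list their indices and UUIDs for use in scripts:
$ nvidia-smi -L | wc -l
$ nvidia-smi --query-gpu=index,uuid --format=csv,noheader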
1. Query the VBIOS version of each device:
$ nvidia-smi --query-gpu=gpu_name,gpu_bus_id,vbios_version --format=csv
name, pci.bus_id, vbios_version
NVIDIA GeForce RTX 2060 SUPER, 00000000:03:00.0, 90.06.44.80.98
Query | Description |
---|---|
timestamp | The timestamp of when the query was made, in the format "YYYY/MM/DD HH:MM:SS.msec". |
gpu_name | The official product name of the GPU. This is an alphanumeric string. For all products. |
gpu_bus_id | PCI bus id as "domain:bus:device.function", in hex. |
vbios_version | The version of the VBIOS on the GPU board. |
2. Query GPU metrics:
This query is useful for monitoring hypervisor-side GPU metrics, and it works on both ESXi and XenServer.
$ nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 5
timestamp, name, pci.bus_id, driver_version, pstate, pcie.link.gen.max, pcie.link.gen.current, temperature.gpu, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2024/01/31 07:52:12.927, NVIDIA GeForce RTX 2060 SUPER, 00000000:03:00.0, 525.78.01, P0, 3, 3, 35, 0 %, 0 %, 8192 MiB, 7974 MiB, 0 MiB
2024/01/31 07:52:17.929, NVIDIA GeForce RTX 2060 SUPER, 00000000:03:00.0, 525.78.01, P0, 3, 3, 36, 0 %, 0 %, 8192 MiB, 7974 MiB, 0 MiB
2024/01/31 07:52:22.930, NVIDIA GeForce RTX 2060 SUPER, 00000000:03:00.0, 525.78.01, P0, 3, 3, 37, 0 %, 0 %, 8192 MiB, 7974 MiB, 0 MiB
We can get a complete list of the query arguments by issuing nvidia-smi --help-query-gpu. When adding additional parameters to a query, we should ensure that no spaces appear between the query options.
Query | Description |
---|---|
timestamp | The timestamp of when the query was made, in the format "YYYY/MM/DD HH:MM:SS.msec". |
name | The official product name of the GPU. This is an alphanumeric string. For all products. |
pci.bus_id | PCI bus id as "domain:bus:device.function", in hex. |
driver_version | The version of the installed NVIDIA display driver. This is an alphanumeric string. |
pstate | The current performance state for the GPU. States range from P0 (maximum performance) to P12 (minimum performance). |
pcie.link.gen.max | The maximum PCI-E link generation possible with this GPU and system configuration. For example, if the GPU supports a higher PCIe generation than the system supports then this reports the system PCIe generation. |
pcie.link.gen.current | The current PCI-E link generation. These may be reduced when the GPU is not in use. |
temperature.gpu | Core GPU temperature, in degrees C. |
utilization.gpu | Percent of time over the past sample period during which one or more kernels was executing on the GPU. The sample period may be between 1 second and 1/6 second depending on the product. |
utilization.memory | Percent of time over the past sample period during which global (device) memory was being read or written. The sample period may be between 1 second and 1/6 second depending on the product. |
memory.total | Total installed GPU memory. |
memory.free | Total free memory. |
memory.used | Total memory allocated by active contexts. |
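To illustrate how such a query can feed a simple check, here is a minimal Bash sketch; the 80-degree threshold, the alert file path, and the single-GPU assumption are illustrative choices, not values dictated by nvidia-smi:
#!/bin/bash
# Read the current core temperature (single GPU assumed), without header or units
TEMP=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits)
THRESHOLD=80  # illustrative threshold in degrees C
if [ "$TEMP" -ge "$THRESHOLD" ]; then
    echo "$(date): GPU temperature ${TEMP}C exceeds ${THRESHOLD}C" >> /home/username/gpu_alerts.txt
fi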
Add the option "-f
Prepend "timeout -t
Ensure the your query granularity is appropriately sized for the use required:
Purpose | nvidia-smi "-l" value | Interval | timeout "-t" value | Duration |
---|---|---|---|---|
Fine-grain GPU behavior | 5 | 5 seconds | 600 | 10 minutes |
General GPU behavior | 60 | 1 minute | 3600 | 1 hour |
Coarse-grain GPU behavior | 3600 | 1 hour | 86400 | 24 hours |
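Putting the pieces together, the first row of the table might look like this (a sketch assuming GNU coreutils timeout, where the duration is passed directly rather than via -t, and an illustrative output filename):
$ timeout 600 nvidia-smi --query-gpu=timestamp,temperature.gpu,utilization.gpu,memory.used --format=csv -l 5 -f /home/username/gpu_metrics.csv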
We can also create a shell script to automate the creation of the log file, optionally adding timestamp data to the filename and extending the query parameters:
#!/bin/bash

while true; do
    /usr/bin/nvidia-smi >> /home/username/gpu_logs.txt
    sleep 600  # 10 minutes
done
Here, the script continuously logs the output of nvidia-smi to gpu_logs.txt every 10 minutes. Let’s save our Bash script as gpu_monitor.sh, and after doing so, we should remember to make it executable with the chmod command, then run the script:
$ chmod +x gpu_monitor.sh
$ ./gpu_monitor.sh
We can also set this script to run at startup or use a tool like screen or tmux to keep it running in the background.
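For example, a detached tmux session keeps the monitor alive after we log out (the session name is arbitrary):
$ tmux new-session -d -s gpu_monitor ./gpu_monitor.sh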
Alternatively, we can add a custom cron job, stored under /var/spool/cron/crontabs, to run nvidia-smi at the required intervals. We can access the cron schedule for our user by running crontab -e in our terminal:
$ crontab -e
This opens the cron schedule in our default text editor. Then, we can schedule nvidia-smi to run at regular intervals. For example, we can run nvidia-smi every 3 minutes via the cron schedule:
*/3 * * * * /usr/bin/nvidia-smi >> /home/username/gpu_logs.txt
With this in the cron schedule, we append the output of nvidia-smi to a log file gpu_logs.txt in our user home directory every 3 minutes. We should remember to save the cron schedule and exit the editor. The cron job is now set up and will run at our specified intervals.
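To confirm that the job is firing, we can count the reports accumulating in the log; the default nvidia-smi output includes a header line containing the string NVIDIA-SMI, so each run should add one match:
$ grep -c "NVIDIA-SMI" /home/username/gpu_logs.txt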
Any settings below for clocks and power get reset between program runs unless we enable persistence mode (PM) for the driver.
Also note that the nvidia-smi command runs much faster when PM is enabled.
Running nvidia-smi -pm 1 makes clock, power, and other settings persist across program runs and driver invocations:
$ nvidia-smi -pm 1
Enabled persistence mode for GPU 00000000:03:00.0.
All done.
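We can verify the change with a query, since persistence_mode is one of the fields listed by nvidia-smi --help-query-gpu; it should now report Enabled:
$ nvidia-smi --query-gpu=persistence_mode --format=csv,noheader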
Command | Description |
---|---|
nvidia-smi -q -d SUPPORTED_CLOCKS | View clocks supported |
nvidia-smi -ac <memory clock,graphics clock> | Set one of the supported clock pairs |
nvidia-smi -q -d CLOCK | View current clock |
nvidia-smi --auto-boost-default=ENABLED -i 0 | Enable boosting GPU clocks (K80 and later) |
nvidia-smi --rac | Reset clocks back to base |
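As a usage sketch, we would first list the supported memory/graphics clock pairs and then apply one of them; the pair below is purely illustrative, and application clocks cannot be set on every consumer GPU:
$ nvidia-smi -q -d SUPPORTED_CLOCKS
$ nvidia-smi -ac 3505,1455  # <memory clock>,<graphics clock> in MHz, hypothetical values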
Command | Description |
---|---|
nvidia-smi -pl N | Set power cap (maximum wattage the GPU will use) |
nvidia-smi -pm 1 | Enable persistence mode |
nvidia-smi stats -i <device#> | Provides continuous monitoring of detailed stats such as power |
nvidia-smi --query-gpu=index,timestamp,power.draw,clocks.sm,clocks.mem,clocks.gr --format=csv -l 1 | Continuously provides timestamped power and clock readings |
Adjusting the power limit can help in balancing performance, energy consumption, and heat generation. First, we can view the current power limit:
$ nvidia-smi -q -d POWER

==============NVSMI LOG==============

Timestamp                        : Wed Jan 31 08:58:41 2024
Driver Version                   : 525.78.01
CUDA Version                     : 12.0

Attached GPUs                    : 1
GPU 00000000:03:00.0
    Power Readings
        Power Management         : Supported
        Power Draw               : 10.59 W
        Power Limit              : 175.00 W
        Default Power Limit      : 175.00 W
        Enforced Power Limit     : 175.00 W
        Min Power Limit          : 125.00 W
        Max Power Limit          : 175.00 W
    Power Samples
        Duration                 : 0.14 sec
        Number of Samples        : 8
        Max                      : 28.37 W
        Min                      : 10.30 W
        Avg                      : 13.28 W
This command shows the current power usage and the power management limits. Let’s now change the power limit:
$ nvidia-smi -pl 150
Power limit for GPU 00000000:03:00.0 was set to 150.00 W from 175.00 W.
All done.
We can replace 150 with our desired power limit in watts. Notably, the maximum and minimum power limits vary between different GPU models. In addition, while adjusting GPU settings, especially the power limit or clocks, we must be cautious: pushing the GPU beyond its limits can lead to instability or damage.
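Should we want to revert, we simply set the limit back to the default reported earlier:
$ nvidia-smi -pl 175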