GPU Resources

GPUs (Graphics Processing Units) are essential for tasks requiring high parallelism, such as deep learning and simulations. In the context of HPC these devices no longer perform any graphics function, but the “graphics” name has stuck even though “accelerator” would be the more apt term. To learn more about accelerated computing, visit https://blogs.nvidia.com/blog/what-is-accelerated-computing/ .

NVIDIA GPUs are available in the public gpu and preempt partitions, as well as in some labs’ private partitions. When scheduling batch or interactive jobs that need a GPU, add the --gres option to your command to select the type and quantity of GPUs needed; after a job starts, you can verify the allocation as shown in the sketch following the examples below.

  • The simplest request is --gres=gpu:1, e.g.

    $ srun -p preempt -n 2 --mem=4g --gres=gpu:1 -t 1:00:00 --pty bash

    • --gres : Generic Resource

    • gpu:1 : requesting one GPU (any GPU architecture available in the requested partition)

  • To request a specific GPU architecture, add it to the gres specification, e.g. --gres=gpu:t4:1

    $ srun -p preempt -n 2 --mem=4g --gres=gpu:t4:1 -t 1:00:00 --pty bash

    • --gres : Generic Resource

    • gpu:t4:1 : requesting one T4 GPU.

  • To allow a job to use any of several (but not all) GPU types, combine --gres=gpu:1 with --constraint, e.g.

    $ srun -p preempt -n 2 --mem=4g --gres=gpu:1 --constraint="t4|p100|v100" -t 1:00:00 --pty bash

    • --constraint : set constraints on the resources allocated for the task.

    • t4|p100|v100 : indicates that the task can use a GPU of type t4, p100, or v100, where the | symbol means “or”.
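
Once an interactive GPU job like the examples above starts, you can confirm which device was allocated to it. A minimal check using standard nvidia-smi query options (the exact fields shown are only an illustration):

    $ nvidia-smi --query-gpu=index,name,memory.total --format=csv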

Warning

  • DO NOT manually set CUDA_VISIBLE_DEVICES.

  • Inside a job, the nvidia-smi command shows only the GPU devices assigned to that job.

  • When submitting batch jobs, it is recommended to add the nvidia-smi command to your Slurm job submission script so that its output is included in the job log for troubleshooting purposes.
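
A minimal batch-script sketch following this recommendation; the resource request mirrors the srun examples above, and the output file name and application command are placeholders for illustration:

    #!/bin/bash
    #SBATCH -p preempt
    #SBATCH -n 2
    #SBATCH --mem=4g
    #SBATCH --gres=gpu:t4:1
    #SBATCH -t 1:00:00
    #SBATCH -o gpu_job_%j.out   # placeholder output file (%j expands to the job ID)

    nvidia-smi                  # record the assigned GPU(s) in the job log for troubleshooting

    # your GPU application goes here, e.g.:
    # python train.py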

Available GPUs:

| GPU Model   | Memory       | Max per node | Partitions   | Constraints              | Notes                    |
|-------------|--------------|--------------|--------------|--------------------------|--------------------------|
| a100        | 40GB or 80GB | 8            | gpu, preempt | a100, a100-40G, a100-80G |                          |
| p100        | 16GB         | 6            | gpu, preempt | p100                     |                          |
| v100        | 16GB         | 4            | preempt      | v100                     |                          |
| t4          | 16GB         | 4            | preempt      | t4                       | Max CUDA version is 10.2 |
| rtx_6000    | 24GB         | 8            | preempt      | rtx_6000                 |                          |
| rtx_a6000   | 48GB         | 8            | preempt      | rtx_a6000                |                          |
| rtx_6000ada | 48GB         | 4            | preempt      | rtx_6000ada              |                          |
| l40s        | 48GB         | 4            | preempt      | l40s                     |                          |
| rtx_a5000   | 24GB         | 8            | preempt      | rtx_a5000                |                          |
| l40         | 48GB         | 4            | preempt      | l40                      |                          |
| h100        | 80GB         | 3            | preempt      | h100                     |                          |

The drivers on all GPU cards except the t4 support CUDA 12.2.
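
The Constraints column above can be combined with --gres to request a specific variant of a card. For example, to ask for an a100 with 80GB of memory (assuming one is free in the chosen partition):

    $ srun -p gpu -n 2 --mem=4g --gres=gpu:a100:1 --constraint="a100-80G" -t 1:00:00 --pty bash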