
Introduction to GPUs in NERC OpenShift

NERC OCP clusters use the NVIDIA GPU Operator together with the Node Feature Discovery (NFD) Operator to manage and deploy GPU worker nodes in the clusters.

GPU nodes in NERC clusters are also tainted according to their GPU device type. This ensures that only workloads that explicitly request GPUs (and tolerate the corresponding taint) consume GPU resources.
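As a rough sketch of how this works, a pod that should be scheduled onto a tainted GPU node carries a matching toleration in its spec. The taint key below is an assumed placeholder, not necessarily the exact taint applied to NERC GPU worker nodes:

```yaml
# Fragment of a pod spec: tolerate a GPU node taint.
# The key "nvidia.com/gpu" is an assumption for illustration only;
# check the actual taints on NERC GPU worker nodes.
spec:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
```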

NERC GPU Worker Node Architecture

The NERC OpenShift environment currently supports two different NVIDIA GPU products:

  1. NVIDIA-A100-SXM4-40GB (A100)
  2. Tesla-V100-PCIE-32GB (V100)

NERC GPU Worker Nodes Info

A100 worker nodes contain 4 individual GPUs, each with 40 GB of memory, whereas V100 worker nodes contain 1 GPU with 32 GB of memory.

Accessing GPU Resources

To use GPU resources in your pods, you must specify the number of GPUs you need in the "OpenShift Request on GPU Quota" attribute that has been approved for your "NERC-OCP (OpenShift)" resource allocation on NERC's ColdFront, as described here.

Deploying Workloads to GPUs

There are two ways to deploy workloads on GPU nodes:

  1. Deploy directly in your OCP namespace:

    In your project namespace, you can deploy a GPU workload by explicitly requesting a GPU in your pod manifest; see How to specify pod to use GPU and the example manifest sketch after this list.

  2. Deploy through NERC RHOAI:

    See Populate the data science project with a Workbench for selecting GPU options.

    GPU Accelerator on NERC RHOAI

    The different options for GPU accelerator are "NONE", "NVIDIA A100 GPU", and "NVIDIA V100 GPU" as shown below:

    NVIDIA GPU Accelerator
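As referenced in option 1 above, the following is a minimal, illustrative pod manifest that requests a single GPU and targets a specific GPU type through the nvidia.com/gpu.product node label populated by the GPU Operator's feature discovery. The container image, label usage, and toleration here are assumptions for illustration; follow the linked NERC documentation for the authoritative syntax:

```yaml
# Minimal sketch of a GPU workload pod (illustrative, not NERC-authoritative).
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example
spec:
  restartPolicy: Never
  nodeSelector:
    # Match one of the two supported GPU products listed above.
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB   # or Tesla-V100-PCIE-32GB
  tolerations:
    - key: nvidia.com/gpu       # assumed taint key; see the sketch earlier on this page
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cuda-check
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubi8   # illustrative CUDA base image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1     # number of GPUs requested; counted against your GPU quota
```

Because A100 worker nodes have 4 GPUs and V100 worker nodes have 1, a single pod can request at most 4 GPUs on an A100 node and at most 1 GPU on a V100 node.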