Introduction to GPUs on NERC OpenShift
NERC OCP clusters leverage the NVIDIA GPU Operator and the Node Feature Discovery Operator to manage and deploy GPU worker nodes.
GPU nodes on NERC clusters are also managed via taints according to their GPU device. This ensures that only workloads explicitly requesting GPUs will consume GPU resources.
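For illustration, the taint-and-toleration pairing works roughly as in the sketch below. The taint key, value, and effect shown here are assumptions for illustration only; the actual taints on NERC GPU nodes are defined by the cluster administrators per GPU device.

```yaml
# Sketch only: nvidia.com/gpu is an assumed taint key, not the confirmed NERC value.
#
# Taint as it might appear in a GPU worker node's spec:
#   spec:
#     taints:
#       - key: nvidia.com/gpu
#         value: "true"
#         effect: NoSchedule
#
# Matching toleration a GPU workload would carry in its pod spec:
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```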
NERC GPU Worker Node Architecture
The NERC OpenShift environment currently supports three different NVIDIA GPU products:
- NVIDIA-A100-SXM4-40GB (A100)
- NVIDIA-H100-80GB-HBM3 (H100)
- Tesla-V100-PCIE-32GB (V100)
NERC GPU Worker Nodes Info
A100 worker nodes contain 4 individual GPUs, each with 40 GB of memory. H100 worker nodes also contain 4 individual GPUs, but each has 80 GB of memory. In contrast, V100 worker nodes contain a single GPU with 32 GB of memory.
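Because the Node Feature Discovery / GPU Operator stack labels nodes with the GPU product they carry, a workload can target a specific device from the list above with a node selector. A minimal sketch, assuming the upstream `nvidia.com/gpu.product` label applied by GPU Feature Discovery is present on NERC nodes (verify the labels on your cluster before relying on them):

```yaml
# Pod spec fragment: schedule only onto nodes with a given GPU product.
nodeSelector:
  nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB   # or NVIDIA-H100-80GB-HBM3 / Tesla-V100-PCIE-32GB
```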
Accessing GPU Resources
To use GPU resources in your pods, you must specify the number of GPUs you need in the "OpenShift Request on GPU Quota" attribute of your "NERC-OCP (OpenShift)" resource allocation on NERC's ColdFront and have it approved, as described here.
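Once approved, that GPU quota is typically reflected in your project namespace as an OpenShift resource quota. The sketch below shows what such an object generally looks like; the object name, namespace, and value are illustrative, and the actual quota on NERC is created and managed for you, not something you create yourself.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota             # illustrative name
  namespace: your-project     # hypothetical namespace
spec:
  hard:
    requests.nvidia.com/gpu: "2"   # total GPUs the project's pods may request
```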
Deploying Workloads to GPUs
There are two ways to deploy workloads on GPU nodes:
- Deploy directly in your OCP namespace: In your project namespace, you can deploy a GPU workload by explicitly requesting a GPU in your pod manifest, as shown in the example after this list. See: How to specify pod to use GPU.
- Deploy through NERC RHOAI: See Populate the data science project with a Workbench for selecting GPU options.
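As a minimal sketch of the first option, the pod below requests one GPU through the `nvidia.com/gpu` extended resource. The pod name and container image are placeholders, and the linked page covers the NERC-specific details (such as tolerations or node selectors for a particular GPU device).

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example            # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: cuda-check
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubi9   # placeholder image
      command: ["nvidia-smi"]  # prints the GPU(s) visible to the container
      resources:
        limits:
          nvidia.com/gpu: 1    # number of GPUs requested for this container
```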