Introduction to GPUs on NERC OpenShift
NERC OCP clusters leverage the NVIDIA GPU Operator and the Node Feature Discovery Operator to manage and deploy GPU worker nodes.
GPU nodes on NERC clusters are also managed via taints according to their GPU device. This ensures that only workloads explicitly requesting GPUs will consume GPU resources.
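For illustration, the taint-and-toleration pairing works roughly as in the sketch below. The taint key, value, and effect shown here are assumptions for illustration only; the actual taints on NERC GPU nodes are defined by the cluster administrators per GPU device.

```yaml
# Sketch only: nvidia.com/gpu is an assumed taint key, not the confirmed NERC value.
#
# Taint as it might appear in a GPU worker node's spec:
#   spec:
#     taints:
#       - key: nvidia.com/gpu
#         value: "true"
#         effect: NoSchedule
#
# Matching toleration a GPU workload would carry in its pod spec:
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```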
NERC GPU Worker Node Architecture
The NERC OpenShift environment currently supports three different NVIDIA GPU products:
- NVIDIA-A100-SXM4-40GB (A100)
- NVIDIA-H100-80GB-HBM3 (H100)
- Tesla-V100-PCIE-32GB (V100)
NERC GPU Worker Nodes Info
A100 worker nodes contain 4 individual GPUs, each with 40 GB of memory. H100 worker nodes also contain 4 individual GPUs, but each has 80 GB of memory. In contrast, V100 worker nodes contain a single GPU with 32 GB of memory.
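Because the Node Feature Discovery / GPU Operator stack labels nodes with the GPU product they carry, a workload can target a specific device from the list above with a node selector. A minimal sketch, assuming the upstream `nvidia.com/gpu.product` label applied by GPU Feature Discovery is present on NERC nodes (verify the labels on your cluster before relying on them):

```yaml
# Pod spec fragment: schedule only onto nodes with a given GPU product.
nodeSelector:
  nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB   # or NVIDIA-H100-80GB-HBM3 / Tesla-V100-PCIE-32GB
```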
Accessing GPU Resources
To use GPU resources in your pods, you must specify the number of GPUs you need in the "OpenShift Request on GPU Quota" attribute of your "NERC-OCP (OpenShift)" resource allocation on NERC's ColdFront and have it approved, as described here.
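Once approved, that GPU quota is typically reflected in your project namespace as an OpenShift resource quota. The sketch below shows what such an object generally looks like; the object name, namespace, and value are illustrative, and the actual quota on NERC is created and managed for you, not something you create yourself.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota             # illustrative name
  namespace: your-project     # hypothetical namespace
spec:
  hard:
    requests.nvidia.com/gpu: "2"   # total GPUs the project's pods may request
```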
Deploying Workloads to GPUs
There are two ways to deploy workloads on GPU nodes:
- Deploy directly in your OCP namespace: In your project namespace, you can deploy a GPU workload by explicitly requesting a GPU in your pod manifest, as shown in the example after this list. See: How to specify pod to use GPU.
- Deploy through NERC RHOAI: See Populate the data science project with a Workbench for selecting GPU options.
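As a minimal sketch of the first option, the pod below requests one GPU through the `nvidia.com/gpu` extended resource. The pod name and container image are placeholders, and the linked page covers the NERC-specific details (such as tolerations or node selectors for a particular GPU device).

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example            # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: cuda-check
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubi9   # placeholder image
      command: ["nvidia-smi"]  # prints the GPU(s) visible to the container
      resources:
        limits:
          nvidia.com/gpu: 1    # number of GPUs requested for this container
```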