Deploying a Llama model with KServe using Red Hat OpenShift AI
In this walkthrough, we demonstrate how to deploy a Llama language model using the intuitive interface of Red Hat OpenShift AI (RHOAI) and NERC's infrastructure, including GPU acceleration, automatic resource scaling, and support for distributed computing.
Prerequisites:
- You have enabled the Single-model Serving platform. For more information about enabling the single-model serving platform, see Setting up the Single-model Server platform.
- Before proceeding, confirm that you have an active GPU quota that has been approved for your current NERC OpenShift Allocation through NERC ColdFront. Read more about How to Access GPU Resources on NERC OpenShift Allocation.
- Llama-3.2-3B-Instruct-FP8 model: Llama-3.2-3B-Instruct-FP8 is obtained by quantizing the weights of the Llama-3.2-3B-Instruct model to the FP8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, lowering GPU memory requirements by approximately 50% and increasing matrix-multiply throughput by about 2×. Weight quantization also reduces disk storage requirements by roughly 50%. For our Llama model demonstration, we are using a publicly available container image from the Quay.io registry. Specifically, we will deploy the Llama 3.2 model with 3 billion parameters, fine-tuned for instruction-following and optimized with 8-bit floating-point precision to minimize memory usage.
- Set up the OpenShift CLI (oc) tools locally and configure the OpenShift CLI to enable oc commands. Refer to this user guide.
- Helm installed locally.
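Before continuing, you can optionally verify from a terminal that the CLI prerequisites are in place. This is a quick sanity check using standard oc and helm commands; adjust it to your own login and project:

# Confirm the OpenShift CLI and Helm are installed locally
oc version --client
helm version

# Confirm you are logged in to the NERC OpenShift cluster and using the intended project
oc whoami
oc project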
Establishing model connections
Create a Connection to a ModelCar container image, which is an OCI-compliant container that packages a machine learning model along with its runtime environment and dependencies for consistent deployment.
In your OpenShift AI project, go to the Connections tab, click "Create connection", and then choose the URI connection type as shown below:
To create this connection in your project, enter the following URI:
oci://quay.io/jharmison/models:redhatai--llama-3_2-3b-instruct-fp8-modelcar
and use Llama 3.2 3B Modelcar as the connection name, as shown below:
Important Note: ModelCar Requirements & Guidance
You have several options for deploying models to your OpenShift AI cluster. We recommend using ModelCar because it removes the need to manually download models from Hugging Face, upload them to S3, or manage access permissions. With ModelCar, you can package models as OCI images and pull them at runtime or precache them. This simplifies versioning, improves traceability, and integrates cleanly into CI/CD workflows. ModelCar images also ensure reproducibility and maintain versioned model releases.
You can deploy your own model using a ModelCar container, which packages all model files into an OCI container image. To learn more about ModelCar containers, read the article Build and deploy a ModelCar container in OpenShift AI. It explains the benefits of ModelCar containers, how to build a ModelCar image, and how to deploy it with OpenShift AI.
For additional patterns and prebuilt ModelCar images, explore the Red Hat AI Services ModelCar Catalog repository on GitHub. Prebuilt images from this catalog are also available in the ModelCar Catalog registry on Quay. However, note that all these images are compiled for the x86 architecture. If you're targeting ARM, you'll need to rebuild these images on an ARM machine, as demonstrated in this guide.
Additionally, you may find it helpful to read Optimize and deploy LLMs for production with OpenShift AI.
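Before deploying, you can optionally inspect the ModelCar image from your terminal to confirm that it exists and to check which architecture it was built for. This is a quick check, assuming you have the oc CLI configured as described in the prerequisites:

# Inspect the ModelCar image manifest; the output includes the image's OS/architecture
# (add --filter-by-os=linux/amd64 if the image is published as a multi-arch manifest list)
oc image info quay.io/jharmison/models:redhatai--llama-3_2-3b-instruct-fp8-modelcar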
Setting up the Single-model Server and Deploying the Model
- In the left menu, click Data science projects. The Data science projects page opens.
- Click the name of the project that you want to deploy a model in. A project details page opens.
- Click the Models tab.
- Click the Deploy model button. The Deploy model dialog opens.
- Enter the following information for your new model:
    - Model deployment name: Enter a unique name for the model that you are deploying (e.g., "mini-llama-demo").
    - Serving runtime: Select vLLM NVIDIA GPU ServingRuntime for KServe as the runtime.
    - Model framework (name - version): This is pre-selected as vLLM.
    - Deployment mode: From the Deployment mode list, select Advanced - uses Knative Serverless.
    - Number of model server replicas to deploy: Set Minimum replicas to 1 and Maximum replicas to 1.
    - Model server size: This is the amount of resources (CPU and RAM) that will be allocated to your server. Here, you can select the Medium size.
    - Accelerator: Select NVIDIA A100 GPU.
    - Number of accelerators: 1.
    - Model route: Select the checkbox for "Make deployed models available through an external route". This enables you to send requests to the model endpoint from outside the cluster.
    - Token authentication: Select the checkbox for "Require token authentication" if you want to secure or restrict access to the model by requiring requests to provide an authorization token, which is important for security. When you select it, you can keep the pre-populated Service account name, i.e. default-name.
    - Source model location: From the Existing connection dropdown list, select the connection you created earlier, i.e. Llama 3.2 3B Modelcar.
    - Configuration parameters: You can customize the runtime parameters in the Configuration parameters section. You don't need to add any arguments here.
-
For our example, set the Model deployment name to mini-llama-demo, and select
Serving runtime as vLLM NVIDIA GPU ServingRuntime for KServe. Also, ensure
that the Deployment mode is set to Advanced - uses Knative Serverless.
Please leave the other fields at their default settings. For example, the
Number of model server replicas to deploy has Minimum replicas set to 1
and Maximum replicas set to 1, and the Model server size is set to Medium.
Choose NVIDIA A100 GPU as the Accelerator, with the Number of accelerators
set to 1.
At this point, ensure that both
Make deployed models available through an external route and
Require token authentication are checked. Leave the pre-populated Service account name, i.e. default-name, as it is. Select Llama 3.2 3B Modelcar from the Existing connection dropdown, as shown below:
When you are ready to deploy your model, select the Deploy button.
Confirm that the deployed model appears on the Models tab for your project. Once the model has finished deploying (this may take some time), the model deployments page of the dashboard displays a green checkmark in the Status column, indicating that the deployment is complete.
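If you prefer to watch the rollout from a terminal, you can also follow the KServe InferenceService that the dashboard creates for the deployment. This is a minimal check, assuming the model deployment name is mini-llama-demo and you are logged in to your project with oc:

# Watch the InferenceService created for the deployment; it is ready once
# READY reports True and a URL is populated
oc get inferenceservice mini-llama-demo -n <your-namespace> -w

# Optionally, inspect events if the deployment stays in a non-ready state
oc describe inferenceservice mini-llama-demo -n <your-namespace>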
To view details for the deployed model, click the dropdown arrow icon to the left
of your deployed model name (e.g., mini-llama-demo), as shown below:
You can also modify the configuration of your deployed model by clicking the three dots on the right side and selecting Edit. This brings back the same configuration pop-up window we used earlier. This menu also gives you the option to Delete the deployed model.
Intelligent Auto-Scaling and Scale-to-Zero for Significant Cost Savings
Once you have deployed your model and obtained the inference endpoints, you can modify the model configuration to set the Minimum replicas to 0, then redeploy it as shown below:
This enables intelligent auto-scaling of your model's compute resources
(CPU, GPU, RAM, etc.), allowing replicas to scale up during high traffic and
scale down when idle. With scale-to-zero enabled, the system reduces pods
to zero during inactivity, eliminating idle compute costs - especially beneficial
for GPU workloads. The model then scales back up instantly as soon as a new
request arrives.
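The dashboard edit described above is the supported way to make this change. As a sketch of the equivalent CLI approach, the same setting can also be applied by patching the underlying KServe InferenceService (assuming the deployment is named mini-llama-demo):

# Allow the predictor to scale down to zero replicas when idle
oc patch inferenceservice mini-llama-demo -n <your-namespace> \
  --type merge \
  -p '{"spec": {"predictor": {"minReplicas": 0}}}'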
Testing your deployment
Internal testing
Once deployed, navigate to Workloads > Pods in the left-hand menu, then locate and click on the pod that corresponds to the model deployment name, as shown below:
Access the pod's terminal by clicking the Terminal tab, then run a curl command to test internal communication.
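Alternatively, you can open the same terminal from the command line. This is a sketch using standard oc commands; the predictor pod name varies, so copy it from the pod listing:

# Locate the predictor pod for the deployment, then open a shell inside it
oc get pods -n <your-namespace> | grep mini-llama-demo
oc rsh -n <your-namespace> <pod-name-from-above>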
The vLLM runtime uses OpenAI's API format, making integration straightforward. You can learn more in the OpenAI documentation.
The following is an example command you can use to test the connection:
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "system", "content": "You are a helpful assistant"},
{"role": "user", "content": "Hello"},
{"role": "assistant", "content": "Hello! How can I help you?"},
{"role": "user", "content": "What is 2 plus 2?"}
]
}'
If the command succeeds, it should return output similar to the following:
Testing external access
For external testing, use the token and external endpoint in your curl command.
The deployed model is now accessible through the API endpoint of the model server. The endpoint information differs depending on how you configured the model server.
Since in this example you exposed the model externally through a route, click the "Internal and external endpoint details" link in the Inference endpoint section. A popup will display the internal URL and the external URL (which can be accessed from inside or outside the cluster) for the inference endpoints, as shown below:
Notes:
- The internal URL displayed is only the base address of the endpoint, in the following format: https://name-of-your-model.name-of-your-project-namespace.svc.cluster.local, which is accessible only locally within your cluster.
- The External inference endpoint displays the full URL, in the following format: https://name-of-your-model-name-of-your-project.apps.shift.nerc.mghpcc.org, which can easily be accessed from outside the cluster.
- Get the Authorization Token for your deployed model by clicking the dropdown arrow icon to the left of your deployed model name (e.g., mini-llama-demo). Your Authorization Token is located in the "Token authentication" section under "Token secret"; you can copy the token, i.e. YOUR_BEARER_TOKEN, directly from the UI.
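If you prefer to look these values up from a terminal rather than the dashboard, the external URL is also reported on the InferenceService status, and the token is stored in a Secret in your project. This is a sketch; the exact name of the token Secret can vary, so list the secrets and pick the one associated with the model's service account:

# Print the externally routable inference URL reported by KServe
oc get inferenceservice mini-llama-demo -n <your-namespace> -o jsonpath='{.status.url}{"\n"}'

# List secrets, find the token secret for the model's service account, then decode it
oc get secrets -n <your-namespace>
oc get secret <token-secret-name> -n <your-namespace> -o jsonpath='{.data.token}' | base64 -d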
The following are some example commands you can use to test the connection:
curl -X POST https://<url>/v1/chat/completions \
-H "Content-Type: application/json" -H "Authorization: Bearer YOUR_BEARER_TOKEN" \
-d '{
"messages": [
{"role": "system", "content": "You are a helpful assistant"},
{"role": "user", "content": "Hello"},
{"role": "assistant", "content": "Hello! How can I help you?"},
{"role": "user", "content": "What is 2 plus 2?"}
]
}'
Output:
curl -k -X POST https://<url>/v1/completions \
-H "Content-Type: application/json" -H "Authorization: Bearer YOUR_BEARER_TOKEN" \
-d '{
"model": "name-of-your-model",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0.7
}'
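The second example above requires the served model name in the model field. As a quick sanity check, you can list the model IDs the endpoint exposes; the /v1/models route is part of vLLM's OpenAI-compatible API:

# List the models served by the endpoint; the returned "id" is the value to use for "model"
curl -k -H "Authorization: Bearer YOUR_BEARER_TOKEN" https://<url>/v1/models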
Web interface integration using Open WebUI
For a more user-friendly experience, integrate with Open WebUI as follows:
- Clone or navigate to this repository.
  To get started, clone the repository using:
  git clone https://github.com/nerc-project/llm-on-nerc.git
  cd llm-on-nerc/llm-clients/openwebui/charts/openwebui
- Prepare values.yaml to connect Open WebUI to the deployed vLLM model.
  Edit the values.yaml file to specify your running vLLM model, external endpoint, and token:
  vllmEndpoint: http://vllm.example.svc:8000/v1
  vllmModel: granite-3.3-2b-instruct
  vllmToken: ""
- Install the Helm chart.
  Deploy Open WebUI using Helm with your configuration:
  helm install openwebui ./ -f values.yaml
  Output:
  NAME: openwebui
  LAST DEPLOYED: Tue Dec 2 22:52:06 2025
  NAMESPACE: <your-namespace>
  STATUS: deployed
  REVISION: 1
  DESCRIPTION: Install complete
  TEST SUITE: None
  NOTES:
  1. Get the Open WebUI URL by running these commands:
    route_hostname=$(kubectl get --namespace <your-namespace> route openwebui -o jsonpath='{.status.ingress[0].host}')
    echo https://${route_hostname}
- Access Open WebUI and test the vLLM integration.
  Ensure the web UI is connected to your vLLM endpoint by sending a simple prompt and verifying the response as shown below:
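Before opening the UI in your browser, you can confirm that the Open WebUI pod is running and that the route from the Helm NOTES output has been created. A minimal check, assuming the release was installed into your current project:

# Verify the Open WebUI pod is running and retrieve its route hostname
oc get pods -n <your-namespace> | grep openwebui
oc get route openwebui -n <your-namespace> -o jsonpath='{.status.ingress[0].host}{"\n"}'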
To Remove the Helm Chart
Run the following command to cleanly uninstall and delete a Helm release:
helm uninstall openwebui