Serving Text Generation Inference Server (TGIS) and FLAN-T5 Small Model
Prerequisites:
FLAN-T5 Small model: Google's FLAN-T5 Small is a lightweight model known for its compatibility and low resource requirements (no GPU needed). Although it's not a high-performing model, the same process applies to any compatible model. FLAN-T5 Small is based on the pretrained T5 model and has been fine-tuned with instruction data to improve zero-shot and few-shot performance on NLP tasks such as reasoning and question answering.
Procedure:
At a high level, we will:
- Download a model from Hugging Face.
- Deploy the model by using single-model serving with a serving runtime.
- Test the model API.
Set up local S3 storage (MinIO) and Connection
- Navigate to the OpenShift AI dashboard. Please follow these steps to access the NERC OpenShift AI dashboard.
- Use a script to set up local S3 storage (MinIO) on your Data Science Project in the NERC RHOAI, as described here.
- Once your local S3 object storage using MinIO is set up, you can browse to the MinIO Web Console using the provided URL. Enter the Access Key as the Username and the Secret Key as the Password. This opens the Object Browser, where you should verify that the bucket my-storage is visible, as shown below (an optional programmatic check follows this list):
- Click Connections. You should see one connection listed: My Storage, as shown below:
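If you prefer to double-check the bucket from a notebook instead of the web console, here is a minimal sketch using boto3. The endpoint URL and credential placeholders are assumptions; replace them with the values from your MinIO setup:

```python
import boto3

# Placeholder values; replace with your MinIO API endpoint and credentials.
s3 = boto3.client(
    "s3",
    endpoint_url="https://<your-minio-api-endpoint>",
    aws_access_key_id="<your-access-key>",
    aws_secret_access_key="<your-secret-key>",
)

# List all buckets and confirm that "my-storage" is among them.
buckets = [b["Name"] for b in s3.list_buckets()["Buckets"]]
print(buckets)
assert "my-storage" in buckets
```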
Creating a workbench and a notebook
Procedure:
To prepare your Jupyter notebook server (no GPU is required), you need to:
- Select the correct data science project and create a workbench; see Populate the data science project for more information.
Please ensure that you start your Jupyter notebook server with the options depicted in the following configuration screen. This screen gives you the opportunity to select a notebook image and configure its options. As we do not require any GPU resources to run the FLAN-T5 Small model, we can leave the Accelerator field set to its default None selection.
Click Attach existing connections under the Connections section, and attach the "My Storage" connection that was set up previously to the workbench:
Search and add "My Storage":
Click on "Attach" button:
The final workbench setup, before clicking the Create workbench button, should look like this:
For our example project, let's name it "TGIS Workbench". We'll select the Standard Data Science image with the Recommended version (selected by default), choose a Deployment size of Small, set the Accelerator to None (no GPU is needed for this setup), and allocate Cluster storage of 20GB (selected by default).
Verification:
If this procedure is successful, you have started your Jupyter notebook server. When your workbench is ready and the status changes to Running, click the open icon next to your workbench's name, or click the workbench name directly to access your environment:
Note
If you made a mistake, you can edit the workbench to make changes. Please make sure you set the Running status of your workbench to Stopped prior to clicking the action menu (⋮) at the end of the selected workbench row, as shown below:
Authenticate by clicking "mss-keycloak" when prompted, as shown below:
Next, you should see the NERC RHOAI JupyterLab Web Interface, as shown below:
The Jupyter environment is currently empty. To begin, populate it with content using Git. On the left side of the navigation pane, locate the Name explorer panel, where you can create and manage your project directories.
Learn More About Working with Notebooks
For detailed guidance on using notebooks on NERC RHOAI JupyterLab, please refer to this documentation.
Importing the tutorial files into the Jupyter environment
Bring the content of this tutorial inside your Jupyter environment:
On the toolbar, click the Git Clone icon:
Enter the following Git Repo URL: https://github.com/nerc-project/llm-on-nerc
Check the Include submodules option, and then click Clone.
In the file browser, double-click the newly-created llm-on-nerc folder.
Verification:
In the file browser, you should see the notebooks that you cloned from Git.
Downloading the Model
Prerequisites:
First, let's navigate to the relevant notebooks.
- Navigate to llm-on-nerc/llm/tgis.
- In your notebook environment, open the file 1_download_save.ipynb.
- Follow the instructions directly in the notebook.
- The instructions will guide you through downloading the model from Hugging Face and uploading it to the models folder (prefix) inside the bucket mapped through the Connection (a sketch of these two steps follows this list).
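For orientation, the notebook's workflow boils down to two steps: pulling the model files from Hugging Face and copying them to your S3 bucket. The following is a minimal sketch, not the notebook's exact code; it assumes the attached "My Storage" connection injects the AWS_S3_ENDPOINT, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_S3_BUCKET environment variables into the workbench:

```python
import os
from pathlib import Path

import boto3
from huggingface_hub import snapshot_download

# 1. Download google/flan-t5-small from Hugging Face into a local folder.
local_dir = snapshot_download(repo_id="google/flan-t5-small")

# 2. Upload every downloaded file under the models/flan-t5-small prefix.
#    The AWS_* variables are assumed to come from the attached connection.
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["AWS_S3_ENDPOINT"],
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)
bucket = os.environ["AWS_S3_BUCKET"]
for path in Path(local_dir).rglob("*"):
    if path.is_file():
        key = f"models/flan-t5-small/{path.relative_to(local_dir)}"
        s3.upload_file(str(path), bucket, key)
```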
Verification:
When you have completed the notebook instructions, you should see the following files listed under the directory/prefix models/flan-t5-small (you can also verify this programmatically, as sketched after the list):
models/flan-t5-small/README.md
models/flan-t5-small/config.json
models/flan-t5-small/flax_model.msgpack
models/flan-t5-small/generation_config.json
models/flan-t5-small/model.safetensors
models/flan-t5-small/pytorch_model.bin
models/flan-t5-small/special_tokens_map.json
models/flan-t5-small/spiece.model
models/flan-t5-small/tf_model.h5
models/flan-t5-small/tokenizer.json
models/flan-t5-small/tokenizer_config.json
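Here is a small optional sketch that lists the uploaded files from a notebook, again assuming the attached connection injects the AWS_* environment variables into the workbench:

```python
import os

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["AWS_S3_ENDPOINT"],
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

# List every object stored under the models/flan-t5-small prefix.
resp = s3.list_objects_v2(
    Bucket=os.environ["AWS_S3_BUCKET"], Prefix="models/flan-t5-small/"
)
for obj in resp.get("Contents", []):
    print(obj["Key"])
```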
Setting up a Single-model Server and Deploying the Model
- In the left menu, click Data science projects. The Data science projects page opens.
- Click the name of the project that you want to deploy a model in. A project details page opens.
- Click the Models tab.
- Click Deploy model (depending on your dashboard view, this may appear as a button or as an option on the single-model serving platform tile). The Deploy model dialog opens.
Enter the following information for your new model:
- Model deployment name: Enter a unique name for the model that you are deploying (e.g., "flan-t5-small").
- Serving runtime: Select the TGIS Standalone ServingRuntime for KServe runtime.
- Model framework (name - version): This is pre-selected as pytorch.
- Deployment mode: From the Deployment mode list, select the Advanced option.
- Number of model server replicas to deploy: 1.
- Model server size: This is the amount of resources (CPU and RAM) that will be allocated to your server. Here, you can select the Small size.
- Accelerator: Select None.
- Model route: Selecting the checkbox for "Make deployed models available through an external route" makes it possible to send requests to the model endpoint from outside the cluster.
- Token authentication: Select the checkbox for "Require token authentication" if you want to secure or restrict access to the model by forcing requests to provide an authorization token, which is important for security. When selecting it, you can keep the pre-populated Service account name, i.e., default-name.
- Source model location:
  i. Select the connection that you created earlier (as described here) from the Connection dropdown list under the Existing connection option, i.e., My Storage. Alternatively, you can create a new connection directly from this menu by selecting the Create connection option.
  ii. Path: If your model is not located at the root of your connection's bucket, you must enter the path to the folder it is in, i.e., models/flan-t5-small.
- Configuration parameters: You can customize the runtime parameters in the Configuration parameters section. You don't need to add any arguments here.
For our example, set the Model deployment name to flan-t5-small, and select TGIS Standalone ServingRuntime for KServe as the Serving runtime. Also, ensure that the Deployment mode is set to Advanced.
Please leave the other fields with their default settings, such as Number of model server replicas to deploy set to 1, Model server size set to Small, and Accelerator set to None.
At this point, ensure that both Make deployed models available through an external route and Require token authentication are unchecked. Select My Storage as the connection from the Existing connection option, and for the model Path location, enter models/flan-t5-small as the folder path, as shown below:
When you are ready to deploy your model, select the Deploy button.
Confirm that the deployed model appears on the Models tab for your project. After some time, once the model has finished deploying, the model deployments page of the dashboard will display a green checkmark in the Status column, indicating that the deployment is complete.
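If you would rather verify the deployment programmatically, here is a hedged sketch that reads the KServe InferenceService created for the model using the Kubernetes Python client; the namespace name is an assumption you must replace with your data science project's namespace, and your account needs permission to read InferenceServices:

```python
from kubernetes import client, config

# Use load_incluster_config() when running inside a workbench pod;
# use config.load_kube_config() when running locally with a kubeconfig.
config.load_incluster_config()

api = client.CustomObjectsApi()

# Assumed namespace; replace with your data science project's namespace.
isvc = api.get_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="your-project-namespace",
    plural="inferenceservices",
    name="flan-t5-small",
)

# Print the readiness conditions reported by KServe.
for cond in isvc.get("status", {}).get("conditions", []):
    print(cond["type"], cond["status"])
```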
To view details for the deployed model, click the dropdown arrow icon to the left of your deployed model name (e.g., flan-t5-small), as shown below:
You can also modify the configuration of your deployed model by clicking the three dots on the right side and selecting Edit. This brings back the same configuration pop-up window we used earlier. This menu also includes the option to Delete the deployed model.
Check the Model API
The deployed model is now accessible through the API endpoint of the model server. The endpoint information differs depending on how you configured the model server.
If you exposed the model externally through a route, click the "Internal and external endpoint details" link in the Inference endpoint section. A popup displays the internal URL and the external URL (accessible from inside or outside the cluster) for the inference endpoints, as shown below:
Notes:
- The internal URL displayed is only the base address of the endpoint, in the following format: https://name-of-your-model.name-of-your-project-namespace.svc.cluster.local. It is accessible only from within the cluster.
Testing the model API
Now that you've deployed the model, you can test its API endpoints; a minimal request sketch follows the steps below.
- Return to the Jupyter environment.
- Open the file called 2_grpc_request.ipynb.
- Read the code and follow the instructions.
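For reference, TGIS exposes a gRPC generation API rather than a plain REST one. The sketch below is not the notebook's exact code; it assumes you have generated the generation_pb2 and generation_pb2_grpc Python stubs from TGIS's generation.proto (for example with grpcio-tools), and the service host and port 8033 are assumptions you must replace with your model's actual internal endpoint:

```python
import grpc

# Assumed stubs generated from TGIS's generation.proto with grpcio-tools.
import generation_pb2
import generation_pb2_grpc

# Assumed internal endpoint; replace with your model's service host and port.
channel = grpc.insecure_channel(
    "flan-t5-small-predictor.your-project-namespace.svc.cluster.local:8033"
)
stub = generation_pb2_grpc.GenerationServiceStub(channel)

# Send a single prompt to the model and print the generated text.
request = generation_pb2.BatchedGenerationRequest(
    requests=[generation_pb2.GenerationRequest(text="What is the capital of France?")]
)
response = stub.Generate(request)
print(response.responses[0].text)
```

Depending on your cluster's TLS setup, you may need grpc.secure_channel with appropriate credentials instead of an insecure channel; the notebook shows the exact connection details for NERC RHOAI.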
Summary
Deploying validated models from Red Hat AI's Hugging Face Validated Models repository in disconnected OpenShift AI environments involves the following steps:
- Set up local S3 storage (MinIO) and create a connection pointing to the bucket.
- Select the desired model.
- Download the model and upload it to the S3 storage bucket.
- Identify the required serving runtime.
- Configure a single-model server and deploy the model using the connection.
- Verify and test the model's API inference endpoints.
This process ensures that AI workloads run seamlessly in restricted or disconnected environments, allowing you to securely leverage validated and optimized AI models.