Deploy and Test the Base Model

Deploy the Base Model with VLLM

Ready to deploy your model? Let’s get started! Follow these steps to bring your model to life in the Data Science Project (userX):

  1. Navigate to Your Project: Head over to the Data Science Project you created and locate the Models section.

  2. Select the Serving Platform: Click on Models and choose the Single-model serving platform option.

    [Screenshot: single-model serving platform]
  3. Deploy Your Model: Click on the Deploy model button to start the deployment process.

    [Screenshot: Deploy model button]
  4. Fill Out the Deployment Form: You’ll need to provide some essential information. Here’s what to enter:

    • Name: base

    • Serving runtime: vLLM ServingRuntime for KServe

    • Model server size: Small

    • Accelerator: NVIDIA GPU

    • Model route: Select the option to make your model available through an external route.

    • Token authentication: Choose Require token authentication and leave the default Service account name.

    • Existing connection:

      • Connection: Minio - models

      • Path: base_model

        [Screenshot: model deployment form inputs]
  5. Deploy and Wait: After filling out the form, click on Deploy, then wait while your model starts up. This might take a few minutes! ☕

    [Screenshot: model deployment status]

Test the Base Model

Congratulations on successfully deploying your model! 🎉 Now, it’s time to put it to the test. Get ready to send a request to your model and measure its response time. Let’s dive in!

Workbench Setup

As in Section 2.3, go to Data Science Projects, select your previously created project (userX), and:

  • Click on Create workbench to create a new workbench named terminal. This time, do not attach any GPU; the Small size is enough.

    [Screenshot: workbench list]
    [Screenshot: workbench creation]
    [Screenshot: workbench ready]
  • Open the terminal workbench and launch a terminal inside it.

    [Screenshot: workbench terminal]
  • Clone the repository https://github.com/luis5tb/neural-magic-workshop.git and change into the neural-magic-workshop/lab-materials/04 folder, using the commands below.

    [Screenshot: cloning the repository]
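In the terminal, that amounts to:

git clone https://github.com/luis5tb/neural-magic-workshop.git
cd neural-magic-workshop/lab-materials/04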
  • Update Your Variables: Open the request.py file and update the following variables to match your setup:

    [Screenshot: request.py variables]
MODEL = "your-model-name"  # Replace with your model name
URL = "your-api-url"       # Replace with your API endpoint
API_KEY = "your-api-key"   # Replace with your API key

To fill in these variables, use the information from your deployed model:

  • Set MODEL to the name of your model (base).

  • For the URL, open the internal and external endpoint details of your deployed model and use the external endpoint.

    [Screenshot: inference endpoints]
  • Copy the model server token and use it as the API_KEY.

    [Screenshot: model server token]
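Before editing the script, you can optionally sanity-check the endpoint and token. The vLLM runtime serves an OpenAI-compatible API, so listing the served models should succeed. This is a minimal sketch, assuming the requests package is available in the workbench and using placeholder values for the endpoint and token:

import requests

URL = "your-api-url"        # placeholder: your external endpoint
API_KEY = "your-api-key"    # placeholder: your model server token

# vLLM's OpenAI-compatible API lives under /v1; a 200 response with the
# model list confirms that both the endpoint and the token are valid.
resp = requests.get(
    f"{URL}/v1/models",
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())

If the response lists a model named base, you're ready to go.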

Install the Required Dependency: Go back to the terminal you opened earlier and install the package needed to interact with your model:

pip install langchain_openai
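With the package installed, a minimal request.py looks roughly like this. This is a sketch only: the actual script in the repository may use a different prompt and structure, and it assumes the external endpoint does not already include the /v1 suffix:

from langchain_openai import ChatOpenAI

MODEL = "base"              # the deployment name you chose
URL = "your-api-url"        # placeholder: your external endpoint
API_KEY = "your-api-key"    # placeholder: your model server token

# ChatOpenAI can talk to any OpenAI-compatible server, including vLLM;
# point it at the /v1 path of the deployed model's external endpoint.
llm = ChatOpenAI(
    model=MODEL,
    base_url=f"{URL}/v1",
    api_key=API_KEY,
    temperature=0,
)

response = llm.invoke("Briefly explain what model quantization is.")
print(response.content)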

Running the Script

You’re almost there! To run the script and measure its execution time, simply execute the following command in the terminal:

time python request.py

Once you run the script, you'll see some exciting output, including:

  • The script's output

  • Real time (wall-clock time): the total elapsed time which, for a remote request, is dominated by network latency and server-side inference

  • User CPU time: time the local CPU spent executing the script itself

  • System CPU time: time the local CPU spent in kernel calls on the script's behalf

    [Screenshot: request output for the base model]

This is your chance to see how well your model performs! 🚀
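Keep in mind that the real time reported by time also includes Python interpreter startup and import time, not just the model call. If you want to time only the request itself, a small sketch (reusing the same placeholder values as above) would be:

import time
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="base",
    base_url="your-api-url/v1",   # placeholder: external endpoint + /v1
    api_key="your-api-key",       # placeholder: model server token
)

start = time.perf_counter()       # high-resolution wall-clock timer
response = llm.invoke("Briefly explain what model quantization is.")
elapsed = time.perf_counter() - start

print(response.content)
print(f"Request took {elapsed:.2f} seconds")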

Remove the Model

When you’re done testing, don’t forget to clean up. Simply click on the Delete button in the Models tab to remove the model.

🚨 Make sure to remove the model before proceeding to the next step to ensure you have enough GPUs available for your next tasks.
[Screenshot: model delete menu]
[Screenshot: confirm deletion of the base model]