Deploy the LLaMA model with vLLM Runtime

Serving LLM models can be surprisingly slow even on high-end GPUs. vLLM is a fast and easy-to-use LLM inference engine that can achieve 10x-20x higher throughput than Hugging Face Transformers. It supports continuous batching for increased throughput and GPU utilization, and PagedAttention to address the memory bottleneck of autoregressive decoding, where all the attention key and value tensors (the KV cache) must be kept in GPU memory to generate the next tokens.
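
Since the InferenceService below simply runs vLLM's OpenAI-compatible API server, you can try the same entrypoint locally first. This is only a sketch: it assumes a CUDA-capable GPU, that vLLM is installed via pip, and that you have access to the gated meta-llama/Llama-2-7b-hf weights on Hugging Face.

# Install vLLM locally (assumption: CUDA-capable GPU available)
pip install vllm

# Same entrypoint and flags that the KServe predictor container runs below
python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Llama-2-7b-hf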

You can deploy the LLaMA model with a pre-built vLLM inference server container image using the InferenceService yaml API spec. Integration of vLLM with the Open Inference Protocol and the KServe observability stack is work in progress.

The LLaMA model can be downloaded from Hugging Face and uploaded to your cloud storage.
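
One way to do this is sketched below with the Hugging Face CLI and gsutil; the repository id, local directory, and gs://your-bucket/llm/huggingface/llama destination are placeholders, and downloading the Llama 2 weights requires an approved Hugging Face access token.

# Download the model weights and tokenizer (requires huggingface_hub and an HF access token)
huggingface-cli download meta-llama/Llama-2-7b-hf --local-dir ./llama

# Upload the model folder to your own cloud storage bucket
gsutil cp -r ./llama gs://your-bucket/llm/huggingface/llama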

Create a file named vllm.yaml with the following InferenceService spec:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-2-7b
spec:
  predictor:
    containers:
    - args:
        - --port
        - "8080"
        - --model
        - /mnt/models
      command:
        - python3
        - -m
        - vllm.entrypoints.openai.api_server
      env:
        - name: STORAGE_URI
          value: gs://kfserving-examples/llm/huggingface/llama
      image: kserve/vllmserver:latest
      name: kserve-container
      resources:
        limits:
          cpu: "4"
          memory: 50Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "1"
          memory: 50Gi
          nvidia.com/gpu: "1"

Warning

The vLLM runtime is still experimental. Please expect API changes and further integration in the next KServe release.

kubectl apply -n kserve-test -f ./vllm.yaml
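
The storage initializer first downloads the model from the STORAGE_URI, which can take several minutes for a 7B model. You can check that the InferenceService becomes ready with:

kubectl get inferenceservice llama-2-7b -n kserve-test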

Benchmarking vLLM Runtime

You can download the benchmark dataset by running:

wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

The tokenizer can be found in the downloaded LLaMA model directory.

The benchmark assumes that your ingress can be accessed at ${INGRESS_HOST}:${INGRESS_PORT}; you can follow this instruction to find out your ingress IP and port.
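
If your cluster exposes KServe through the Istio ingress gateway, the sketch below shows one common way to look up the ingress address and send a single test request to the OpenAI-compatible completions endpoint before benchmarking. The istio-system namespace, the istio-ingressgateway service name, and the served model name /mnt/models (taken from the --model flag above) are assumptions that may differ in your environment.

# Assumption: KServe is exposed via the Istio ingress gateway
export INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')

# Hostname assigned to the InferenceService by KServe
export SERVICE_HOSTNAME=$(kubectl get inferenceservice llama-2-7b -n kserve-test -o jsonpath='{.status.url}' | cut -d "/" -f 3)

# Send one test completion request before running the full benchmark
curl -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
  http://${INGRESS_HOST}:${INGRESS_PORT}/v1/completions \
  -d '{"model": "/mnt/models", "prompt": "San Francisco is a", "max_tokens": 32}'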

You can then run the benchmarking script to send inference requests to the exposed URL.

python benchmark_serving.py --backend openai --port ${INGRESS_PORT} --host ${INGRESS_HOST} --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --tokenizer ./tokenizer --request-rate 5

Expected Output

   Total time: 216.81 s
   Throughput: 4.61 requests/s
   Average latency: 7.96 s
   Average latency per token: 0.02 s
   Average latency per output token: 0.04 s