vLLM Deployment Tutorial: From Local to Cloud GPU
vLLM is the leading open-source LLM inference engine. It achieves state-of-the-art throughput through PagedAttention — a technique that manages KV cache memory like virtual memory in an OS, eliminating fragmentation and enabling efficient concurrent request handling.
This tutorial covers deploying vLLM on a cloud GPU: model selection, configuration, concurrency tuning, and cost optimization.
Why vLLM Over a Plain Transformers Pipeline
When you run inference with Hugging Face Transformers directly, requests are processed one at a time. Under load, requests queue up and GPU utilization drops.
vLLM solves this with two key features:
- Continuous batching — new requests join the batch mid-generation. A request that starts generating doesn't block incoming requests.
- PagedAttention — KV cache is allocated in fixed blocks, like virtual memory pages. This allows 2-4x more concurrent requests for the same VRAM.
In practice, the vLLM team reports up to roughly 24x higher throughput than naive Transformers serving for concurrent workloads; the exact gain depends on model, hardware, and request mix.
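To make the paging analogy concrete, here is a toy calculation, not vLLM's internals, just illustrative arithmetic using vLLM's default block size of 16 tokens:

```python
BLOCK_SIZE = 16  # tokens per KV cache block (vLLM's default)

def blocks_needed(seq_len: int) -> int:
    # Ceiling division: a 100-token sequence occupies 7 blocks,
    # wasting at most BLOCK_SIZE - 1 token slots, instead of a whole
    # contiguous max_model_len-sized preallocation.
    return -(-seq_len // BLOCK_SIZE)

print(blocks_needed(100))   # 7
print(blocks_needed(4096))  # 256
```

Naive preallocation reserves `max_model_len` worth of KV cache per request up front; block-based paging reserves only what each sequence actually uses, so the same VRAM pool serves many more in-flight requests.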
Basic vLLM Deployment
Here's the minimal setup to deploy any vLLM-compatible model:
```python
import velar

app = velar.App("vllm-inference")

image = velar.Image.from_registry(
    "pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime"
).pip_install("vllm==0.4.0")

MODEL = "mistralai/Mistral-7B-Instruct-v0.2"

@app.function(gpu="L4", image=image, timeout=600)
def generate(
    prompt: str,
    max_tokens: int = 256,
    temperature: float = 0.7,
) -> str:
    from vllm import LLM, SamplingParams

    llm = LLM(
        model=MODEL,
        dtype="float16",
        max_model_len=8192,
        gpu_memory_utilization=0.90,
    )
    params = SamplingParams(
        temperature=temperature,
        max_tokens=max_tokens,
        stop=["</s>", "[INST]"],
    )
    output = llm.generate([prompt], params)
    return output[0].outputs[0].text
```

Model Loading Optimization
By default, vLLM downloads the model on every cold start. For large models this takes 5-10 minutes. The fix is to use a persistent volume:
```python
import velar
import os

app = velar.App("vllm-inference")

image = velar.Image.from_registry(
    "pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime"
).pip_install("vllm==0.4.0", "huggingface_hub")

MODEL = "mistralai/Mistral-7B-Instruct-v0.2"
MODEL_CACHE = "/data/models"

@app.function(
    gpu="L4",
    image=image,
    timeout=600,
    volume_size_gb=50,  # Persistent /data volume
    secrets={"HF_TOKEN": "hf_your_token"},
)
def generate(prompt: str, max_tokens: int = 256) -> str:
    from vllm import LLM, SamplingParams
    from huggingface_hub import snapshot_download
    import os

    model_path = os.path.join(MODEL_CACHE, MODEL.replace("/", "--"))

    # Download once, reuse on subsequent calls
    if not os.path.exists(model_path):
        snapshot_download(
            repo_id=MODEL,
            local_dir=model_path,
            token=os.environ["HF_TOKEN"],
        )

    llm = LLM(model=model_path, dtype="float16")
    params = SamplingParams(max_tokens=max_tokens)
    output = llm.generate([prompt], params)
    return output[0].outputs[0].text
```

The `volume_size_gb=50` parameter mounts a persistent disk at `/data`. The model downloads once and is available on all future calls, cutting cold start from 5+ minutes to under 60 seconds.
Chat Template Support
Instruct models expect a specific prompt format. Using the wrong format degrades output quality significantly. vLLM handles this via the tokenizer's built-in chat template:
```python
@app.function(gpu="L4", image=image, timeout=600)
def chat(messages: list[dict], max_tokens: int = 512) -> str:
    """
    messages: [{"role": "user", "content": "Hello"}, ...]
    """
    from vllm import LLM, SamplingParams

    llm = LLM(model=MODEL, dtype="float16")
    tokenizer = llm.get_tokenizer()

    # Apply the model's built-in chat template
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    params = SamplingParams(
        temperature=0.7,
        max_tokens=max_tokens,
        stop_token_ids=[tokenizer.eos_token_id],
    )
    output = llm.generate([prompt], params)
    return output[0].outputs[0].text

# Usage (note: Mistral Instruct's template only accepts alternating
# user/assistant roles, so there is no system message here):
# result = chat.remote([
#     {"role": "user", "content": "What is PagedAttention?"},
# ])
```

GPU and Model Sizing Guide
| Model | GPU | VRAM Used | Max Concurrent (est.) | Cost/hr |
|---|---|---|---|---|
| Mistral 7B | L4 24 GB | ~18 GB | 8-12 | $0.66 |
| Llama 3 8B | L4 24 GB | ~20 GB | 6-10 | $0.66 |
| Mixtral 8x7B | A100 80 GB | ~48 GB | 16-24 | $2.36 |
| Llama 3 70B | 2× A100 80 GB | ~140 GB | 8-12 | $4.72 |
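You can sanity-check the concurrency estimates above with back-of-the-envelope KV cache arithmetic. This is a rough, illustrative calculation: the architecture numbers below come from Mistral 7B's published config, and vLLM's actual accounting adds block-granularity and activation overhead.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    # 2x for the K and V tensors, one pair per layer
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Mistral 7B: 32 layers, 8 KV heads (grouped-query attention), head_dim 128
per_token = kv_cache_bytes_per_token(32, 8, 128)
print(per_token)  # 131072 bytes = 128 KiB per token

# With ~4 GB of VRAM left for KV cache after the float16 weights (~14 GB),
# the budget holds roughly 32k cached tokens:
budget_tokens = (4 * 1024**3) // per_token
print(budget_tokens)  # 32768

# At a 4096-token context, that is ~8 fully-loaded concurrent sequences,
# consistent with the lower end of the table's 8-12 estimate.
print(budget_tokens // 4096)  # 8
```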
Key vLLM Parameters to Tune
These parameters have the biggest impact on performance and cost:
- `gpu_memory_utilization` (default 0.90) — fraction of GPU VRAM vLLM is allowed to use for model weights, activations, and KV cache. Set it higher (0.95) for more concurrency; lower it if you see OOM errors.
- `max_model_len` — maximum sequence length. Setting this lower than the model default reduces KV cache size per request and allows more concurrent requests.
- `tensor_parallel_size` — number of GPUs for tensor parallelism. Set to match your GPU count for models that don't fit on one GPU.
- `dtype` — use `float16` for the best speed/quality balance. `bfloat16` is better for training but behaves similarly for inference. 4-bit quantization (AWQ/GPTQ) cuts weight memory to roughly a quarter of float16 at a small quality cost.
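Putting these together, a throughput-oriented configuration for Mistral 7B on a single L4 might look like the following. The values are illustrative starting points, not tuned results; benchmark against your own traffic.

```python
from vllm import LLM

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    dtype="float16",
    max_model_len=4096,           # below the model max -> smaller KV cache per request
    gpu_memory_utilization=0.95,  # leave only 5% VRAM headroom for more concurrency
    tensor_parallel_size=1,       # single GPU; set to the GPU count for larger models
)
```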
Deploying and Calling from Your App
Once deployed, you can call the function from any Python process that has the Velar SDK installed:
```python
import velar

# In your web server, worker, or script
result = generate.remote(
    "Summarize the following in 3 bullet points: ...",
    max_tokens=256,
)
print(result)
```

Or wrap it in a FastAPI app to expose a simple HTTP endpoint:
```python
from fastapi import FastAPI
from pydantic import BaseModel

from vllm_serve import generate  # your Velar function

api = FastAPI()

class ChatRequest(BaseModel):
    prompt: str
    max_tokens: int = 256

@api.post("/v1/generate")
def complete(req: ChatRequest):
    # Plain `def` (not `async def`) so the blocking .remote() call
    # runs in FastAPI's threadpool instead of stalling the event loop
    result = generate.remote(req.prompt, req.max_tokens)
    return {"text": result}
```

Next Steps
- Add streaming with vLLM's `AsyncLLMEngine` for token-by-token responses
- Try quantized models (AWQ, GPTQ) to fit larger models on smaller GPUs
- Use `gpu_count=2` and `tensor_parallel_size=2` for 70B models
- Monitor throughput with vLLM's built-in metrics endpoint (`/metrics` in the OpenAI-compatible server mode)
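As a starting point for the streaming item, here is a sketch of token-by-token generation with `AsyncLLMEngine` using the vLLM 0.4.x API. It requires a GPU, so treat it as untested scaffolding rather than a drop-in implementation.

```python
import asyncio

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="mistralai/Mistral-7B-Instruct-v0.2", dtype="float16")
)

async def stream_tokens(prompt: str, request_id: str):
    params = SamplingParams(temperature=0.7, max_tokens=256)
    # generate() yields the cumulative RequestOutput after each decode step
    async for request_output in engine.generate(prompt, params, request_id):
        yield request_output.outputs[0].text

async def main():
    previous = ""
    async for text in stream_tokens("What is PagedAttention?", "req-1"):
        print(text[len(previous):], end="", flush=True)  # emit only the new delta
        previous = text

asyncio.run(main())
```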