vLLM Deployment Tutorial: From Local to Cloud GPU
vLLM is the leading open-source LLM inference engine. It achieves state-of-the-art throughput through PagedAttention — a technique that manages KV cache memory like virtual memory in an OS, eliminating fragmentation and enabling efficient concurrent request handling.
This tutorial covers deploying vLLM on a cloud GPU: model selection, configuration, concurrency tuning, and cost optimization.
Why vLLM Over a Plain Transformers Pipeline
When you run inference with Hugging Face Transformers directly, requests are processed one at a time. Under load, requests queue up and GPU utilization drops.
vLLM solves this with two key features:
- Continuous batching — new requests join the batch mid-generation. A request that starts generating doesn't block incoming requests.
- PagedAttention — KV cache is allocated in fixed blocks, like virtual memory pages. This allows 2-4x more concurrent requests for the same VRAM.
In practice, the vLLM team reports up to roughly 24x higher throughput than naive Transformers serving for concurrent workloads; the exact gain depends on model, hardware, and request mix.
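To make the paging analogy concrete, here is a toy calculation, not vLLM's internals, just illustrative arithmetic using vLLM's default block size of 16 tokens:

```python
BLOCK_SIZE = 16  # tokens per KV cache block (vLLM's default)

def blocks_needed(seq_len: int) -> int:
    # Ceiling division: a 100-token sequence occupies 7 blocks,
    # wasting at most BLOCK_SIZE - 1 token slots, instead of a whole
    # contiguous max_model_len-sized preallocation.
    return -(-seq_len // BLOCK_SIZE)

print(blocks_needed(100))   # 7
print(blocks_needed(4096))  # 256
```

Naive preallocation reserves `max_model_len` worth of KV cache per request up front; block-based paging reserves only what each sequence actually uses, so the same VRAM pool serves many more in-flight requests.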
Basic vLLM Deployment
Here's the minimal setup to deploy any vLLM-compatible model:
```python
import velar

app = velar.App("vllm-inference")

image = velar.Image.from_registry(
    "pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime"
).pip_install("vllm==0.4.0")

MODEL = "mistralai/Mistral-7B-Instruct-v0.2"

@app.function(gpu="L4", image=image, timeout=600)
def generate(
    prompt: str,
    max_tokens: int = 256,
    temperature: float = 0.7,
) -> str:
    from vllm import LLM, SamplingParams

    llm = LLM(
        model=MODEL,
        dtype="float16",
        max_model_len=8192,
        gpu_memory_utilization=0.90,
    )
    params = SamplingParams(
        temperature=temperature,
        max_tokens=max_tokens,
        stop=["</s>", "[INST]"],
    )
    output = llm.generate([prompt], params)
    return output[0].outputs[0].text
```

Model Loading Optimization
By default, vLLM downloads the model on every cold start. For large models this takes 5-10 minutes. The fix is to use a persistent volume:
```python
import velar
import os

app = velar.App("vllm-inference")

image = velar.Image.from_registry(
    "pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime"
).pip_install("vllm==0.4.0", "huggingface_hub")

MODEL = "mistralai/Mistral-7B-Instruct-v0.2"
MODEL_CACHE = "/data/models"

@app.function(
    gpu="L4",
    image=image,
    timeout=600,
    volume_size_gb=50,  # Persistent /data volume
    secrets={"HF_TOKEN": "hf_your_token"},
)
def generate(prompt: str, max_tokens: int = 256) -> str:
    from vllm import LLM, SamplingParams
    from huggingface_hub import snapshot_download
    import os

    model_path = os.path.join(MODEL_CACHE, MODEL.replace("/", "--"))

    # Download once, reuse on subsequent calls
    if not os.path.exists(model_path):
        snapshot_download(
            repo_id=MODEL,
            local_dir=model_path,
            token=os.environ["HF_TOKEN"],
        )

    llm = LLM(model=model_path, dtype="float16")
    params = SamplingParams(max_tokens=max_tokens)
    output = llm.generate([prompt], params)
    return output[0].outputs[0].text
```

The `volume_size_gb=50` parameter mounts a persistent disk at `/data`. The model downloads once and is available on all future calls, cutting cold start from 5+ minutes to under 60 seconds.
Chat Template Support
Instruct models expect a specific prompt format. Using the wrong format degrades output quality significantly. vLLM handles this via the tokenizer's built-in chat template:
```python
@app.function(gpu="L4", image=image, timeout=600)
def chat(messages: list[dict], max_tokens: int = 512) -> str:
    """
    messages: [{"role": "user", "content": "Hello"}, ...]
    """
    from vllm import LLM, SamplingParams

    llm = LLM(model=MODEL, dtype="float16")
    tokenizer = llm.get_tokenizer()

    # Apply the model's built-in chat template
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    params = SamplingParams(
        temperature=0.7,
        max_tokens=max_tokens,
        stop_token_ids=[tokenizer.eos_token_id],
    )
    output = llm.generate([prompt], params)
    return output[0].outputs[0].text

# Usage (note: Mistral Instruct's template only accepts alternating
# user/assistant roles, so there is no system message here):
# result = chat.remote([
#     {"role": "user", "content": "What is PagedAttention?"},
# ])
```

GPU and Model Sizing Guide
| Model | GPU | VRAM Used | Max Concurrent (est.) | Cost/hr |
|---|---|---|---|---|
| Mistral 7B | L4 24 GB | ~18 GB | 8-12 | $0.66 |
| Llama 3 8B | L4 24 GB | ~20 GB | 6-10 | $0.66 |
| Mixtral 8x7B | A100 80 GB | ~48 GB | 16-24 | $2.36 |
| Llama 3 70B | 2× A100 80 GB | ~140 GB | 8-12 | $4.72 |
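You can sanity-check the concurrency estimates above with back-of-the-envelope KV cache arithmetic. This is a rough, illustrative calculation: the architecture numbers below come from Mistral 7B's published config, and vLLM's actual accounting adds block-granularity and activation overhead.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    # 2x for the K and V tensors, one pair per layer
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Mistral 7B: 32 layers, 8 KV heads (grouped-query attention), head_dim 128
per_token = kv_cache_bytes_per_token(32, 8, 128)
print(per_token)  # 131072 bytes = 128 KiB per token

# With ~4 GB of VRAM left for KV cache after the float16 weights (~14 GB),
# the budget holds roughly 32k cached tokens:
budget_tokens = (4 * 1024**3) // per_token
print(budget_tokens)  # 32768

# At a 4096-token context, that is ~8 fully-loaded concurrent sequences,
# consistent with the lower end of the table's 8-12 estimate.
print(budget_tokens // 4096)  # 8
```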
Key vLLM Parameters to Tune
These parameters have the biggest impact on performance and cost:
- `gpu_memory_utilization` (default 0.90) — fraction of GPU VRAM vLLM is allowed to use for model weights, activations, and KV cache. Set it higher (0.95) for more concurrency; lower it if you see OOM errors.
- `max_model_len` — maximum sequence length. Setting this lower than the model default reduces KV cache size per request and allows more concurrent requests.
- `tensor_parallel_size` — number of GPUs for tensor parallelism. Set to match your GPU count for models that don't fit on one GPU.
- `dtype` — use `float16` for the best speed/quality balance. `bfloat16` is better for training but behaves similarly for inference. 4-bit quantization (AWQ/GPTQ) cuts weight memory to roughly a quarter of float16 at a small quality cost.
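Putting these together, a throughput-oriented configuration for Mistral 7B on a single L4 might look like the following. The values are illustrative starting points, not tuned results; benchmark against your own traffic.

```python
from vllm import LLM

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    dtype="float16",
    max_model_len=4096,           # below the model max -> smaller KV cache per request
    gpu_memory_utilization=0.95,  # leave only 5% VRAM headroom for more concurrency
    tensor_parallel_size=1,       # single GPU; set to the GPU count for larger models
)
```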
Deploying and Calling from Your App
Once deployed, you can call the function from any Python process that has the Velar SDK installed:
```python
import velar

# In your web server, worker, or script
result = generate.remote(
    "Summarize the following in 3 bullet points: ...",
    max_tokens=256,
)
print(result)
```

Or wrap it in a FastAPI app to expose a simple HTTP endpoint:
```python
from fastapi import FastAPI
from pydantic import BaseModel

from vllm_serve import generate  # your Velar function

api = FastAPI()

class ChatRequest(BaseModel):
    prompt: str
    max_tokens: int = 256

@api.post("/v1/generate")
def complete(req: ChatRequest):
    # Plain `def` (not `async def`) so the blocking .remote() call
    # runs in FastAPI's threadpool instead of stalling the event loop
    result = generate.remote(req.prompt, req.max_tokens)
    return {"text": result}
```

Next Steps
- Add streaming with vLLM's `AsyncLLMEngine` for token-by-token responses
- Try quantized models (AWQ, GPTQ) to fit larger models on smaller GPUs
- Use `gpu_count=2` and `tensor_parallel_size=2` for 70B models
- Monitor throughput with vLLM's built-in metrics endpoint (`/metrics` in the OpenAI-compatible server mode)
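As a starting point for the streaming item, here is a sketch of token-by-token generation with `AsyncLLMEngine` using the vLLM 0.4.x API. It requires a GPU, so treat it as untested scaffolding rather than a drop-in implementation.

```python
import asyncio

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="mistralai/Mistral-7B-Instruct-v0.2", dtype="float16")
)

async def stream_tokens(prompt: str, request_id: str):
    params = SamplingParams(temperature=0.7, max_tokens=256)
    # generate() yields the cumulative RequestOutput after each decode step
    async for request_output in engine.generate(prompt, params, request_id):
        yield request_output.outputs[0].text

async def main():
    previous = ""
    async for text in stream_tokens("What is PagedAttention?", "req-1"):
        print(text[len(previous):], end="", flush=True)  # emit only the new delta
        previous = text

asyncio.run(main())
```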