Deploy large language models for real-time inference in under 60 seconds. Velar handles GPU allocation, container builds, and auto-scaling so you can focus on your application — not the infrastructure.
Running LLM inference at production scale means managing GPU memory, batching requests, handling cold starts, and auto-scaling to meet demand. Velar abstracts all of this behind a single Python decorator.
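The decorator pattern behind this can be illustrated with a minimal pure-Python sketch. This is not Velar's implementation — the `App` class, `registry` dict, and locally-executing `remote` below are illustrative stand-ins showing how a decorator can register a function together with its GPU and image configuration:

```python
import functools

class App:
    """Toy stand-in for a serverless app object (illustrative, not Velar's code)."""

    def __init__(self, name: str):
        self.name = name
        self.registry = {}  # function name -> (function, deployment config)

    def function(self, gpu=None, image=None):
        def wrap(fn):
            @functools.wraps(fn)
            def remote(*args, **kwargs):
                # A real platform would serialize the call, schedule it on a
                # GPU worker, and stream the result back. Here we run locally.
                return fn(*args, **kwargs)

            fn.remote = remote
            self.registry[fn.__name__] = (fn, {"gpu": gpu, "image": image})
            return fn

        return wrap

app = App("demo")

@app.function(gpu="A100")
def echo(prompt: str) -> str:
    return f"echo: {prompt}"
```

Calling `echo.remote("hi")` runs the registered function, and `app.registry` holds the configuration the decorator captured.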
Velar works with any Python-based serving stack. vLLM and TGI are the most common choices for production LLM inference, but you can use any framework or roll your own.
- **vLLM**: High-throughput inference with PagedAttention. Best for concurrent requests.
- **Text Generation Inference (TGI)**: Hugging Face's optimized serving library. Strong model compatibility.
- **Transformers pipeline**: Quickest to set up. Good for prototyping and lower-traffic endpoints.
- **Custom serving**: Bring your own FastAPI, gRPC, or any HTTP server.
Deploy in under 60 seconds with any serving framework.
**With vLLM**

```python
import velar

app = velar.App("llm-serving")

image = velar.Image.from_registry(
    "pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime"
).pip_install("vllm")

@app.function(gpu="A100", image=image)
def chat(prompt: str, max_tokens: int = 512):
    from vllm import LLM, SamplingParams

    # The model loads when the container starts; Velar keeps warm
    # containers around so subsequent calls skip this step.
    llm = LLM(model="meta-llama/Llama-2-13b-chat-hf")
    params = SamplingParams(max_tokens=max_tokens)
    output = llm.generate([prompt], params)
    return output[0].outputs[0].text

app.deploy()
response = chat.remote("Explain attention mechanisms in transformers")
```

**With TGI**
```python
import velar

app = velar.App("tgi-serving")

image = velar.Image.from_registry(
    "ghcr.io/huggingface/text-generation-inference:latest"
)

@app.function(gpu="H100", image=image)
def generate(prompt: str):
    import requests

    # The TGI server runs inside the container and listens on port 8080.
    resp = requests.post(
        "http://localhost:8080/generate",
        json={"inputs": prompt, "parameters": {"max_new_tokens": 256}},
    )
    return resp.json()["generated_text"]
```

LLM inference is primarily VRAM-bound. Use the table below as a starting point.
| Model | Recommended GPU | Notes |
|---|---|---|
| Llama 3 (8B / 70B) | A100 80GB | Recommended for production; 70B needs quantization or multiple GPUs |
| Mistral 7B / Mixtral 8x7B | A100 80GB | Fast, cost-effective |
| Llama 2 (7B / 13B) | L4 24GB (7B) / A100 (13B) | Good for fine-tuned variants |
| Phi-3 / Gemma | L4 24GB | Lightweight, lower cost |
| GPT-NeoX / GPT-J | RTX 4090 | Open weights, custom use |
See full GPU specs and pricing at velar.run/pricing.
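As a rough sanity check before picking a GPU, you can estimate the VRAM needed just to hold the model weights: fp16 uses about 2 bytes per parameter, and real deployments additionally need room for the KV cache, activations, and framework overhead (budget roughly 1.2-1.5x the weight figure). The helper below is a back-of-the-envelope sketch, not part of Velar's API:

```python
def estimate_weight_vram_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Rough VRAM (GiB) required to hold model weights alone.

    fp16/bf16 -> 2 bytes per parameter; int8 -> 1; int4 -> 0.5.
    Excludes KV cache, activations, and framework overhead.
    """
    return params_billions * 1e9 * bytes_per_param / 1024**3

# Llama 2 13B in fp16 needs ~24 GiB for weights alone, which is why
# the table above pairs the 13B variant with an A100 rather than an L4.
print(round(estimate_weight_vram_gb(13), 1))
```

The same arithmetic explains the int4/int8 quantized builds you often see for smaller cards: halving or quartering bytes-per-param brings a 13B model comfortably inside a 24 GB GPU.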