
LLM inference on GPU

Deploy large language models for real-time inference in under 60 seconds. Velar handles GPU allocation, container builds, and auto-scaling so you can focus on your application — not the infrastructure.

What Velar handles for you

Running LLM inference at production scale requires managing GPU memory, batching requests, handling cold starts, and auto-scaling to demand. Velar abstracts all of this behind a Python decorator.

  • GPU provisioning and deallocation
  • Container builds with your Python dependencies
  • Auto-scaling based on request volume
  • Scale to zero when no requests are in flight
  • Per-second billing — no idle GPU cost
  • Real-time logs and cost tracking
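The request batching mentioned above can be illustrated with a toy sketch: collect incoming prompts until a batch fills or a deadline passes, then run them together. The class name and thresholds here are illustrative only, not Velar's actual internals.

```python
import time
from collections import deque


class MicroBatcher:
    """Toy dynamic batcher: flushes when the batch is full or a
    deadline passes. Illustrative only; Velar's real scheduler is
    internal and not exposed as an API."""

    def __init__(self, max_batch=4, max_wait_s=0.05):
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.queue = deque()
        self.deadline = None

    def submit(self, prompt):
        # Start the wait-timer when the first request arrives
        if not self.queue:
            self.deadline = time.monotonic() + self.max_wait_s
        self.queue.append(prompt)

    def maybe_flush(self):
        """Return a batch if full or past the deadline, else None."""
        if not self.queue:
            return None
        if len(self.queue) >= self.max_batch or time.monotonic() >= self.deadline:
            batch = list(self.queue)
            self.queue.clear()
            return batch
        return None


batcher = MicroBatcher(max_batch=2, max_wait_s=10.0)
batcher.submit("prompt A")
print(batcher.maybe_flush())  # None: batch not full, deadline far away
batcher.submit("prompt B")
print(batcher.maybe_flush())  # full batch: ['prompt A', 'prompt B']
```

Batching like this is why frameworks such as vLLM sustain high throughput: the GPU processes many prompts per forward pass instead of one.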

Serving frameworks supported

Velar works with any Python-based serving stack. vLLM and TGI are the most common choices for production LLM inference, but you can use any framework or roll your own.

vLLM

High-throughput inference with PagedAttention. Best for concurrent requests.

Text Generation Inference (TGI)

Hugging Face's optimized serving library. Strong model compatibility.

Transformers pipeline

Quickest to set up. Good for prototyping and lower-traffic endpoints.

Custom serving

Bring your own FastAPI, gRPC, or any HTTP server.

Code examples

Deploy in under 60 seconds with any serving framework.

With vLLM

inference.py
import velar

app = velar.App("llm-serving")

# CUDA base image with vLLM installed on top
image = velar.Image.from_registry(
    "pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime"
).pip_install("vllm")

@app.function(gpu="A100", image=image)
def chat(prompt: str, max_tokens: int = 512):
    # Import inside the function so it resolves in the GPU container
    from vllm import LLM, SamplingParams

    # Note: constructing LLM here reloads the weights on every call.
    # For sustained traffic, cache the engine at container scope.
    llm = LLM(model="meta-llama/Llama-2-13b-chat-hf")
    params = SamplingParams(max_tokens=max_tokens)
    output = llm.generate([prompt], params)
    return output[0].outputs[0].text

app.deploy()
response = chat.remote("Explain attention mechanisms in transformers")

With TGI

tgi.py
import velar

app = velar.App("tgi-serving")

# Official TGI image; pin a version tag rather than :latest in production
image = velar.Image.from_registry(
    "ghcr.io/huggingface/text-generation-inference:latest"
)

@app.function(gpu="H100", image=image)
def generate(prompt: str):
    import requests

    # Assumes the image's entrypoint has started the TGI server on port 8080
    resp = requests.post(
        "http://localhost:8080/generate",
        json={"inputs": prompt, "parameters": {"max_new_tokens": 256}},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["generated_text"]
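The function above posts TGI's standard `/generate` request body. For reference, the shape of that payload can be built and inspected without a running server; the field names come from TGI's documented API, while the helper function name is ours.

```python
import json


def tgi_generate_payload(prompt, max_new_tokens=256, temperature=None):
    """Build the JSON body for TGI's POST /generate endpoint.
    Only the fields used in the example above, plus an optional
    temperature, are included here."""
    parameters = {"max_new_tokens": max_new_tokens}
    if temperature is not None:
        parameters["temperature"] = temperature
    return {"inputs": prompt, "parameters": parameters}


body = tgi_generate_payload("Hello", max_new_tokens=64)
print(json.dumps(body))
# {"inputs": "Hello", "parameters": {"max_new_tokens": 64}}
```

Keeping payload construction in a small helper like this makes it easy to add TGI sampling parameters later without touching the request logic.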

GPU recommendations by model

LLM inference is primarily VRAM-bound: the model weights must fit in GPU memory, with headroom left for the KV cache. Use the table below as a starting point.

Model                        Recommended GPU    Notes
Llama 3 (8B / 70B)           A100 80GB          Recommended for production
Mistral 7B / Mixtral 8x7B    A100 80GB          Fast, cost-effective
Llama 2 (7B / 13B)           L4 24GB / A100     Good for fine-tuned variants
Phi-3 / Gemma                L4 24GB            Lightweight, lower cost
GPT-NeoX / GPT-J             RTX 4090           Open weights, custom use
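A quick way to sanity-check these recommendations is to estimate weight memory: parameter count times bytes per parameter, plus headroom for the KV cache and activations. The 20% headroom factor below is a rough rule of thumb, not a Velar figure.

```python
def estimate_vram_gb(params_billion, bytes_per_param=2, overhead=1.2):
    """Rough VRAM estimate in GB: model weights (fp16 = 2 bytes per
    parameter) plus ~20% headroom for KV cache and activations.
    A coarse rule of thumb, not a precise sizing tool."""
    return params_billion * bytes_per_param * overhead


# Llama 2 7B in fp16: ~17 GB -> fits an L4 24GB
print(round(estimate_vram_gb(7), 1))   # 16.8
# Llama 2 13B in fp16: ~31 GB -> wants an A100 (or quantization)
print(round(estimate_vram_gb(13), 1))  # 31.2
```

This is why the table pairs 7B-class models with the L4 24GB while 13B+ models move to the A100 80GB; quantizing to 8-bit or 4-bit roughly halves or quarters the weight term.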

See full GPU specs and pricing at velar.run/pricing.

Deploy your first LLM today

$10 in free GPU credits. No credit card required.