
LLM inference on GPU

Deploy large language models for real-time inference in under 60 seconds. Velar handles GPU allocation, container builds, and auto-scaling so you can focus on your application — not the infrastructure.

What Velar handles for you

Running LLM inference at production scale requires managing GPU memory, batching requests, handling cold starts, and auto-scaling to demand. Velar abstracts all of this behind a Python decorator.

  • GPU provisioning and deallocation
  • Container builds with your Python dependencies
  • Auto-scaling based on request volume
  • Scale to zero when no requests are in flight
  • Per-second billing — no idle GPU cost
  • Real-time logs and cost tracking
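The request batching mentioned above can be illustrated with a toy sketch: collect incoming prompts until a batch fills or a deadline passes, then run them together. The class name and thresholds here are illustrative only, not Velar's actual internals.

```python
import time
from collections import deque


class MicroBatcher:
    """Toy dynamic batcher: flushes when the batch is full or a
    deadline passes. Illustrative only; Velar's real scheduler is
    internal and not exposed as an API."""

    def __init__(self, max_batch=4, max_wait_s=0.05):
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.queue = deque()
        self.deadline = None

    def submit(self, prompt):
        # Start the wait-timer when the first request arrives
        if not self.queue:
            self.deadline = time.monotonic() + self.max_wait_s
        self.queue.append(prompt)

    def maybe_flush(self):
        """Return a batch if full or past the deadline, else None."""
        if not self.queue:
            return None
        if len(self.queue) >= self.max_batch or time.monotonic() >= self.deadline:
            batch = list(self.queue)
            self.queue.clear()
            return batch
        return None


batcher = MicroBatcher(max_batch=2, max_wait_s=10.0)
batcher.submit("prompt A")
print(batcher.maybe_flush())  # None: batch not full, deadline far away
batcher.submit("prompt B")
print(batcher.maybe_flush())  # full batch: ['prompt A', 'prompt B']
```

Batching like this is why frameworks such as vLLM sustain high throughput: the GPU processes many prompts per forward pass instead of one.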

Serving frameworks supported

Velar works with any Python-based serving stack. vLLM and TGI are the most common choices for production LLM inference, but you can use any framework or roll your own.

vLLM

High-throughput inference with PagedAttention. Best for concurrent requests.

Text Generation Inference (TGI)

Hugging Face's optimized serving library. Strong model compatibility.

Transformers pipeline

Quickest to set up. Good for prototyping and lower-traffic endpoints.

Custom serving

Bring your own FastAPI, gRPC, or any HTTP server.

Code examples

Deploy in under 60 seconds with any serving framework.

With vLLM

inference.py
import velar

app = velar.App("llm-serving")

# CUDA base image with vLLM installed on top
image = velar.Image.from_registry(
    "pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime"
).pip_install("vllm")

@app.function(gpu="A100", image=image)
def chat(prompt: str, max_tokens: int = 512):
    # Import inside the function so it resolves in the GPU container
    from vllm import LLM, SamplingParams

    # Note: constructing LLM here reloads the weights on every call.
    # For sustained traffic, cache the engine at container scope.
    llm = LLM(model="meta-llama/Llama-2-13b-chat-hf")
    params = SamplingParams(max_tokens=max_tokens)
    output = llm.generate([prompt], params)
    return output[0].outputs[0].text

app.deploy()
response = chat.remote("Explain attention mechanisms in transformers")

With TGI

tgi.py
import velar

app = velar.App("tgi-serving")

# Official TGI image; pin a version tag rather than :latest in production
image = velar.Image.from_registry(
    "ghcr.io/huggingface/text-generation-inference:latest"
)

@app.function(gpu="H100", image=image)
def generate(prompt: str):
    import requests

    # Assumes the image's entrypoint has started the TGI server on port 8080
    resp = requests.post(
        "http://localhost:8080/generate",
        json={"inputs": prompt, "parameters": {"max_new_tokens": 256}},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["generated_text"]
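The function above posts TGI's standard `/generate` request body. For reference, the shape of that payload can be built and inspected without a running server; the field names come from TGI's documented API, while the helper function name is ours.

```python
import json


def tgi_generate_payload(prompt, max_new_tokens=256, temperature=None):
    """Build the JSON body for TGI's POST /generate endpoint.
    Only the fields used in the example above, plus an optional
    temperature, are included here."""
    parameters = {"max_new_tokens": max_new_tokens}
    if temperature is not None:
        parameters["temperature"] = temperature
    return {"inputs": prompt, "parameters": parameters}


body = tgi_generate_payload("Hello", max_new_tokens=64)
print(json.dumps(body))
# {"inputs": "Hello", "parameters": {"max_new_tokens": 64}}
```

Keeping payload construction in a small helper like this makes it easy to add TGI sampling parameters later without touching the request logic.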

GPU recommendations by model

LLM inference is primarily VRAM-bound: the model weights must fit in GPU memory, with headroom left for the KV cache. Use the table below as a starting point.

Model                        Recommended GPU    Notes
Llama 3 (8B / 70B)           A100 80GB          Recommended for production
Mistral 7B / Mixtral 8x7B    A100 80GB          Fast, cost-effective
Llama 2 (7B / 13B)           L4 24GB / A100     Good for fine-tuned variants
Phi-3 / Gemma                L4 24GB            Lightweight, lower cost
GPT-NeoX / GPT-J             RTX 4090           Open weights, custom use
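A quick way to sanity-check these recommendations is to estimate weight memory: parameter count times bytes per parameter, plus headroom for the KV cache and activations. The 20% headroom factor below is a rough rule of thumb, not a Velar figure.

```python
def estimate_vram_gb(params_billion, bytes_per_param=2, overhead=1.2):
    """Rough VRAM estimate in GB: model weights (fp16 = 2 bytes per
    parameter) plus ~20% headroom for KV cache and activations.
    A coarse rule of thumb, not a precise sizing tool."""
    return params_billion * bytes_per_param * overhead


# Llama 2 7B in fp16: ~17 GB -> fits an L4 24GB
print(round(estimate_vram_gb(7), 1))   # 16.8
# Llama 2 13B in fp16: ~31 GB -> wants an A100 (or quantization)
print(round(estimate_vram_gb(13), 1))  # 31.2
```

This is why the table pairs 7B-class models with the L4 24GB while 13B+ models move to the A100 80GB; quantizing to 8-bit or 4-bit roughly halves or quarters the weight term.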

See full GPU specs and pricing at velar.run/pricing.

Deploy your first LLM today

$10 in free GPU credits. No credit card required.