Serverless GPU Inference: How It Works and When to Use It
“Serverless GPU” is becoming a common term in the ML infrastructure space, but it means different things to different vendors. This post explains what it actually means technically, the tradeoffs you should understand, and how to decide when it makes sense for your workload.
What “Serverless” Means for GPUs
In traditional cloud computing, serverless means you write a function, upload it, and the cloud provider handles provisioning. You pay per invocation. AWS Lambda is the canonical example.
Serverless GPU works the same way conceptually, but with a critical difference: GPUs can't cold-start in milliseconds. Loading a PyTorch model onto a GPU takes seconds to minutes depending on model size. So serverless GPU infrastructure has to solve a harder problem.
The key insight is that “serverless” in the GPU context means you don't manage servers — not necessarily that every invocation starts from scratch. The platform handles:
- Provisioning GPU instances on demand
- Keeping instances warm between calls (configurable)
- Scaling to zero when idle
- Billing only for actual compute time
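The warm-pool and scale-to-zero behavior above can be sketched as a toy model in plain Python (the class and its names are illustrative, not any vendor's API): a container stays warm for a keep-alive window after each call, is reclaimed after that, and billing accrues only while a request is actually running.

```python
class WarmPool:
    """Toy model of a serverless GPU container lifecycle (illustrative only)."""

    def __init__(self, keep_alive_s: float = 60.0):
        self.keep_alive_s = keep_alive_s
        self.last_used: float | None = None  # None means scaled to zero
        self.billed_s = 0.0

    def handle_request(self, now: float, duration_s: float) -> str:
        # Cold start if no container finished within the keep-alive window
        if self.last_used is None or now - self.last_used > self.keep_alive_s:
            start_type = "cold"
        else:
            start_type = "warm"
        self.last_used = now + duration_s
        self.billed_s += duration_s  # pay only for compute time, not idle time
        return start_type

pool = WarmPool(keep_alive_s=60)
first = pool.handle_request(now=0, duration_s=2)     # no container yet: cold
second = pool.handle_request(now=30, duration_s=2)   # within keep-alive: warm
third = pool.handle_request(now=200, duration_s=2)   # reclaimed after idle: cold
```

After these three calls, `pool.billed_s` is 6 seconds, even though 202 seconds of wall-clock time elapsed; a dedicated instance would bill for all of it.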
How Cold Starts Work
A cold start on a GPU involves several steps:
- Instance provisioning — the cloud provider allocates a GPU machine. This is typically 5-30 seconds.
- Container pull — your Docker image is pulled from the registry to the machine. For large ML images (10-30 GB), this can take 1-5 minutes on first deploy. Subsequent calls use the cached image.
- Model loading — your Python code loads model weights from disk or downloads them. A 7B model in float16 is ~14 GB and takes 10-30 seconds to load into GPU memory.
The total cold start time for a typical LLM inference deployment is 1-3 minutes. This is why serverless GPU providers use image caching and warm pools to minimize how often cold starts happen.
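To make the arithmetic concrete, here is a back-of-the-envelope sketch that sums the three cold-start components and shows where the "7B in float16 is ~14 GB" figure comes from. The specific timings are placeholders drawn from the rough ranges above, not measurements.

```python
def fp16_model_size_gb(params_billion: float) -> float:
    # float16 stores 2 bytes per parameter, so 7B params -> ~14 GB of weights
    return params_billion * 1e9 * 2 / 1e9

def cold_start_s(provision_s: float, image_pull_s: float,
                 model_load_s: float, image_cached: bool) -> float:
    # A cached image skips the registry pull entirely
    pull = 0.0 if image_cached else image_pull_s
    return provision_s + pull + model_load_s

# First deploy: 20 s provision + 120 s image pull + 20 s model load
first_deploy = cold_start_s(20, 120, 20, image_cached=False)   # 160 s
# Later cold starts reuse the cached image and skip the pull
later_cold = cold_start_s(20, 120, 20, image_cached=True)      # 40 s
```

This is why image caching matters so much: with placeholder numbers, it cuts the cold start from ~2.5 minutes to ~40 seconds, and warm pools eliminate the remaining cost for back-to-back requests.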
```python
import velar

app = velar.App("inference")

# Image caching: Velar skips the rebuild if the Dockerfile is unchanged
image = velar.Image.from_registry(
    "pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime"
).pip_install("transformers", "accelerate")

pipe = None  # cached across warm invocations

@app.function(gpu="L4", image=image)
def predict(text: str) -> str:
    # Model loading happens once per container lifecycle;
    # Velar keeps the container warm between calls
    global pipe
    if pipe is None:
        from transformers import pipeline
        pipe = pipeline("text-generation", model="gpt2", device=0)
    return pipe(text, max_new_tokens=100)[0]["generated_text"]
```

Serverless vs. Dedicated GPU Instances
The right choice depends on your traffic pattern. Here's how to think about it:
| Metric | Serverless GPU | Dedicated Instance |
|---|---|---|
| Idle cost | $0 (scales to zero) | Full hourly rate |
| First request latency | 1-3 min (cold start) | <1 second |
| Warm request latency | Comparable to dedicated | <1 second |
| Peak scaling | Automatic, horizontal | Manual or auto-scaling groups |
| Ops overhead | None | High (patching, monitoring, etc.) |
| Break-even point | Cheaper below ~40-60% utilization | Cheaper above ~40-60% utilization |
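The break-even row follows directly from the pricing model: serverless bills only busy time, typically at a higher rate, while a dedicated instance bills around the clock, so the curves cross where utilization equals the price ratio. The prices below are placeholders, not any vendor's actual rates.

```python
def breakeven_utilization(dedicated_per_hr: float,
                          serverless_per_busy_hr: float) -> float:
    """Utilization above which a dedicated instance becomes cheaper.

    Serverless cost/hr = utilization * serverless_per_busy_hr;
    dedicated cost/hr is flat, so the crossover is the price ratio.
    """
    return dedicated_per_hr / serverless_per_busy_hr

# Placeholder prices: a dedicated GPU at $1.00/hr vs. serverless billed
# at $2.00 per busy hour gives a 50% break-even utilization.
threshold = breakeven_utilization(dedicated_per_hr=1.00,
                                  serverless_per_busy_hr=2.00)
```

With these placeholder numbers, a GPU busy less than half the time is cheaper on serverless; the ~40-60% range in the table reflects typical real-world price ratios.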
When Serverless GPU Wins
Serverless is the right choice when your GPU isn't running continuously:
- Variable traffic — if your API gets 100 req/min at peak and 0 req/min overnight, a dedicated instance wastes 8+ hours of GPU time daily.
- Batch jobs — one-off processing tasks (embeddings, transcription, data labeling) benefit massively from serverless. You spin up 50 GPU workers, process your dataset in parallel, and pay for only the time used.
- Development and staging — running a GPU instance 24/7 for testing burns budget. Serverless means you only pay when actually running.
- Spiky workloads — product launches, viral moments, scheduled jobs at midnight.
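The batch fan-out pattern above can be sketched with a thread pool standing in for remote GPU workers. Everything here is a local stand-in: `embed` represents your GPU-backed function, and a real serverless SDK would run each call on its own remote worker rather than a local thread.

```python
from concurrent.futures import ThreadPoolExecutor

def embed(doc: str) -> list[float]:
    # Stand-in for a GPU-backed function; a real deployment would
    # execute this remotely on a serverless GPU worker.
    return [float(len(doc))]

def run_batch(docs: list[str], workers: int = 50) -> list[list[float]]:
    # Fan out across workers; with serverless billing you pay only
    # for the wall-clock time each worker actually spends processing.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(embed, docs))

results = run_batch(["short doc", "a somewhat longer document"], workers=2)
```

The appeal of serverless here is that 50 workers for 10 minutes costs the same as 1 worker for 500 minutes, so parallelism is effectively free.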
When Dedicated Wins
Dedicated instances are better when latency requirements are strict and traffic is constant:
- Real-time UX — if a user is waiting for a response and your model is cold, a 90-second cold start is unacceptable. Use a warm dedicated instance or a min-instance keep-alive.
- High sustained throughput — if your GPU is running at 80%+ utilization all day, a reserved instance is cheaper.
- Fine-tuning runs — long training jobs (hours to days) are better on spot or on-demand reserved instances, not serverless.
The Practical Rule of Thumb
If your GPU would be idle more than 40% of the time, serverless is cheaper. If it's near-fully utilized, dedicated wins.
Most early-stage AI products and internal tools start with serverless. As traffic grows and becomes predictable, you can migrate hot-path inference to dedicated instances while keeping batch and dev workloads on serverless.
With Velar, you can mix both patterns in the same codebase — deploy with .deploy() for the serverless model, and run training or batch jobs on demand with the same Python SDK.