A dedicated GPU pod reserved exclusively for your model — always warm, always running. Flat monthly rate, up to 18% cheaper than on-demand at full traffic.
Available on Pro ($49/mo) and Business ($199/mo) plans.
Use Serverless Jobs for experiments and batch workloads. Switch to Persistent Endpoints when you need production-grade latency and reliability.
| Aspect | Serverless Jobs | Persistent Endpoints |
|---|---|---|
| Best for | Experiments, batch jobs, CI/ML pipelines | Production APIs, real-time inference |
| Cold start | ~30-60 seconds | 0 ms (always warm) |
| Billing | Per second of execution | Flat monthly rate |
| Cost at 24/7 load | A100 = ~$1,700/mo | A100 = $1,400/mo |
| GPU sharing | Shared infrastructure | Dedicated pod — yours only |
| URL stability | Changes each deploy | Fixed custom domain |
| Idle cost | $0 (scale to zero) | Always billed |
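The table's cost row can be sanity-checked with quick arithmetic. This sketch treats the A100 figures above as illustrative rates, not a price sheet:

```python
# Back-of-the-envelope comparison using the A100 rates from the table above.
SECONDS_PER_MONTH = 30 * 24 * 3600
serverless_monthly_at_full_load = 1700  # ~$/mo, billed per second, running 24/7
persistent_monthly_flat = 1400          # $/mo, flat rate

# Implied serverless per-second rate at full utilization
serverless_per_second = serverless_monthly_at_full_load / SECONDS_PER_MONTH

# Utilization above which the flat rate beats per-second billing
break_even_utilization = persistent_monthly_flat / serverless_monthly_at_full_load
savings_at_full_load = 1 - break_even_utilization

print(f"Break-even utilization: {break_even_utilization:.0%}")  # → 82%
print(f"Savings at 24/7 load: {savings_at_full_load:.0%}")      # → 18%
```

Below roughly 82% utilization, per-second Serverless billing (with scale-to-zero) is the cheaper option; above it, the flat rate wins.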
Your model is always loaded. Every request hits a warm GPU instantly — no 30-60 second spin-up penalty for your users.
No per-second surprises. One predictable monthly bill that's up to 18% cheaper than running Serverless Jobs 24/7.
Your endpoint runs on a pod reserved exclusively for you. No noisy neighbors, consistent throughput, predictable latency.
Every Persistent Endpoint gets a stable URL at your subdomain. No pod IDs, no proxy URLs that change on restart.
Add more GPUs to your endpoint in seconds from the dashboard. No redeployment, no downtime — just more capacity.
Same Python SDK you already know. Change one line — from @app.function() to @app.endpoint() — and you're live.
Already using Serverless Jobs? Switching to a Persistent Endpoint is a single-line change. The same Python SDK, the same image, the same GPU — just always on.
```python
import velar

app = velar.App("llama-api")

image = velar.Image.from_registry(
    "pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime"
).pip_install("vllm")

llm = None  # loaded once per pod, then reused across requests

# Deploy as a persistent endpoint — always on
@app.endpoint(gpu="A100", image=image)
def generate(prompt: str) -> str:
    from vllm import LLM, SamplingParams
    global llm
    if llm is None:
        # First request on this pod loads the model; every request after
        # that hits the warm, already-loaded model.
        llm = LLM(model="meta-llama/Llama-3-8B")
    return llm.generate([prompt], SamplingParams(max_tokens=512))[0].outputs[0].text

app.deploy()
# → Your endpoint is live at https://your-org.velar.run/llama-api
```

Flat monthly rate per GPU. Cheaper than running Serverless Jobs 24/7 — and with zero cold starts.
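Once deployed, the endpoint is plain HTTPS, so any client works. A minimal sketch using only the standard library; the URL shape follows the deploy output above, but the request and response JSON fields ("prompt", "output") are assumptions, not a documented contract:

```python
import json
import urllib.request

def call_generate(url: str, prompt: str) -> str:
    # POST a JSON body and read back the generated text.
    # Field names ("prompt", "output") are hypothetical.
    payload = json.dumps({"prompt": prompt}).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["output"]

# call_generate("https://your-org.velar.run/llama-api", "Write a haiku about GPUs")
```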
Serve Llama, Mistral, or your fine-tuned model as a production API endpoint. Latency under 200ms for typical prompts.
Stable Diffusion, FLUX, and video models need warm GPUs. Endpoints eliminate the cold-start that kills user experience.
High-throughput embedding APIs for RAG pipelines and semantic search — process thousands of requests per minute.
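At those request rates, throughput usually comes from batching on the client side rather than issuing one HTTP call per document. A minimal sketch; the batch size of 64 is a tuning assumption, not a platform limit:

```python
from typing import Iterator, List

def batched(texts: List[str], size: int = 64) -> Iterator[List[str]]:
    # Group documents into fixed-size batches so each request amortizes
    # network and GPU-dispatch overhead across many inputs.
    for i in range(0, len(texts), size):
        yield texts[i : i + size]

docs = [f"doc-{n}" for n in range(1000)]
batches = list(batched(docs, size=64))
# 1000 docs at batch size 64 → 16 requests instead of 1000
```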
Vision models, speech-to-text, and multi-modal pipelines that need consistent sub-second response times.
Start with Serverless Jobs on the free plan. Upgrade to Pro when you're ready to go always-on.