Persistent Endpoints

Production AI APIs. Zero cold starts.

A dedicated GPU pod reserved exclusively for your model — always warm, always running. Flat monthly rate, up to 18% cheaper than on-demand at full traffic.

Available on Pro ($49/mo) and Business ($199/mo) plans.

0 ms
Cold start latency
18%
Savings vs on-demand 24/7
99.9%
Uptime SLA
1 line
Change from Serverless Jobs

Jobs vs Endpoints — pick the right tool

Use Serverless Jobs for experiments and batch workloads. Switch to Persistent Endpoints when you need production-grade latency and reliability.

Aspect | Serverless Jobs | Persistent Endpoints
Best for | Experiments, batch jobs, CI/ML pipelines | Production APIs, real-time inference
Cold start | ~30-60 seconds | 0 ms (always warm)
Billing | Per second of execution | Flat monthly rate
Cost at 24/7 load | A100 ≈ $1,700/mo | A100 = $1,400/mo
GPU sharing | Shared infrastructure | Dedicated pod — yours only
URL stability | Changes each deploy | Fixed custom domain
Idle cost | $0 (scale to zero) | Always billed

Built for production

Zero cold-start latency

Your model is always loaded. Every request hits a warm GPU instantly — no 30-60 second spin-up penalty for your users.

Flat monthly rate

No per-second surprises. One predictable monthly bill that's up to 18% cheaper than running Serverless Jobs 24/7.

Dedicated GPU — not shared

Your endpoint runs on a pod reserved exclusively for you. No noisy neighbors, consistent throughput, predictable latency.

Custom domain included

Every Persistent Endpoint gets a stable URL at your subdomain. No pod IDs, no proxy URLs that change on restart.

Instant scale-up

Add more GPUs to your endpoint in seconds from the dashboard. No redeployment, no downtime — just more capacity.

Drop-in from Serverless Jobs

Same Python SDK you already know. Change one line — from @app.function() to @app.endpoint() — and you're live.

One decorator away from production

Already using Serverless Jobs? Switching to a Persistent Endpoint is a single-line change. The same Python SDK, the same image, the same GPU — just always on.

  • Change @app.function() to @app.endpoint()
  • Run app.deploy() once
  • Get a stable URL — no redeployments needed
  • Your model stays loaded between requests
Read the Endpoints docs
inference_api.py
import velar

app = velar.App("llama-api")
image = velar.Image.from_registry(
    "pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime"
).pip_install("vllm")

llm = None  # loaded once per pod, then stays warm between requests

# Deploy as a persistent endpoint — always on
@app.endpoint(gpu="A100", image=image)
def generate(prompt: str) -> str:
    from vllm import LLM, SamplingParams
    global llm
    if llm is None:  # first request loads the model; it stays resident after that
        llm = LLM(model="meta-llama/Llama-3-8B")
    out = llm.generate([prompt], SamplingParams(max_tokens=512))
    return out[0].outputs[0].text

app.deploy()
# → Your endpoint is live at https://your-org.velar.run/llama-api
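Once deployed, the endpoint is just a stable HTTPS URL, so any client can call it. A minimal client sketch using only the standard library is shown below — note that the URL pattern and the `{"prompt": ...}` / `{"output": ...}` JSON shapes are illustrative assumptions, not a documented Velar API contract:

```python
import json
import urllib.request

# Assumed stable endpoint URL from the deploy step above.
ENDPOINT_URL = "https://your-org.velar.run/llama-api"

def build_request(prompt: str) -> urllib.request.Request:
    """Build an HTTPS POST carrying the prompt as JSON (payload shape assumed)."""
    return urllib.request.Request(
        ENDPOINT_URL,
        data=json.dumps({"prompt": prompt}).encode(),
        headers={"Content-Type": "application/json"},
    )

def call_generate(prompt: str) -> str:
    # Always-warm pod: the request hits a loaded model with no spin-up wait.
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["output"]
```

Because the URL is fixed across deploys, this client never needs updating when you redeploy the endpoint.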

Endpoint pricing

Flat monthly rate per GPU. Cheaper than running Serverless Jobs 24/7 — and with zero cold starts.

RTX 4090
24 GB VRAM
$600/month
$0.83/hr effective for 24/7
17% cheaper than Jobs 24/7
vs $1.00/hr × 720 hrs = $720/mo on-demand
  • Zero cold-start latency
  • Dedicated GPU — not shared
  • Custom domain included
  • 99.9% uptime SLA
Get Started
Most Popular
A100 80GB
80 GB VRAM
$1,400/month
$1.94/hr effective for 24/7
18% cheaper than Jobs 24/7
vs $2.36/hr × 720 hrs = $1,699/mo on-demand
  • Zero cold-start latency
  • Dedicated GPU — not shared
  • Custom domain included
  • 99.9% uptime SLA
Get Started
H100 SXM 80GB
80 GB VRAM
$2,800/month
$3.89/hr effective for 24/7
15% cheaper than Jobs 24/7
vs $4.57/hr × 720 hrs = $3,290/mo on-demand
  • Zero cold-start latency
  • Dedicated GPU — not shared
  • Custom domain included
  • 99.9% uptime SLA
Get Started
Need H200, multi-GPU, or another configuration? Endpoints are available for all GPU types. Contact sales@velar.run for custom configurations.
Requires Pro ($49/mo) or Business ($199/mo) plan. View all plans →
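The effective hourly rates and savings percentages in the cards above follow from simple arithmetic over a 720-hour month (30 days × 24 hrs). A quick sketch, using the on-demand $/hr figures quoted on this page:

```python
HOURS_PER_MONTH = 720  # 30 days x 24 hrs, as used in the comparisons above

def endpoint_savings(flat_monthly: float, on_demand_hourly: float) -> tuple[float, int]:
    """Return (effective $/hr at 24/7 usage, % saved vs on-demand 24/7)."""
    effective_hourly = flat_monthly / HOURS_PER_MONTH
    on_demand_monthly = on_demand_hourly * HOURS_PER_MONTH
    pct_saved = (on_demand_monthly - flat_monthly) / on_demand_monthly * 100
    return round(effective_hourly, 2), round(pct_saved)

print(endpoint_savings(1400, 2.36))  # A100: (1.94, 18)
```

Running the same function with (600, 1.00) and (2800, 4.57) reproduces the RTX 4090 and H100 figures of 17% and 15%.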

Ship your production inference API

Start with Serverless Jobs on the free plan. Upgrade to Pro when you're ready to go always-on.