Persistent Endpoints

Production AI APIs. Zero cold starts.

A dedicated GPU pod reserved exclusively for your model — always warm, always running. Flat monthly rate, up to 18% cheaper than on-demand at full traffic.

Available on Pro ($49/mo) and Business ($199/mo) plans.

0 ms
Cold start latency
18%
Savings vs on-demand 24/7
99.9%
Uptime SLA
1 line
Change from Serverless Jobs

Jobs vs Endpoints — pick the right tool

Use Serverless Jobs for experiments and batch workloads. Switch to Persistent Endpoints when you need production-grade latency and reliability.

Aspect | Serverless Jobs | Persistent Endpoints
Best for | Experiments, batch jobs, CI/ML pipelines | Production APIs, real-time inference
Cold start | ~30-60 seconds | 0 ms (always warm)
Billing | Per second of execution | Flat monthly rate
Cost at 24/7 load | A100 ≈ $1,700/mo | A100 = $1,400/mo
GPU sharing | Shared infrastructure | Dedicated pod — yours only
URL stability | Changes each deploy | Fixed custom domain
Idle cost | $0 (scale to zero) | Always billed

Built for production

Zero cold-start latency

Your model is always loaded. Every request hits a warm GPU instantly — no 30-60 second spin-up penalty for your users.

Flat monthly rate

No per-second surprises. One predictable monthly bill that's up to 18% cheaper than running Serverless Jobs 24/7.

Dedicated GPU — not shared

Your endpoint runs on a pod reserved exclusively for you. No noisy neighbors, consistent throughput, predictable latency.

Custom domain included

Every Persistent Endpoint gets a stable URL at your subdomain. No pod IDs, no proxy URLs that change on restart.

Instant scale-up

Add more GPUs to your endpoint in seconds from the dashboard. No redeployment, no downtime — just more capacity.

Drop-in from Serverless Jobs

Same Python SDK you already know. Change one line — from @app.function() to @app.endpoint() — and you're live.

One decorator away from production

Already using Serverless Jobs? Switching to a Persistent Endpoint is a single-line change. The same Python SDK, the same image, the same GPU — just always on.

  • Change @app.function() to @app.endpoint()
  • Run app.deploy() once
  • Get a stable URL — no redeployments needed
  • Your model stays loaded between requests
Read the Endpoints docs
inference_api.py
import velar

app = velar.App("llama-api")
image = velar.Image.from_registry(
    "pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime"
).pip_install("vllm")

llm = None  # loaded once per pod, then stays warm between requests

# Deploy as a persistent endpoint — always on
@app.endpoint(gpu="A100", image=image)
def generate(prompt: str) -> str:
    from vllm import LLM, SamplingParams
    global llm
    if llm is None:  # first request loads the model; it stays resident after that
        llm = LLM(model="meta-llama/Llama-3-8B")
    out = llm.generate([prompt], SamplingParams(max_tokens=512))
    return out[0].outputs[0].text

app.deploy()
# → Your endpoint is live at https://your-org.velar.run/llama-api
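Once deployed, the endpoint is just a stable HTTPS URL, so any client can call it. A minimal client sketch using only the standard library is shown below — note that the URL pattern and the `{"prompt": ...}` / `{"output": ...}` JSON shapes are illustrative assumptions, not a documented Velar API contract:

```python
import json
import urllib.request

# Assumed stable endpoint URL from the deploy step above.
ENDPOINT_URL = "https://your-org.velar.run/llama-api"

def build_request(prompt: str) -> urllib.request.Request:
    """Build an HTTPS POST carrying the prompt as JSON (payload shape assumed)."""
    return urllib.request.Request(
        ENDPOINT_URL,
        data=json.dumps({"prompt": prompt}).encode(),
        headers={"Content-Type": "application/json"},
    )

def call_generate(prompt: str) -> str:
    # Always-warm pod: the request hits a loaded model with no spin-up wait.
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["output"]
```

Because the URL is fixed across deploys, this client never needs updating when you redeploy the endpoint.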

Endpoint pricing

Flat monthly rate per GPU. Cheaper than running Serverless Jobs 24/7 — and with zero cold starts.

RTX 4090
24 GB VRAM
$600/month
$0.83/hr effective for 24/7
17% cheaper than Jobs 24/7
vs $1.00/hr × 720 hrs = $720/mo on-demand
  • Zero cold-start latency
  • Dedicated GPU — not shared
  • Custom domain included
  • 99.9% uptime SLA
Get Started
Most Popular
A100 80GB
80 GB VRAM
$1,400/month
$1.94/hr effective for 24/7
18% cheaper than Jobs 24/7
vs $2.36/hr × 720 hrs = $1,699/mo on-demand
  • Zero cold-start latency
  • Dedicated GPU — not shared
  • Custom domain included
  • 99.9% uptime SLA
Get Started
H100 SXM 80GB
80 GB VRAM
$2,800/month
$3.89/hr effective for 24/7
15% cheaper than Jobs 24/7
vs $4.57/hr × 720 hrs = $3,290/mo on-demand
  • Zero cold-start latency
  • Dedicated GPU — not shared
  • Custom domain included
  • 99.9% uptime SLA
Get Started
Need H200, multi-GPU, or another configuration? Endpoints are available for all GPU types. Contact sales@velar.run for custom configurations.
Requires Pro ($49/mo) or Business ($199/mo) plan. View all plans →
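The effective hourly rates and savings percentages in the cards above follow from simple arithmetic over a 720-hour month (30 days × 24 hrs). A quick sketch, using the on-demand $/hr figures quoted on this page:

```python
HOURS_PER_MONTH = 720  # 30 days x 24 hrs, as used in the comparisons above

def endpoint_savings(flat_monthly: float, on_demand_hourly: float) -> tuple[float, int]:
    """Return (effective $/hr at 24/7 usage, % saved vs on-demand 24/7)."""
    effective_hourly = flat_monthly / HOURS_PER_MONTH
    on_demand_monthly = on_demand_hourly * HOURS_PER_MONTH
    pct_saved = (on_demand_monthly - flat_monthly) / on_demand_monthly * 100
    return round(effective_hourly, 2), round(pct_saved)

print(endpoint_savings(1400, 2.36))  # A100: (1.94, 18)
```

Running the same function with (600, 1.00) and (2800, 4.57) reproduces the RTX 4090 and H100 figures of 17% and 15%.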

Ship your production inference API

Start with Serverless Jobs on the free plan. Upgrade to Pro when you're ready to go always-on.