NVIDIA A100 80GB on-demand from $2.36/hr. Per-second billing, scale to zero, no idle costs. Deploy your first workload in under 60 seconds.
80 GB VRAM fits Llama 3.1 70B in fp8 (~70 GB of weights); the same model in fp16 needs roughly 140 GB and a second GPU. Run production-grade inference APIs with vLLM at $2.36/hr.
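The fit claim above is simple arithmetic: weight memory is parameter count times bytes per parameter. A minimal sketch (it deliberately ignores KV cache and activation overhead, which add several GB in practice):

```python
def weight_gb(n_params: float, bytes_per_param: float) -> float:
    # Back-of-envelope VRAM needed just to hold the weights.
    return n_params * bytes_per_param / 1e9

# Llama 3.1 70B in fp8: one byte per parameter.
print(round(weight_gb(70e9, 1)))  # 70 -> fits in 80 GB
# The same checkpoint in fp16: two bytes per parameter.
print(round(weight_gb(70e9, 2)))  # 140 -> needs more than one A100
```

Leave headroom beyond the raw weight size: vLLM also allocates KV-cache blocks, so a 70 GB fp8 model on an 80 GB card is close to the ceiling.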
Fine-tune 7B–70B parameter models. Enough VRAM for full fine-tuning of 13B models or LoRA on 70B. Per-second billing means you only pay for actual compute.
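Why LoRA fits where full fine-tuning does not: it freezes the original weights and trains only two small low-rank factors per layer, so optimizer state shrinks by orders of magnitude. A rough sketch of the parameter count for one projection matrix (the 8192 hidden size is Llama 3 70B's; the rank is an illustrative choice):

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    # LoRA learns A (d_in x rank) and B (rank x d_out)
    # instead of updating the full d_in x d_out matrix.
    return rank * (d_in + d_out)

full = 8192 * 8192                          # frozen projection weights
lora = lora_params(8192, 8192, rank=16)     # trainable LoRA weights
print(lora / full)  # 0.00390625 -> under 0.4% of the matrix is trained
```

Gradients and Adam state are only kept for that sub-1% slice, which is what leaves room for a 70B base model on a single card.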
Process millions of embeddings, run Whisper on long audio files, or compute batch inference at scale. The A100's 2 TB/s memory bandwidth minimizes bottlenecks.
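The bandwidth figure matters because batch-1 decoding is memory-bound: every generated token requires streaming the full weight set through the GPU. A rough roofline estimate, assuming fp8 weights (~70 GB) and ignoring KV-cache traffic:

```python
bandwidth_gb_s = 2039   # A100 80GB memory bandwidth, from the spec table
weights_gb = 70         # assumed fp8 70B weight footprint

# Upper bound on single-stream decode speed: one full weight
# read per generated token.
tokens_per_sec = bandwidth_gb_s / weights_gb
print(round(tokens_per_sec, 1))  # ~29.1 tokens/sec ceiling
```

Batching amortizes the weight reads across requests, which is why throughput-oriented workloads like bulk embeddings see much better effective utilization.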
| Spec | Value |
| --- | --- |
| GPU | NVIDIA A100 80GB |
| VRAM | 80 GB |
| Architecture | Ampere |
| CUDA Cores | 6,912 |
| Tensor Cores | 432 |
| Memory Bandwidth | 2,039 GB/s |
| Plan | Billing | Price |
| --- | --- | --- |
| Serverless Jobs | Scale to zero · per-second billing | $2.36/hr |
| Persistent Endpoint | Always-on · flat monthly rate | $1,400/mo (Pro plan required · See plans) |
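Which plan wins depends on utilization. A quick break-even sketch (730 hours is the average month length, an assumption for the estimate):

```python
serverless_rate = 2.36    # $/hr, per-second billing
endpoint_monthly = 1400   # $/mo flat rate
hours_per_month = 730     # average month (assumption)

breakeven_hours = endpoint_monthly / serverless_rate
print(round(breakeven_hours))                       # ~593 GPU-hours/month
print(round(breakeven_hours / hours_per_month, 2))  # ~0.81 utilization
```

Below roughly 80% utilization, serverless with scale-to-zero is cheaper; a GPU that is busy around the clock favors the flat monthly endpoint.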
No Dockerfile. No Kubernetes. Just a Python decorator and a GPU string.
```python
import velar

app = velar.App("a100-inference")

# Container image: official PyTorch CUDA runtime plus vLLM.
image = velar.Image.from_registry(
    "pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime"
).pip_install("vllm")

@app.function(gpu="A100", image=image)
def serve(prompt: str) -> str:
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Meta-Llama-3-70B-Instruct")
    params = SamplingParams(max_tokens=512)
    output = llm.generate([prompt], params)
    return output[0].outputs[0].text

app.deploy()
print(serve.remote("Explain the transformer architecture"))
```

Start with $10 in free GPU credits. No credit card required. First workload live in under 60 seconds.