Available on Velar

Rent A100 80GB GPU Cloud

NVIDIA A100 80GB on-demand from $2.36/hr. Per-second billing, scale to zero, no idle costs. Deploy your first workload in under 60 seconds.

VRAM: 80 GB
Architecture: Ampere
Per-second rate: $0.000656/sec

Best for A100 80GB workloads

Large LLM inference

80 GB VRAM fits Llama 3.1 70B in fp8 or other 8-bit quantization; in fp16, the 70B weights alone need roughly 140 GB and won't fit on a single card. Run production-grade inference APIs with vLLM at $2.36/hr.

Model fine-tuning

Fine-tune 7B–70B parameter models. Enough VRAM for full fine-tuning of 7B–13B models with memory-efficient optimizers, or LoRA on 70B. Per-second billing means you only pay for actual compute.
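Why does full fine-tuning top out around 13B while LoRA reaches 70B? A back-of-the-envelope VRAM sketch makes it concrete. The constants below are illustrative accounting (2-byte weights and gradients, fp32 or 8-bit Adam state, 4-bit quantized base for QLoRA-style adapters, activations ignored), not measurements; the function names are hypothetical.

```python
def full_ft_gb(params_b: float, adam_8bit: bool = False) -> float:
    """Rough VRAM for full fine-tuning: 2-byte weights + 2-byte grads
    + Adam state (two fp32 moments, or two 1-byte moments for 8-bit
    Adam). Activation memory is ignored."""
    bytes_per_param = 2 + 2 + (2 if adam_8bit else 8)
    return params_b * bytes_per_param  # billions of params x bytes ~ GB

def qlora_gb(params_b: float, trainable_frac: float = 0.01) -> float:
    """QLoRA-style: 4-bit frozen base weights, with full training
    state only for a small trainable adapter fraction."""
    base = params_b * 0.5                      # 4-bit base weights
    adapters = params_b * trainable_frac * 12  # fp16 w+g, fp32 Adam
    return base + adapters

print(f"13B full fine-tune, 8-bit Adam: ~{full_ft_gb(13, True):.0f} GB")
print(f"70B full fine-tune:             ~{full_ft_gb(70):.0f} GB")
print(f"70B QLoRA:                      ~{qlora_gb(70):.0f} GB")
```

Under these assumptions a 13B full fine-tune with an 8-bit optimizer lands just under the 80 GB budget, while a 70B full fine-tune would need hundreds of GB and only a quantized-base LoRA run fits on one card.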

Large-scale batch processing

Process millions of embeddings, run Whisper on long audio files, or compute batch inference at scale. The A100's 2 TB/s memory bandwidth minimizes bottlenecks.
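For batch workloads like embedding millions of documents, the client-side pattern is to chunk the corpus and hand each chunk to a GPU function, so every call amortizes its overhead over a full batch. A minimal, GPU-free sketch of that chunking step (the helper name is hypothetical):

```python
from typing import Iterator, List

def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Split a large corpus into fixed-size batches so each remote
    call saturates the GPU instead of paying per-item overhead."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

texts = [f"doc-{i}" for i in range(10)]
batches = list(batched(texts, 4))
print([len(b) for b in batches])  # [4, 4, 2]
```

Each batch would then be the argument to a Velar function like the `serve` example further down this page, with results collected as they complete.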

A100 80GB specifications

GPU: NVIDIA A100 80GB
VRAM: 80 GB
Architecture: Ampere
CUDA Cores: 6,912
Tensor Cores: 432
Memory Bandwidth: 2,039 GB/s
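Memory bandwidth, not raw FLOPs, usually bounds single-stream LLM decoding: each generated token reads every weight once, so the specs above imply a rough speed ceiling. Illustrative arithmetic only (assumes a 70B model stored in 8-bit precision, ignores KV-cache traffic and kernel overhead):

```python
bandwidth_gb_s = 2039  # A100 80GB memory bandwidth from the spec table
weights_gb = 70        # e.g. a 70B model quantized to 8-bit

# Each decoded token must stream all weights through the memory system
ms_per_token = weights_gb / bandwidth_gb_s * 1000
tokens_per_s = 1000 / ms_per_token
print(f"~{ms_per_token:.1f} ms/token, ~{tokens_per_s:.0f} tok/s upper bound")
```

Real throughput is lower per stream but much higher in aggregate, since batching lets many requests share each weight read.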

A100 80GB pricing on Velar

Serverless Jobs

Scale to zero · per-second billing

$2.36/hr

  • $0.000656/sec — billed to the second
  • Scale to zero when idle
  • No reserved capacity

Persistent Endpoint

Always-on · flat monthly rate

$1,400/mo

  • $1.94/hr effective — 18% cheaper than on-demand
  • Zero cold-start latency
  • Always available, guaranteed GPU

Pro plan required · See plans
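A quick way to choose between the two plans is the break-even utilization: below it, serverless per-second billing is cheaper; above it, the flat-rate endpoint wins. A small sketch using the prices above (assumes an average 730-hour month):

```python
serverless_per_hr = 2.36   # on-demand A100 80GB rate
endpoint_per_mo = 1400     # persistent endpoint flat rate
hours_per_mo = 730         # average hours in a month

# GPU-hours per month at which both plans cost the same
breakeven_hours = endpoint_per_mo / serverless_per_hr
utilization = breakeven_hours / hours_per_mo
print(f"Break-even: ~{breakeven_hours:.0f} GPU-hours/month "
      f"(~{utilization:.0%} utilization)")
```

Roughly: if the GPU is busy more than about four fifths of the month, the persistent endpoint is the cheaper option.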

Deploy on A100 80GB in a few lines of Python

No Dockerfile. No Kubernetes. Just a Python decorator and a GPU string.

inference.py
import velar

app = velar.App("a100-inference")

# CUDA base image with vLLM installed on top
image = velar.Image.from_registry(
    "pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime"
).pip_install("vllm")

@app.function(gpu="A100", image=image)
def serve(prompt: str) -> str:
    from vllm import LLM, SamplingParams

    # 70B fp16 weights need ~140 GB; on a single 80 GB A100,
    # point this at an 8-bit/fp8-quantized checkpoint.
    llm = LLM(model="meta-llama/Meta-Llama-3-70B-Instruct")
    params = SamplingParams(max_tokens=512)
    outputs = llm.generate([prompt], params)
    return outputs[0].outputs[0].text

app.deploy()
print(serve.remote("Explain the transformer architecture"))

Deploy on A100 80GB today

Start with $10 in free GPU credits. No credit card required. First workload live in under 60 seconds.