# How to Deploy Llama 3 on GPU with Python
Llama 3 from Meta is one of the most capable open-weight language models available. Running it locally works for testing, but production inference requires a GPU with enough VRAM — and ideally, infrastructure that scales with demand without keeping a machine running idle.
This guide walks through deploying Llama 3 8B or 70B on a cloud GPU using vLLM and Velar. You'll go from zero to a live inference endpoint in under 10 minutes.
## Prerequisites
You need Python 3.10+ and the Velar SDK installed. If you don't have a Velar account, sign up at velar.run/signup — you get $10 in free GPU credits.
```shell
pip install velar-sdk
```

You'll also need a Hugging Face token to download the Llama 3 weights. Create one at huggingface.co/settings/tokens and accept the Llama 3 license on the model page.
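The deploy script below passes the token to the container as a secret. However you wire it up, reading the token from an environment variable beats hardcoding it in source. A minimal sketch — the `HF_TOKEN` variable name is a convention used in this guide, not a requirement:

```python
import os

def get_hf_token() -> str:
    """Read the Hugging Face token from the environment rather than
    hardcoding it; fail fast if it's missing or malformed."""
    token = os.environ.get("HF_TOKEN", "")
    if not token.startswith("hf_"):
        raise RuntimeError("Set HF_TOKEN to a valid Hugging Face access token")
    return token
```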
## GPU Selection
The Llama 3 model size determines the minimum GPU VRAM you need:
| Model | Parameters | Min VRAM (fp16) | Recommended GPU |
|---|---|---|---|
| Llama 3 8B | 8B | 16 GB | L4 (24 GB) |
| Llama 3 70B | 70B | 140 GB | 2× A100 80 GB |
| Llama 3 8B (int4) | 8B | 6 GB | L4 |
For this tutorial we'll use the 8B model on an L4. It's the cheapest option that fits the full model comfortably.
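The fp16 numbers in the table follow a simple rule of thumb: roughly 2 bytes per parameter for the weights alone (0.5 bytes at int4). KV cache and activations add overhead on top, which is why the int4 row lists 6 GB rather than the bare 4 GB. A quick sketch of the arithmetic:

```python
def weights_vram_gb(params_billion: float, bytes_per_param: float) -> float:
    """VRAM needed just to hold the weights (excludes KV cache and activations)."""
    return params_billion * bytes_per_param

weights_vram_gb(8, 2.0)    # Llama 3 8B, fp16 -> 16.0 GB
weights_vram_gb(70, 2.0)   # Llama 3 70B, fp16 -> 140.0 GB
weights_vram_gb(8, 0.5)    # Llama 3 8B, int4 -> 4.0 GB (table adds headroom -> 6 GB)
```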
## Deploying with vLLM
vLLM is one of the fastest open-source LLM inference engines. It uses PagedAttention to manage KV-cache memory efficiently and supports continuous batching, so concurrent requests share GPU compute instead of waiting for a full batch to finish.
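Continuous batching is the key idea here. This toy simulation (not vLLM's actual implementation) illustrates the scheduling: each decode step advances every in-flight request by one token, and new requests join the batch as soon as a slot frees up, rather than waiting for the whole batch to drain:

```python
from collections import deque

def continuous_batching_steps(requests, max_batch=4):
    """Toy model of continuous batching. Each request is the number of
    tokens it still needs; each step decodes one token for every active
    request, admitting waiting requests whenever a slot is free."""
    waiting = deque(requests)
    active, steps = [], 0
    while waiting or active:
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())   # join mid-stream
        active = [t - 1 for t in active if t > 1]  # finished requests leave
        steps += 1
    return steps

# 3+1+2 = 6 tokens at up to 2 tokens/step finishes in the minimum 3 steps;
# static batching ([3,1] then [2]) would take 5.
continuous_batching_steps([3, 1, 2], max_batch=2)
```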
```python
import velar

app = velar.App("llama3-inference")

image = velar.Image.from_registry(
    "pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime"
).pip_install("vllm", "huggingface_hub")

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

@app.function(
    gpu="L4",
    image=image,
    timeout=600,
    secrets={"HF_TOKEN": "hf_your_token_here"},  # replace with your token
)
def generate(prompt: str, max_tokens: int = 512) -> str:
    from vllm import LLM, SamplingParams

    llm = LLM(
        model=MODEL_ID,
        dtype="float16",
        max_model_len=4096,
    )
    params = SamplingParams(
        temperature=0.7,
        max_tokens=max_tokens,
    )
    messages = [
        {"role": "user", "content": prompt}
    ]
    # Format the prompt with the Llama 3 chat template
    formatted = llm.get_tokenizer().apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    output = llm.generate([formatted], params)
    return output[0].outputs[0].text

if __name__ == "__main__":
    app.deploy()
    result = generate.remote("Explain transformers in one paragraph.")
    print(result)
```

## First Deploy
Run the script:
```shell
python llama3_serve.py
```

Velar builds the container, pushes it to the registry, and deploys it on a GPU. The first deploy takes 2-3 minutes while the image is built and the model weights are downloaded. Subsequent calls use the cached image.
You should see output like:
```
Deploying llama3-inference...
✓ Image built (cached)
✓ Deployed: dep_abc123
Generating response...
Transformers are a deep learning architecture introduced in 2017...
```

## Scaling to Production
The example above handles one request at a time. For production, vLLM already handles concurrent requests internally via continuous batching — you don't need to do anything special. Multiple callers to `generate.remote()` are queued and batched automatically once an instance is warm (i.e., the model is already loaded).
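To exercise that batching from the client side, fan requests out concurrently so they arrive together. A sketch using only the standard library — with the deployed app you would pass `generate.remote` as the callable; the stand-in below just keeps the example runnable on its own:

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(fn, prompts, max_workers=8):
    """Submit all prompts at once; requests that land on a warm
    instance get batched together by the serving engine."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fn, prompts))

# In production: fan_out(generate.remote, prompts)
# Any callable works, e.g. a local stand-in:
answers = fan_out(str.upper, ["first prompt", "second prompt"])
```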
If you need lower latency at higher concurrency, deploy multiple instances or switch to an A100 for better throughput:
```python
@app.function(
    gpu="A100",  # ~3x higher throughput than an L4
    image=image,
    timeout=600,
    secrets={"HF_TOKEN": "hf_your_token_here"},
)
```

## Cost Estimate
Velar bills per second of GPU time. For this workload on an L4 ($0.66/hr):
- Each inference call takes ~2-5 seconds for a typical prompt (after the model is loaded)
- Cold start (model load): ~30-60 seconds on first call
- 1000 requests/day at 3 seconds each = 50 minutes of GPU time ≈ $0.55/day
Compare that to a dedicated GPU instance running 24/7 at $15.84/day for the same L4. Serverless wins for workloads that aren't continuous.
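The arithmetic behind those numbers is straightforward per-second billing. A small helper to reproduce them (rates and timings are the estimates from above, not guarantees):

```python
def daily_gpu_cost(requests_per_day: int, seconds_per_request: float,
                   usd_per_hour: float) -> float:
    """Serverless daily cost: pay only for GPU seconds actually used."""
    gpu_seconds = requests_per_day * seconds_per_request
    return round(gpu_seconds / 3600 * usd_per_hour, 2)

daily_gpu_cost(1000, 3, 0.66)   # 1000 req/day at 3 s on an L4 -> 0.55
round(24 * 0.66, 2)             # dedicated L4 running 24/7   -> 15.84
```

The break-even point is easy to find the same way: the dedicated instance only wins once your workload keeps the GPU busy most of the day.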
## Next Steps

- Wrap `generate.remote()` in a FastAPI endpoint for an OpenAI-compatible API
- Try Llama 3 70B with 2× A100 by setting `gpu="A100", gpu_count=2`
- Add streaming responses with `generate.remote_gen()`
- Cache the model weights in a volume with `volume_size_gb=100` to skip re-downloading on cold starts
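For the first of those steps, the endpoint needs to return responses in the OpenAI chat-completions shape. A minimal sketch of that payload using only the standard library — trimmed to the fields most clients read, so treat it as an approximation of the schema rather than the full spec:

```python
import time
import uuid

def chat_completion_response(model: str, text: str) -> dict:
    """Build a minimal OpenAI-style chat completion payload."""
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex[:12]}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": text},
            "finish_reason": "stop",
        }],
    }
```

A FastAPI route would call `generate.remote(prompt)` and pass the result through this builder before returning it as JSON.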