# How to Deploy Llama 3 on GPU with Python
Llama 3 from Meta is one of the most capable open-weight language models available. Running it locally works for testing, but production inference requires a GPU with enough VRAM — and ideally, infrastructure that scales with demand without keeping a machine running idle.
This guide walks through deploying Llama 3 8B or 70B on a cloud GPU using vLLM and Velar. You'll go from zero to a live inference endpoint in under 10 minutes.
## Prerequisites
You need Python 3.10+ and the Velar SDK installed. If you don't have a Velar account, sign up at velar.run/signup — you get $10 in free GPU credits.
```shell
pip install velar-sdk
```

You'll also need a Hugging Face token to download the Llama 3 weights. Create one at huggingface.co/settings/tokens and accept the Llama 3 license on the model page.
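The deploy script below passes the token to the container as a secret. However you wire it up, reading the token from an environment variable beats hardcoding it in source. A minimal sketch — the `HF_TOKEN` variable name is a convention used in this guide, not a requirement:

```python
import os

def get_hf_token() -> str:
    """Read the Hugging Face token from the environment rather than
    hardcoding it; fail fast if it's missing or malformed."""
    token = os.environ.get("HF_TOKEN", "")
    if not token.startswith("hf_"):
        raise RuntimeError("Set HF_TOKEN to a valid Hugging Face access token")
    return token
```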
## GPU Selection
The Llama 3 model size determines the minimum GPU VRAM you need:
| Model | Parameters | Min VRAM (fp16) | Recommended GPU |
|---|---|---|---|
| Llama 3 8B | 8B | 16 GB | L4 (24 GB) |
| Llama 3 70B | 70B | 140 GB | 2× A100 80 GB |
| Llama 3 8B (int4) | 8B | 6 GB | L4 |
For this tutorial we'll use the 8B model on an L4. It's the cheapest option that fits the full model comfortably.
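The fp16 numbers in the table follow a simple rule of thumb: roughly 2 bytes per parameter for the weights alone (0.5 bytes at int4). KV cache and activations add overhead on top, which is why the int4 row lists 6 GB rather than the bare 4 GB. A quick sketch of the arithmetic:

```python
def weights_vram_gb(params_billion: float, bytes_per_param: float) -> float:
    """VRAM needed just to hold the weights (excludes KV cache and activations)."""
    return params_billion * bytes_per_param

weights_vram_gb(8, 2.0)    # Llama 3 8B, fp16 -> 16.0 GB
weights_vram_gb(70, 2.0)   # Llama 3 70B, fp16 -> 140.0 GB
weights_vram_gb(8, 0.5)    # Llama 3 8B, int4 -> 4.0 GB (table adds headroom -> 6 GB)
```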
## Deploying with vLLM
vLLM is one of the fastest open-source LLM inference engines. It uses PagedAttention to manage KV-cache memory efficiently and supports continuous batching, so concurrent requests share GPU compute instead of waiting for a full batch to finish.
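Continuous batching is the key idea here. This toy simulation (not vLLM's actual implementation) illustrates the scheduling: each decode step advances every in-flight request by one token, and new requests join the batch as soon as a slot frees up, rather than waiting for the whole batch to drain:

```python
from collections import deque

def continuous_batching_steps(requests, max_batch=4):
    """Toy model of continuous batching. Each request is the number of
    tokens it still needs; each step decodes one token for every active
    request, admitting waiting requests whenever a slot is free."""
    waiting = deque(requests)
    active, steps = [], 0
    while waiting or active:
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())   # join mid-stream
        active = [t - 1 for t in active if t > 1]  # finished requests leave
        steps += 1
    return steps

# 3+1+2 = 6 tokens at up to 2 tokens/step finishes in the minimum 3 steps;
# static batching ([3,1] then [2]) would take 5.
continuous_batching_steps([3, 1, 2], max_batch=2)
```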
```python
import velar

app = velar.App("llama3-inference")

image = velar.Image.from_registry(
    "pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime"
).pip_install("vllm", "huggingface_hub")

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

@app.function(
    gpu="L4",
    image=image,
    timeout=600,
    secrets={"HF_TOKEN": "hf_your_token_here"},  # replace with your token
)
def generate(prompt: str, max_tokens: int = 512) -> str:
    from vllm import LLM, SamplingParams

    llm = LLM(
        model=MODEL_ID,
        dtype="float16",
        max_model_len=4096,
    )
    params = SamplingParams(
        temperature=0.7,
        max_tokens=max_tokens,
    )
    messages = [
        {"role": "user", "content": prompt}
    ]
    # Format the prompt with the Llama 3 chat template
    formatted = llm.get_tokenizer().apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    output = llm.generate([formatted], params)
    return output[0].outputs[0].text

if __name__ == "__main__":
    app.deploy()
    result = generate.remote("Explain transformers in one paragraph.")
    print(result)
```

## First Deploy
Run the script:
```shell
python llama3_serve.py
```

Velar builds the container, pushes it to the registry, and deploys it on a GPU. The first deploy takes 2-3 minutes while the image is built and the model weights are downloaded. Subsequent calls use the cached image.
You should see output like:
```
Deploying llama3-inference...
✓ Image built (cached)
✓ Deployed: dep_abc123
Generating response...
Transformers are a deep learning architecture introduced in 2017...
```

## Scaling to Production
The example above handles one request at a time. For production, vLLM already handles concurrent requests internally via continuous batching — you don't need to do anything special. Multiple callers to `generate.remote()` are queued and batched automatically once an instance is warm (i.e., the model is already loaded).
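To exercise that batching from the client side, fan requests out concurrently so they arrive together. A sketch using only the standard library — with the deployed app you would pass `generate.remote` as the callable; the stand-in below just keeps the example runnable on its own:

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(fn, prompts, max_workers=8):
    """Submit all prompts at once; requests that land on a warm
    instance get batched together by the serving engine."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fn, prompts))

# In production: fan_out(generate.remote, prompts)
# Any callable works, e.g. a local stand-in:
answers = fan_out(str.upper, ["first prompt", "second prompt"])
```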
If you need lower latency at higher concurrency, deploy multiple instances or switch to an A100 for better throughput:
```python
@app.function(
    gpu="A100",  # ~3x higher throughput than an L4
    image=image,
    timeout=600,
    secrets={"HF_TOKEN": "hf_your_token_here"},
)
```

## Cost Estimate
Velar bills per second of GPU time. For this workload on an L4 ($0.66/hr):
- Each inference call takes ~2-5 seconds for a typical prompt (after the model is loaded)
- Cold start (model load): ~30-60 seconds on first call
- 1000 requests/day at 3 seconds each = 50 minutes of GPU time ≈ $0.55/day
Compare that to a dedicated GPU instance running 24/7 at $15.84/day for the same L4. Serverless wins for workloads that aren't continuous.
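The arithmetic behind those numbers is straightforward per-second billing. A small helper to reproduce them (rates and timings are the estimates from above, not guarantees):

```python
def daily_gpu_cost(requests_per_day: int, seconds_per_request: float,
                   usd_per_hour: float) -> float:
    """Serverless daily cost: pay only for GPU seconds actually used."""
    gpu_seconds = requests_per_day * seconds_per_request
    return round(gpu_seconds / 3600 * usd_per_hour, 2)

daily_gpu_cost(1000, 3, 0.66)   # 1000 req/day at 3 s on an L4 -> 0.55
round(24 * 0.66, 2)             # dedicated L4 running 24/7   -> 15.84
```

The break-even point is easy to find the same way: the dedicated instance only wins once your workload keeps the GPU busy most of the day.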
## Next Steps

- Wrap `generate.remote()` in a FastAPI endpoint for an OpenAI-compatible API
- Try Llama 3 70B with 2× A100 by setting `gpu="A100", gpu_count=2`
- Add streaming responses with `generate.remote_gen()`
- Cache the model weights in a volume with `volume_size_gb=100` to skip re-downloading on cold starts
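For the first of those steps, the endpoint needs to return responses in the OpenAI chat-completions shape. A minimal sketch of that payload using only the standard library — trimmed to the fields most clients read, so treat it as an approximation of the schema rather than the full spec:

```python
import time
import uuid

def chat_completion_response(model: str, text: str) -> dict:
    """Build a minimal OpenAI-style chat completion payload."""
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex[:12]}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": text},
            "finish_reason": "stop",
        }],
    }
```

A FastAPI route would call `generate.remote(prompt)` and pass the result through this builder before returning it as JSON.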