Model fine-tuning on GPU

Fine-tune foundation models on your own data with managed training jobs. Velar provisions single or multi-GPU clusters, handles checkpointing to persistent storage, and streams training metrics in real time. Pay only for the GPU seconds used.

What Velar provides

Fine-tuning requires long-running jobs, persistent storage for checkpoints, and often multiple GPUs. Velar handles all of this so you can focus on your training loop.

  • Single and multi-GPU jobs (up to 8x A100/H100)
  • Persistent volume storage for checkpoints at /data
  • Long-running jobs with configurable timeout
  • Per-second billing — pay only for training time
  • Stream training logs in real time
  • Works with Hugging Face Trainer, TRL, Accelerate, DeepSpeed
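
Per-second billing makes job costs easy to estimate: multiply wall-clock training time by the hourly GPU rate and the GPU count. A quick sketch using rates from the pricing table further down:

```python
# Estimate the cost of a training job under per-second billing.
def job_cost(hours: float, hourly_rate: float, gpu_count: int = 1) -> float:
    seconds = hours * 3600
    per_second_rate = hourly_rate / 3600
    return round(seconds * per_second_rate * gpu_count, 2)

# A 3-hour LoRA run on one L4 at $0.66/hr:
print(job_cost(3, 0.66))               # 1.98
# A 4-hour full fine-tune on 4x A100 at $2.36/hr per GPU:
print(job_cost(4, 2.36, gpu_count=4))  # 37.76
```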

Supported training approaches

LoRA / QLoRA

Parameter-efficient fine-tuning. Trains faster and fits on smaller GPUs.
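
LoRA's savings come from training only small low-rank adapter matrices while the base weights stay frozen. A back-of-the-envelope sketch, assuming a Llama-2-7B-like shape (32 layers, hidden size 4096) and the r=16, q_proj/v_proj configuration used in the example further down:

```python
# LoRA adds two low-rank matrices (A: r x d_in, B: d_out x r) per target matrix.
hidden = 4096      # hidden size (Llama-2-7B-like; assumption)
layers = 32        # transformer layers
r = 16             # LoRA rank
targets = 2        # q_proj and v_proj

params_per_matrix = r * hidden + hidden * r   # A plus B
lora_params = layers * targets * params_per_matrix
print(f"{lora_params:,}")          # 8,388,608 -> roughly 8.4M trainable params
print(f"{lora_params / 7e9:.4%}")  # roughly 0.12% of a 7B base model
```

Only these adapter parameters need gradients and optimizer state, which is why LoRA fits on far smaller GPUs than full fine-tuning.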

Full fine-tuning (SFT)

Update all model weights. Requires more VRAM and is best suited to smaller models.

RLHF / DPO

Preference-based alignment. Supported via TRL's PPO (RLHF) and DPO trainers.
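
DPO trains on preference pairs: TRL's DPOTrainer expects records with prompt, chosen, and rejected fields. A minimal sketch of preparing such a dataset as JSONL (the file name and contents are illustrative):

```python
import json

# Each record pairs a prompt with a preferred and a dispreferred response.
pairs = [
    {
        "prompt": "Summarize: The quick brown fox...",
        "chosen": "A fox jumps over a dog.",
        "rejected": "Foxes are mammals.",
    },
]

with open("preferences.jsonl", "w") as f:
    for record in pairs:
        f.write(json.dumps(record) + "\n")
```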

Distributed training

Multi-GPU with DeepSpeed ZeRO or Hugging Face Accelerate.
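
With the Hugging Face Trainer, DeepSpeed ZeRO is configured through a JSON file passed via the deepspeed parameter of TrainingArguments. A minimal ZeRO stage 2 sketch; the "auto" values are filled in from TrainingArguments by the integration:

```python
import json

# Minimal DeepSpeed ZeRO stage 2 config; "auto" values are inherited
# from TrainingArguments by the Hugging Face integration.
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": "auto"},
    "zero_optimization": {"stage": 2},
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
# Then pass deepspeed="ds_config.json" in TrainingArguments.
```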

Code examples

Two examples: LoRA fine-tuning on a single GPU, and full distributed training across multiple GPUs.

LoRA fine-tuning — single A100

train.py
import velar

app = velar.App("fine-tune")

image = velar.Image.from_registry(
    "pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime"
).pip_install("transformers", "peft", "datasets", "trl", "bitsandbytes")

@app.function(gpu="A100", timeout=7200, image=image)
def train(dataset_path: str, base_model: str = "meta-llama/Llama-2-7b-hf"):
    from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
    from trl import SFTTrainer
    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model

    model = AutoModelForCausalLM.from_pretrained(base_model)
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    dataset = load_dataset("json", data_files=dataset_path, split="train")

    lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
    model = get_peft_model(model, lora_config)

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        dataset_text_field="text",  # dataset column containing the training text
        args=TrainingArguments(
            output_dir="/data/checkpoints",  # persistent volume; survives the job
            num_train_epochs=3,
            per_device_train_batch_size=4,
            save_steps=500,
        ),
    )
    trainer.train()
    trainer.save_model("/data/final-model")  # final adapter persisted on the /data volume

app.deploy()
train.remote("s3://my-bucket/dataset.jsonl")
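
The dataset passed to train.remote should be JSON Lines, with each record carrying its full training example in a text field (the field name is an assumption; check the dataset_text_field default in your TRL version). A sketch of the expected format:

```python
import json

# One JSON object per line; each "text" field is a complete training example.
examples = [
    {"text": "### Instruction:\nExplain LoRA.\n### Response:\nLoRA trains low-rank adapters."},
    {"text": "### Instruction:\nWhat is SFT?\n### Response:\nSupervised fine-tuning."},
]

with open("dataset.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```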

Distributed training — 4x A100

distributed.py
import velar

app = velar.App("distributed-training")

image = velar.Image.from_registry(
    "pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime"
).pip_install("transformers", "accelerate", "deepspeed", "datasets")

# Multi-GPU training with 4x A100
@app.function(gpu="A100", gpu_count=4, timeout=14400, image=image)
def train_distributed(dataset_path: str):
    from transformers import AutoModelForCausalLM, TrainingArguments, Trainer
    from datasets import load_dataset

    # Trainer detects all visible GPUs and runs data-parallel training;
    # no explicit Accelerator setup is needed when using Trainer.
    model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
    dataset = load_dataset("json", data_files=dataset_path)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="/data/output",
            num_train_epochs=5,
            per_device_train_batch_size=2,
            gradient_accumulation_steps=8,
        ),
        train_dataset=dataset["train"],
    )
    trainer.train()
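
With data parallelism and gradient accumulation, the effective global batch size is the per-device batch size times the GPU count times the accumulation steps. For the settings in this example:

```python
# Effective global batch size for the 4x A100 example above.
per_device_batch = 2    # per_device_train_batch_size
gpu_count = 4           # gpu_count in the @app.function decorator
grad_accum_steps = 8    # gradient_accumulation_steps

global_batch = per_device_batch * gpu_count * grad_accum_steps
print(global_batch)  # 64
```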

GPU recommendations for fine-tuning

Training is more VRAM-intensive than inference. Use A100 or H100 for large models; L4 works well for LoRA on 7B models.

Task                 | Recommended GPU | Price
LoRA — 7B model      | L4 24GB         | $0.66/hr
Full SFT — 7B model  | A100 80GB       | $2.36/hr
Full SFT — 13B model | A100 80GB       | $2.36/hr
Full SFT — 70B model | 4× A100 80GB    | $9.44/hr
Large-scale training | H100 SXM 80GB   | $4.57/hr

Start your first training job

$10 in free GPU credits. No credit card required.