Fine-tune foundation models on your own data with managed training jobs. Velar provisions single or multi-GPU clusters, handles checkpointing to persistent storage, and streams training metrics in real time. Pay only for the GPU seconds used.
Fine-tuning requires long-running jobs, persistent storage for checkpoints, and often multiple GPUs. Velar handles all of this so you can focus on your training loop.
- **LoRA / QLoRA**: parameter-efficient fine-tuning. Trains faster and fits on smaller GPUs.
- **Full fine-tuning (SFT)**: updates all model weights. Requires more VRAM; best suited to smaller models.
- **RLHF / DPO**: preference-based alignment, supported via TRL's RLHF and DPO trainers.
- **Distributed training**: multi-GPU training with DeepSpeed ZeRO or Hugging Face Accelerate.
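To see why LoRA "fits on smaller GPUs", here is a back-of-envelope count of trainable parameters for a Llama-2-7B-sized model (hidden size 4096, 32 decoder layers) with the `r=16`, `q_proj`/`v_proj` configuration used in the example below. The arithmetic is a rough sketch, not a measurement:

```python
# Back-of-envelope LoRA trainable-parameter count for a Llama-2-7B-sized
# model. Hidden size and layer count are the model's architecture; the
# calculation itself is illustrative.
hidden = 4096          # hidden dimension of Llama-2-7B
layers = 32            # decoder layers
r = 16                 # LoRA rank, as in the example below
target_modules = 2     # q_proj and v_proj

# Each adapted linear layer gains two low-rank matrices:
# A (r x hidden) and B (hidden x r).
per_module = r * hidden + hidden * r       # 131,072 params per module
trainable = per_module * target_modules * layers

print(trainable)          # 8,388,608 trainable params
print(trainable / 7e9)    # ~0.12% of the 7B base weights
```

Roughly 8M trainable parameters instead of 7B is why gradients and optimizer states fit comfortably on a 24GB card.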
The examples below cover LoRA fine-tuning on a single GPU and full distributed training on multiple GPUs.
LoRA fine-tuning — single A100

```python
import velar

app = velar.App("fine-tune")

image = velar.Image.from_registry(
    "pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime"
).pip_install("transformers", "peft", "datasets", "trl", "bitsandbytes")

@app.function(gpu="A100", timeout=7200, image=image)
def train(dataset_path: str, base_model: str = "meta-llama/Llama-2-7b-hf"):
    from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
    from trl import SFTTrainer
    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model

    model = AutoModelForCausalLM.from_pretrained(base_model)
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    dataset = load_dataset("json", data_files=dataset_path, split="train")

    # Attach LoRA adapters to the attention projections only.
    lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
    model = get_peft_model(model, lora_config)

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        args=TrainingArguments(
            output_dir="/data/checkpoints",
            num_train_epochs=3,
            per_device_train_batch_size=4,
            save_steps=500,
        ),
    )
    trainer.train()
    trainer.save_model("/data/final-model")
```
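`load_dataset("json", data_files=...)` treats each line of the file as one JSON record. A minimal sketch of producing a compatible JSONL file (the records and the `text` field name are illustrative; `text` is the conventional default for SFT-style datasets, but match whatever field your trainer is configured to read):

```python
import json

# Hypothetical training records for illustration.
examples = [
    {"text": "### Instruction:\nSummarize the report.\n### Response:\nA short summary."},
    {"text": "### Instruction:\nGreet the user.\n### Response:\nHello!"},
]

# JSONL: one JSON object per line.
with open("dataset.jsonl", "w") as f:
    for record in examples:
        f.write(json.dumps(record) + "\n")

# Each line parses back to one training example.
with open("dataset.jsonl") as f:
    rows = [json.loads(line) for line in f]
print(len(rows))  # 2
```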
Deploy the app and launch a training run:

```python
app.deploy()
train.remote("s3://my-bucket/dataset.jsonl")
```

Distributed training — 4x A100
```python
import velar

app = velar.App("distributed-training")

image = velar.Image.from_registry(
    "pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime"
).pip_install("transformers", "accelerate", "deepspeed", "datasets")

# Multi-GPU training with 4x A100
@app.function(gpu="A100", gpu_count=4, timeout=14400, image=image)
def train_distributed(dataset_path: str):
    from transformers import AutoModelForCausalLM, TrainingArguments, Trainer
    from datasets import load_dataset

    model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
    dataset = load_dataset("json", data_files=dataset_path)

    # Trainer handles multi-GPU setup itself (via accelerate/deepspeed
    # under the hood); no manual process-group code is needed.
    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="/data/output",
            num_train_epochs=5,
            per_device_train_batch_size=2,
            gradient_accumulation_steps=8,
        ),
        train_dataset=dataset["train"],
    )
    trainer.train()
```

Training is far more VRAM-intensive than inference. Use an A100 or H100 for large models; an L4 works well for LoRA on 7B models.
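One detail of the config above worth spelling out: the effective global batch size is the per-device batch size times the gradient accumulation steps times the GPU count.

```python
# Effective global batch size for the 4x A100 config above.
per_device_train_batch_size = 2
gradient_accumulation_steps = 8
gpu_count = 4

effective_batch = per_device_train_batch_size * gradient_accumulation_steps * gpu_count
print(effective_batch)  # 64
```

Raising `gradient_accumulation_steps` lets you grow the effective batch without growing per-GPU memory use, at the cost of fewer optimizer steps per epoch.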
| Task | Recommended GPU | Price |
|---|---|---|
| LoRA — 7B model | L4 24GB | $0.66/hr |
| Full SFT — 7B model | A100 80GB | $2.36/hr |
| Full SFT — 13B model | A100 80GB | $2.36/hr |
| Full SFT — 70B model | 4× A100 80GB | $9.44/hr |
| Large scale training | H100 SXM 80GB | $4.57/hr |
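The table's recommendations follow from rough per-parameter memory rules of thumb: mixed-precision Adam full fine-tuning costs on the order of 16 bytes per parameter (fp16 weights and gradients plus fp32 optimizer states and master weights), while LoRA keeps the base model frozen at ~2 bytes per parameter plus a small adapter overhead. A hypothetical back-of-envelope helper (activations ignored; real jobs shrink the full-FT number with gradient checkpointing, 8-bit optimizers, or ZeRO sharding, which is how 7B full SFT can fit a single 80GB A100):

```python
# Rough VRAM estimate in GB for fine-tuning, using rule-of-thumb
# bytes-per-parameter figures (illustrative, not measured):
#   full: ~16 B/param (fp16 weights + grads, fp32 Adam states + master weights)
#   lora: ~2.5 B/param (frozen fp16 base + small adapter and its optimizer states)
def estimate_vram_gb(params_billion: float, method: str = "full") -> float:
    bytes_per_param = {"full": 16, "lora": 2.5}[method]
    return params_billion * bytes_per_param

print(estimate_vram_gb(7, "full"))  # 112.0 -> needs sharding/offload or multi-GPU
print(estimate_vram_gb(7, "lora"))  # 17.5  -> fits an L4 24GB
```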