Fine-Tuning Parameters for vLLM on Gaudi3, H200, and MI300X

The rise of large language models (LLMs) has driven significant demand for efficient inference and fine-tuning frameworks. One such framework, vLLM, is optimised for high-performance serving with PagedAttention, allowing for memory-efficient execution across diverse hardware architectures. With the introduction of new AI accelerators such as Gaudi3, H200, and MI300X, optimising fine-tuning parameters is essential to leverage their full potential.

Understanding vLLM and Fine-Tuning

vLLM is an open-source library designed for optimised, high-throughput LLM inference and serving. Its PagedAttention-based memory management improves efficiency over naive serving implementations that allocate contiguous KV-cache memory per request. Fine-tuning itself typically runs in frameworks such as PyTorch, DeepSpeed, or Hugging Face Transformers, but parameters chosen there, such as batch size, sequence length, precision, and optimiser settings, play a critical role in performance and carry over into how the resulting model is served with vLLM. A minimal usage sketch follows.
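
As a concrete reference point, the snippet below loads a fine-tuned checkpoint into vLLM and runs a single generation. It is a minimal sketch: the checkpoint name is a placeholder, and the dtype and tensor_parallel_size values are illustrative rather than recommendations.

    # Minimal vLLM serving sketch. Assumes vLLM is installed and the placeholder
    # checkpoint below is available locally or on the Hugging Face Hub.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-2-7b-hf",   # placeholder: substitute your fine-tuned checkpoint
        dtype="bfloat16",                   # match the precision used during fine-tuning
        tensor_parallel_size=1,             # raise for multi-accelerator serving
    )

    outputs = llm.generate(
        ["Explain PagedAttention in one sentence."],
        SamplingParams(temperature=0.7, max_tokens=64),
    )
    print(outputs[0].outputs[0].text)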

Fine-Tuning on Gaudi3

Gaudi3, developed by Habana Labs (an Intel company), is designed for AI workloads with enhanced compute density and efficient BF16/FP8 support. When fine-tuning LLMs on Gaudi3 with vLLM, consider the following:

  • Precision Settings: Use BF16 (torch.bfloat16) for optimal performance, as it balances speed and numerical stability.
  • Batch Size:
    • Small LLMs (7B-30B): batch_size=128
    • Large LLMs (>400B): batch_size=8-16, adjusting based on available memory.
  • Optimisers: Use the LAMB optimiser (lr=5e-4, weight_decay=0.01), which maps well to the Gaudi architecture. Note that LAMB is not part of core PyTorch; on Gaudi it is typically available as a fused implementation in Habana’s PyTorch bridge, or via third-party packages such as torch-optimizer. See the sketch after this list.
  • Parallelism:
    • Small LLMs: tensor_parallel_size=8
    • Large LLMs: tensor_parallel_size=16+ to better utilise distributed compute.
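
The sketch below illustrates a BF16 training step with a fused LAMB optimiser on a Gaudi device. It assumes the habana_frameworks PyTorch bridge is installed; the FusedLamb import path, the placeholder checkpoint, and the train_dataloader object are assumptions that may differ across SynapseAI releases.

    # Hedged sketch: BF16 fine-tuning step on Gaudi. Assumes Habana's PyTorch
    # bridge (habana_frameworks) is installed; import paths may vary by release.
    import torch
    import habana_frameworks.torch.core as htcore                  # Gaudi lazy-mode bridge (assumed)
    from habana_frameworks.torch.hpex.optimizers import FusedLamb  # Gaudi-fused LAMB (assumed)
    from transformers import AutoModelForCausalLM

    device = torch.device("hpu")                                   # Gaudi devices register as "hpu"
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",                                # placeholder checkpoint
        torch_dtype=torch.bfloat16,                                # BF16 as recommended above
    ).to(device)

    optimizer = FusedLamb(model.parameters(), lr=5e-4, weight_decay=0.01)

    for batch in train_dataloader:                                 # train_dataloader assumed defined elsewhere
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch).loss
        loss.backward()
        htcore.mark_step()                                         # flush the lazy-mode graph after backward
        optimizer.step()
        optimizer.zero_grad()
        htcore.mark_step()                                         # and again after the optimiser step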

Fine-Tuning on H200

NVIDIA’s H200 builds upon the H100 architecture with enhanced memory bandwidth and increased VRAM capacity, making it ideal for fine-tuning LLMs with larger context windows. Key tuning parameters include:

  • Mixed Precision: Use FP8 where supported to maximise throughput while maintaining accuracy; in vLLM this is exposed through FP8 quantisation and an FP8 KV cache (e.g. kv_cache_dtype="fp8_e5m2") rather than a model-wide torch.float8 dtype.
  • KV Cache Optimisation: PagedAttention is vLLM’s default attention mechanism, so no flag is needed to enable it; tune gpu_memory_utilization and block_size instead (see the configuration sketch after this list).
  • TensorRT-LLM: NVIDIA’s TensorRT-LLM is a separate inference stack rather than a vLLM setting; evaluate it alongside vLLM when choosing a serving backend for H200.
  • Microbatching:
    • Small LLMs: gradient_accumulation_steps=4
    • Large LLMs: gradient_accumulation_steps=32+ to work within memory constraints.
  • Sequence Length:
    • Small LLMs: max_sequence_length=2048
    • Large LLMs: max_sequence_length=8192+, depending on dataset requirements.
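
A hedged configuration sketch for serving on an H200-class GPU is shown below. The checkpoint name is a placeholder, and FP8 KV-cache support depends on the vLLM version and build; treat the values as starting points rather than recommendations.

    # Hedged sketch: vLLM engine configuration for an H200-class GPU.
    # FP8 KV cache requires a vLLM build with FP8 support; use "auto" otherwise.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
        dtype="bfloat16",                          # weight/activation precision
        kv_cache_dtype="fp8_e5m2",                 # FP8 KV cache to stretch the larger HBM
        max_model_len=8192,                        # long-context serving, per the list above
        gpu_memory_utilization=0.90,               # leave headroom for activation spikes
    )

    print(llm.generate(["ping"], SamplingParams(max_tokens=8))[0].outputs[0].text)

On the training side, gradient_accumulation_steps is a training-loop (or Hugging Face TrainingArguments) setting rather than a vLLM parameter; it only affects how the fine-tuned weights are produced.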

Fine-Tuning on MI300X

AMD’s MI300X is optimised for AI workloads, most notably through its very large HBM3 capacity (192 GB per accelerator) exposed as a single memory pool. To fine-tune LLMs efficiently on MI300X:

  • Precision Mode: Utilise FP16 (torch.float16) or BF16 (torch.bfloat16) for optimal performance, avoiding excessive reliance on FP32 computations.
  • Memory Management: The large HBM pool reduces the pressure to offload to host memory; on the input side, set prefetch_factor=4 in data loaders to keep batches staged ahead of the device.
  • Optimiser Selection: AdamW with a fused implementation (torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), eps=1e-8, fused=True), where the installed ROCm build supports the fused path) can improve performance; a DeepSpeed-style sharding sketch follows this list.
  • Distributed Training:
    • Small LLMs: zero_stage=2
    • Large LLMs: zero_stage=3, using ZeRO-powered sharding to reduce memory overhead across multiple MI300X GPUs.
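
The snippet below sketches a DeepSpeed-style configuration for ZeRO sharding across multiple MI300X GPUs. The stage, batch sizes, and optimiser values mirror the list above but are illustrative, and the config keys should be checked against the installed DeepSpeed version on ROCm.

    # Hedged sketch: DeepSpeed ZeRO configuration for multi-GPU MI300X training.
    # Values are illustrative; verify key names against your DeepSpeed version.
    import deepspeed
    from transformers import AutoModelForCausalLM

    ds_config = {
        "train_micro_batch_size_per_gpu": 8,
        "gradient_accumulation_steps": 4,
        "bf16": {"enabled": True},            # BF16 as recommended above
        "zero_optimization": {"stage": 3},    # use stage 2 for smaller models
        "optimizer": {
            "type": "AdamW",
            "params": {"lr": 3e-4, "betas": [0.9, 0.95], "eps": 1e-8, "weight_decay": 0.01},
        },
    }

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder checkpoint
    engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )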

General Best Practices for vLLM Fine-Tuning

Regardless of the hardware platform, the following best practices help optimise fine-tuning performance (a combined sketch follows the list):

  • Gradient Checkpointing:
    • Small LLMs: use_gradient_checkpointing=True
    • Large LLMs: Required (use_gradient_checkpointing=True) to manage memory consumption efficiently.
  • Efficient Data Loading: Prefetch data using num_workers=8 and pin_memory=True to avoid bottlenecks.
  • Hyperparameter Tuning:
    • Small LLMs: Experiment with lr=[1e-4, 5e-4, 1e-3], weight_decay=0.01, and warmup_steps=1000.
    • Large LLMs: Use a lower learning rate (lr=3e-5) with longer warm-up steps (warmup_steps=5000+).
  • Profiling and Monitoring: Use the performance profiling tools specific to each hardware vendor (e.g., the Intel Gaudi profiling tools, NVIDIA Nsight Systems, and AMD’s rocprof, with rocm-smi for device monitoring) to identify bottlenecks.
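
To tie several of these practices together, the sketch below combines gradient checkpointing, a prefetching data loader, and a warm-up learning-rate schedule in a hardware-agnostic setup. The checkpoint name, train_dataset object, and step counts are placeholders.

    # Hedged sketch: hardware-agnostic fine-tuning setup combining the practices above.
    import torch
    from torch.utils.data import DataLoader
    from transformers import AutoModelForCausalLM, get_linear_schedule_with_warmup

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder checkpoint
    model.gradient_checkpointing_enable()      # trade recompute for activation memory

    train_loader = DataLoader(
        train_dataset,                         # placeholder dataset object
        batch_size=8,
        shuffle=True,
        num_workers=8,                         # parallel data loading
        pin_memory=True,                       # faster host-to-device copies
    )

    total_steps = 10_000                       # placeholder training length
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=1000, num_training_steps=total_steps
    )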

Conclusion

Fine-tuning LLMs with vLLM on Gaudi3, H200, and MI300X requires an understanding of each platform’s architectural strengths. By optimising precision settings, parallelism, memory management, and optimiser choices, practitioners can significantly enhance training efficiency and model performance. Large LLMs introduce additional challenges, requiring more careful tuning of batch size, gradient accumulation, and memory allocation. As hardware and software ecosystems continue to evolve, staying informed on the latest optimisations will be crucial for maximising AI workload efficiency.