Fine-Tuning Parameters for vLLM on Gaudi3, H200, and MI300X

The rise of large language models (LLMs) has driven significant demand for efficient inference and fine-tuning frameworks. One such framework, vLLM, is optimised for high-performance serving with PagedAttention, allowing for memory-efficient execution across diverse hardware architectures. With the introduction of new AI accelerators such as Gaudi3, H200, and MI300X, optimising fine-tuning parameters is essential to leverage their full potential.
Understanding vLLM and Fine-Tuning
vLLM is an open-source library designed for optimised inference and fine-tuning of LLMs. Its PagedAttention-based memory management reduces KV-cache fragmentation and improves efficiency over naive PyTorch attention implementations. When fine-tuning models on specific hardware architectures, parameters such as batch size, sequence length, and optimiser settings play a critical role in performance.
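For orientation, the snippet below shows vLLM's basic offline-generation entry point, following the pattern in vLLM's quickstart. The model name and sampling settings are placeholders, and exact constructor arguments can vary between vLLM versions.

```python
from vllm import LLM, SamplingParams

# Small model used purely as a placeholder; swap in the model you intend
# to serve or evaluate after fine-tuning.
llm = LLM(model="facebook/opt-125m", tensor_parallel_size=1)

sampling_params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["Explain PagedAttention in one sentence."], sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```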
Fine-Tuning on Gaudi3
Gaudi3, developed by Habana Labs (an Intel company), is designed for AI workloads with enhanced compute density and efficient BF16/FP8 support. When fine-tuning LLMs on Gaudi3 with vLLM, consider the following:
- Precision Settings: Use BF16 (`torch.bfloat16`) for optimal performance, as it balances speed and numerical stability (see the configuration sketch after this list).
- Batch Size:
  - Small LLMs (7B-30B): `batch_size=128`
  - Large LLMs (>400B): `batch_size=8-16`, adjusting based on available memory.
- Optimisers: Use the LAMB optimiser with `lr=5e-4` and `weight_decay=0.01`; LAMB is not part of core `torch.optim`, but a fused implementation optimised for the Gaudi architecture ships with the Gaudi software stack.
- Parallelism:
  - Small LLMs: `tensor_parallel_size=8`
  - Large LLMs: `tensor_parallel_size=16+` to better utilise distributed compute.
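Pulling these settings together, the sketch below records them as a plain-Python configuration. The dictionary name `gaudi3_finetune_config` and its keys are illustrative only, not part of vLLM's or the Gaudi stack's API, and the LAMB entry just stores the hyperparameters; the fused LAMB class itself would come from the Gaudi software stack.

```python
import torch

# Hypothetical configuration dictionary; key names are illustrative only.
gaudi3_finetune_config = {
    # BF16 keeps FP32's dynamic range at half the memory cost.
    "dtype": torch.bfloat16,
    # Small LLMs (7B-30B); drop towards 8-16 for very large models.
    "batch_size": 128,
    # LAMB hyperparameters from the text; the fused LAMB implementation
    # itself is supplied by the Gaudi software stack, not core torch.
    "optimizer": {"name": "LAMB", "lr": 5e-4, "weight_decay": 0.01},
    # Tensor parallelism across one Gaudi3 node; use 16+ for very large models.
    "tensor_parallel_size": 8,
}

print(gaudi3_finetune_config)
```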
Fine-Tuning on H200
NVIDIA’s H200 builds upon the H100 architecture with enhanced memory bandwidth and increased VRAM capacity, making it ideal for fine-tuning LLMs with larger context windows. Key tuning parameters include:
- Mixed Precision: Use FP8 where supported (e.g. `dtype=torch.float8_e5m2`) to maximise throughput while maintaining accuracy.
- KV Cache Optimisation: Enable memory-efficient attention mechanisms with `paged_attention=True`.
- TensorRT-LLM Integration: Optimise execution using TensorRT-LLM with `trt_llm.enable=True`.
- Microbatching (see the gradient-accumulation sketch after this list):
  - Small LLMs: `gradient_accumulation_steps=4`
  - Large LLMs: `gradient_accumulation_steps=32+` to work within memory constraints.
- Sequence Length:
  - Small LLMs: `max_sequence_length=2048`
  - Large LLMs: `max_sequence_length=8192+`, depending on dataset requirements.
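The sketch below illustrates the microbatching idea in plain PyTorch: gradients from several small micro-batches are accumulated before a single optimiser step. The model, data, and step counts are placeholders, and nothing here is H200-specific or part of the vLLM or TensorRT-LLM APIs.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(4096, 4096).to(device)          # stand-in for an LLM block
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
grad_accum_steps = 4                              # 32+ for very large models

# Dummy data: eight micro-batches of (inputs, targets).
loader = [(torch.randn(8, 4096), torch.randn(8, 4096)) for _ in range(8)]

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    inputs, targets = inputs.to(device), targets.to(device)
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        loss = nn.functional.mse_loss(model(inputs), targets)
    # Scale the loss so the accumulated gradient matches one large batch.
    (loss / grad_accum_steps).backward()
    if (step + 1) % grad_accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```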
Fine-Tuning on MI300X
AMD’s MI300X offers a unique architecture optimised for AI workloads, particularly with its unified memory design. To fine-tune LLMs efficiently on MI300X:
- Precision Mode: Utilise FP16 (`torch.float16`) or BF16 (`torch.bfloat16`) for optimal performance, avoiding excessive reliance on FP32 computations.
- Memory Management: MI300X benefits from Unified Memory, so set `prefetch_factor=4` in data loaders to ensure efficient pre-loading of data (see the data-loading sketch after this list).
- Optimiser Selection: AdamW with fused implementations (`torch.optim.AdamW(lr=3e-4, betas=(0.9, 0.95), eps=1e-8)`) can improve performance on ROCm.
- Distributed Training:
  - Small LLMs: `zero_stage=2`
  - Large LLMs: `zero_stage=3`, using ZeRO-powered sharding to reduce memory overhead across multiple MI300X GPUs.
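A minimal sketch of how these settings might look in code follows. The dataset and model are placeholders, the AdamW call simply fills in the hyperparameters quoted above, and the ZeRO dictionary assumes the `zero_stage` knob maps onto DeepSpeed-style ZeRO stages; none of this is an MI300X- or vLLM-specific API.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for tokenised fine-tuning data.
dataset = TensorDataset(torch.randn(1024, 4096), torch.randn(1024, 4096))
loader = DataLoader(
    dataset,
    batch_size=16,
    num_workers=8,       # prefetch_factor only applies when num_workers > 0
    prefetch_factor=4,   # pre-load four batches per worker
    pin_memory=True,
)

model = torch.nn.Linear(4096, 4096)
optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4, betas=(0.9, 0.95), eps=1e-8
)

# Illustrative ZeRO sharding settings in a DeepSpeed-style config dictionary,
# assuming `zero_stage` above refers to ZeRO stages (2 for small LLMs, 3 for large).
zero_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 3},
}
```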
General Best Practices for vLLM Fine-Tuning
Regardless of the hardware platform, the following best practices help optimise fine-tuning performance:
- Gradient Checkpointing:
  - Small LLMs: `use_gradient_checkpointing=True`
  - Large LLMs: Required (`use_gradient_checkpointing=True`) to manage memory consumption efficiently.
- Efficient Data Loading: Prefetch data using `num_workers=8` and `pin_memory=True` to avoid bottlenecks.
- Hyperparameter Tuning (see the warm-up schedule sketch after this list):
  - Small LLMs: Experiment with `lr=[1e-4, 5e-4, 1e-3]`, `weight_decay=0.01`, and `warmup_steps=1000`.
  - Large LLMs: Use a lower learning rate (`lr=3e-5`) with longer warm-up steps (`warmup_steps=5000+`).
- Profiling and Monitoring: Use performance profiling tools specific to each hardware vendor (e.g., `habana_profiler`, NVIDIA Nsight Systems (`nsys`), and `rocm-smi`) to identify bottlenecks.
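As one concrete example of the warm-up settings above, the sketch below builds a linear warm-up schedule in plain PyTorch. The model is a placeholder, the step count uses the large-LLM value suggested in the list, and this is not a vLLM API.

```python
import torch

model = torch.nn.Linear(1024, 1024)                         # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)  # large-LLM base LR
warmup_steps = 5000

def warmup_then_constant(step: int) -> float:
    # Linearly ramp the learning-rate multiplier to 1.0 over `warmup_steps`,
    # then hold it constant.
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_then_constant)

# Inside the training loop, call optimizer.step() followed by scheduler.step().
```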
Conclusion
Fine-tuning LLMs with vLLM on Gaudi3, H200, and MI300X requires an understanding of each platform’s architectural strengths. By optimising precision settings, parallelism, memory management, and optimiser choices, practitioners can significantly enhance training efficiency and model performance. Large LLMs introduce additional challenges, requiring more careful tuning of batch size, gradient accumulation, and memory allocation. As hardware and software ecosystems continue to evolve, staying informed on the latest optimisations will be crucial for maximising AI workload efficiency.