Introduction to fsdp_qlora
The fsdp_qlora project focuses on training large language models (LLMs) by combining Quantized LoRA (QLoRA) with PyTorch's Fully Sharded Data Parallel (FSDP). The project is in its early stages, so users should be prepared to test and debug as the community continues to refine the approach.
Integrations
FSDP+QLoRA integration is available on platforms such as Axolotl, which currently offers experimental support.
Installation Steps
To get started with fsdp_qlora, follow these installation steps. First, clone the fsdp_qlora repository from GitHub. Then install the required packages with pip, including llama-recipes, fastcore, and the specific transformers versions the project expects. Install bitsandbytes as well, and log in via the huggingface-cli to access Llama 2. Optional libraries such as HQQ (quantization) and Weights & Biases (logging) can be installed based on user preference. For optimal performance, PyTorch 2.2 or higher is recommended.
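For reference, the same sequence can be scripted. The sketch below is an illustration only: the repository URL and the PyPI names hqq and wandb for the optional extras are assumptions, and the exact transformers version pins are omitted.

```python
# Setup sketch; the project README gives these steps as shell commands.
# Repository URL and optional package names are assumptions.
import subprocess
import sys

def pip_install(*packages: str) -> None:
    subprocess.run([sys.executable, "-m", "pip", "install", *packages], check=True)

# 1. Clone the repository (URL assumed).
subprocess.run(["git", "clone", "https://github.com/AnswerDotAI/fsdp_qlora.git"], check=True)

# 2. Core dependencies named above (version pins omitted here).
pip_install("llama-recipes", "fastcore", "transformers", "bitsandbytes")

# 3. Optional extras: HQQ quantization and Weights & Biases logging.
pip_install("hqq")    # assumed PyPI name for the HQQ library
pip_install("wandb")

# 4. Log in to the Hugging Face Hub to access the gated Llama 2 weights
#    (the Python equivalent of `huggingface-cli login`).
from huggingface_hub import login
login()
```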
Finetuning Llama-2 70B
The project provides instructions for finetuning the Llama-2 70B model on two 24GB GPUs. Users navigate to the fsdp_qlora directory and run the training script with the desired model, dataset, and training parameters. Notably, this procedure requires over 128GB of CPU RAM, so users with less RAM are advised to create a swap file to absorb peak usage.
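A hedged launch sketch follows, assuming train.py is the entry point and that flags such as --train_type, --precision, and --use_cpu_offload exist in the current CLI; check the repository for the actual options before running.

```python
# Hypothetical launch of the dual-24GB-GPU QLoRA finetune described above.
# Flag names and values are assumptions about train.py's CLI and may differ
# from the current repository.
import subprocess

cmd = [
    "python", "train.py",
    "--model_name", "meta-llama/Llama-2-70b-hf",
    "--train_type", "qlora",          # 4-bit quantized LoRA
    "--precision", "bf16",
    "--batch_size", "2",
    "--context_length", "512",
    "--use_gradient_checkpointing", "true",
    "--use_cpu_offload", "true",      # needed to fit a 70B model on 2x 24GB GPUs
    "--dataset", "alpaca",
]
subprocess.run(cmd, check=True, cwd="fsdp_qlora")
```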
Training Options
The fsdp_qlora project supports various training types, each with distinct parameters:
- Full Parameter Fine-Tuning: Updates all model parameters.
- LoRA Fine-Tuning: Uses the Hugging Face PEFT library for low-rank adapter tuning.
- Quantized LoRA (QLoRA): Uses bitsandbytes 4-bit quantization to cut memory usage while keeping training efficient.
- Custom LoRA and DoRA Variations: Allow more tailored fine-tuning using custom LoRA/DoRA modules on top of quantized base layers (a minimal sketch follows this list).
- Llama-Pro Fine-Tuning: Offers quantized training using either the bitsandbytes or HQQ libraries.
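The custom variants can be pictured as a frozen quantized base layer with a small trainable adapter on top. The sketch below illustrates that idea only, using bitsandbytes' Linear4bit; it is not the project's actual module.

```python
# Illustration of the idea behind the custom QLoRA options: a frozen 4-bit
# base layer plus a trainable low-rank (LoRA) adapter. Not the project's code.
import torch
import torch.nn as nn
import bitsandbytes as bnb

class LoRAOverQuantizedLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, rank: int = 16, alpha: int = 16):
        super().__init__()
        # Frozen NF4-quantized base weight, as in QLoRA-style training.
        self.base = bnb.nn.Linear4bit(
            in_features, out_features, bias=False,
            quant_type="nf4", compute_dtype=torch.bfloat16,
        )
        for p in self.base.parameters():
            p.requires_grad = False
        # Trainable low-rank adapter; B starts at zero so training begins
        # from the base model's behavior.
        self.lora_a = nn.Linear(in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```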
For users facing memory constraints, fsdp_qlora provides low-memory loading strategies that avoid loading the full model into GPU memory.
Mixed Precision Training
fsdp_qlora supports several precision settings to balance performance against hardware limits, as sketched below:
- bfloat16: Casts parameters to bfloat16 for all training operations.
- float32: Keeps parameters and computation in float32 when full precision is required.
- Mixed Precision with Autocast: Keeps master weights in higher precision while autocast runs eligible operations in lower precision, balancing memory use against numerical stability.
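These modes map roughly onto standard PyTorch primitives. The sketch below shows that mapping without claiming it is exactly how fsdp_qlora configures FSDP internally.

```python
# Hedged illustration of how the three precision modes correspond to PyTorch
# primitives (FSDP MixedPrecision + autocast); not fsdp_qlora's internal wiring.
import torch
from torch.distributed.fsdp import MixedPrecision

# "bfloat16": parameters, gradient reduction, and buffers all in bf16.
pure_bf16 = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)

# "float32": leave everything in fp32 (simply omit a MixedPrecision policy).

# "mixed precision with autocast": keep master weights as-is and let autocast
# run eligible ops in bf16 during the forward/backward pass.
def forward_with_autocast(model, batch):
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        return model(**batch)
```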
Limitations and Considerations
The current release has some limitations. Transformers' AutoModel.from_pretrained cannot be used to load the quantized weights directly, so custom loading scripts are required. In addition, caution is advised with mixed precision: dtype settings must be kept consistent to prevent unintended weight casting.
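As an illustration of what such a custom loading path can look like, the sketch below builds the model skeleton on the meta device and streams weights in from safetensors shards; the checkpoint directory is a placeholder and the quantization step is only noted in a comment.

```python
# Hedged sketch of a custom loading path that avoids AutoModel.from_pretrained:
# build the model without allocating weights, then load checkpoint shards
# manually. Paths are placeholders; a real pipeline would also quantize the
# linear weights (e.g., to 4-bit) before FSDP wraps the model.
import torch
from pathlib import Path
from safetensors.torch import load_file
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("meta-llama/Llama-2-70b-hf")
with torch.device("meta"):
    model = AutoModelForCausalLM.from_config(config)  # no weight memory yet

checkpoint_dir = Path("/path/to/llama-2-70b-hf")  # placeholder local snapshot
state_dict = {}
for shard in sorted(checkpoint_dir.glob("*.safetensors")):
    state_dict.update(load_file(shard, device="cpu"))

# assign=True re-uses the loaded CPU tensors in place of the meta parameters;
# strict=False tolerates version-dependent extra keys in older checkpoints.
model.load_state_dict(state_dict, assign=True, strict=False)
```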
Example and Advanced Configurations
The project outlines a comprehensive example for training models like Llama 70B on four 40GB A100 GPUs with either BnB QLoRA or HQQ QLoRA settings. It also describes a SLURM-based multi-node setup for distributed training at larger scale.
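For the multi-node case, the essential step is mapping SLURM's environment variables onto the ones torch.distributed expects. A minimal sketch, assuming one process per GPU; this is not fsdp_qlora's own launcher.

```python
# Minimal sketch of initializing torch.distributed from SLURM-provided
# environment variables (one task per GPU).
import os
import torch
import torch.distributed as dist

def init_distributed_from_slurm(master_addr: str, master_port: int = 29500) -> None:
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = str(master_port)
    os.environ["RANK"] = os.environ["SLURM_PROCID"]         # global rank
    os.environ["WORLD_SIZE"] = os.environ["SLURM_NTASKS"]   # total processes
    os.environ["LOCAL_RANK"] = os.environ["SLURM_LOCALID"]  # rank within the node
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    dist.init_process_group(backend="nccl")
```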
Expanding Model Support
The guide includes instructions for integrating new models by adapting the relevant transformer block, attention, and MLP layers from the Transformers library and applying the appropriate FSDP wrapping policies within the fsdp_qlora framework.
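A central piece of that integration is telling FSDP which transformer block class to shard at. A minimal sketch follows, using Mistral's decoder layer as a stand-in for a newly added model; it is not the project's exact policy.

```python
# Minimal sketch of an FSDP auto-wrap policy for a newly supported model:
# shard at the granularity of the model's transformer block class. The
# MistralDecoderLayer import is a stand-in; substitute the new model's block.
import functools
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers.models.mistral.modeling_mistral import MistralDecoderLayer

wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={MistralDecoderLayer},
)

# Later: model = FSDP(model, auto_wrap_policy=wrap_policy, ...)
```

In practice the same block class is often reused when applying activation checkpointing, so that checkpointing and sharding operate at the same granularity.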
In summary, fsdp_qlora is a sophisticated project aimed at optimizing the training of large language models through innovative quantization and data parallel methods. It offers flexibility and efficiency in handling large models with an array of configuration options tailored to different hardware capacities and precision requirements.