Introduction to ReLoRA: Enhancing Model Training with Low-Rank Updates
ReLoRA, short for "High-Rank Training Through Low-Rank Updates," is a machine-learning project aimed at making model training more efficient. It builds on ideas from Parameter-Efficient Fine-Tuning (PEFT), specifically Low-Rank Adaptation (LoRA): instead of updating the full weight matrices directly, training proceeds through a sequence of low-rank updates that are periodically merged back into the main network.
Setup and Requirements
To get started with ReLoRA, you need Python 3.10 or newer and PyTorch 2.0 or newer; these versions are required by the type-annotation style used in the code and by the flash-attention features. All required packages are listed in a requirements.txt file, which the project keeps up to date. Install the dependencies with the following commands:
pip install -e .
pip install flash-attn
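To confirm that an environment meets these requirements, a quick check along the following lines can help (an informal sketch, not a script shipped with the repository):

import sys
import torch

# Python 3.10+ is needed for the type-annotation style used in the code.
assert sys.version_info >= (3, 10), "Python 3.10+ required"
# PyTorch 2.0+ is needed, among other things, for flash-attention support.
assert int(torch.__version__.split(".")[0]) >= 2, "PyTorch 2.0+ required"
print("CUDA available:", torch.cuda.is_available())

try:
    import flash_attn  # installed separately via `pip install flash-attn`
    print("flash-attn is installed")
except ImportError:
    print("flash-attn is not installed")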
Training with ReLoRA
Basic Training Script
ReLoRA is trained with a recommended learning rate roughly twice the one used in standard training of the same model; for larger models this may need further adjustment. ReLoRA also allows larger microbatch sizes than regular training (the low-rank updates need less optimizer and gradient memory), depending on the available GPU memory. The number of training steps in the script is calculated so that the run starts from a warmed-up checkpoint and the reset points line up with the schedule.
An example command for training a 1B-parameter model with ReLoRA:
torchrun --nproc-per-node 8 --nnodes 1 torchrun_main.py --training_config training_configs/1B_v1.0.yaml
Pre-Processing Data
Data pre-processing prepares the dataset before training: the raw text is tokenized with a chosen tokenizer and cut to a fixed sequence length. ReLoRA requires this step before training starts:
python pretokenize.py \
--save_dir preprocessed_data \
--tokenizer t5-base \
--dataset c4 \
--dataset_config en \
--text_field text \
--sequence_length 512
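Conceptually, pre-tokenization boils down to mapping a tokenizer over the text field and saving the result to disk. The snippet below is a minimal sketch of that idea using the Hugging Face datasets and transformers libraries, with a tiny in-memory stand-in for C4; it is not the project's pretokenize.py:

from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")

# Tiny in-memory stand-in for a real corpus such as C4.
raw = Dataset.from_dict({"text": [
    "ReLoRA trains networks through a sequence of low-rank updates.",
    "Pre-tokenization fixes the sequence length ahead of training.",
]})

def tokenize(batch):
    # Truncate/pad every example to the target sequence length (512 in the command above).
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])
tokenized.save_to_disk("preprocessed_data")  # later passed to training via --dataset_path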
Warm-Up and PEFT Training
Before switching to ReLoRA, perform an initial warm-up phase with regular full-rank training to prepare the network. Once this phase is complete, training continues with the ReLoRA method, which uses low-rank (LoRA) updates, a roughly 2x higher learning rate, and periodic merge-and-reset cycles:
Warm-Up Example:
torchrun --nproc-per-node <N_GPUS> torchrun_main.py \
--model_config configs/llama_250m.json \
--dataset_path <preprocessed data path> \
--batch_size 24 \
--total_batch_size 1152 \
--lr 5e-4 \
--max_length 512 \
--save_every 1000 \
--eval_every 1000 \
--num_training_steps 20000 \
--tags warm_start_250M
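As a sanity check on the batch-size arguments, note the common convention that the total (effective) batch size equals the per-device batch size times the number of GPUs times the number of gradient-accumulation steps. Assuming the script follows this convention, the numbers above work out as follows:

# Hedged sketch of the effective-batch-size arithmetic (assumed convention, not project code).
per_device_batch = 24      # --batch_size
total_batch = 1152         # --total_batch_size
n_gpus = 8                 # --nproc-per-node (8 GPUs, as in the 1B example above)

accumulation_steps = total_batch // (per_device_batch * n_gpus)
print(accumulation_steps)  # 6 gradient-accumulation steps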
PEFT with ReLoRA:
torchrun --nproc-per-node <N_GPUS> torchrun_main.py \
--model_config configs/llama_250m.json \
--batch_size 24 \
--total_batch_size 1152 \
--lr 1e-3 \
--max_length 512 \
--use_peft \
--relora 5000 \
--cycle_length 5000 \
--restart_warmup_steps 100 \
--scheduler cosine_restarts \
--warmup_steps 500 \
--reset_optimizer_on_relora True \
--num_training_steps 20000 \
--save_every 5000 \
--eval_every 5000 \
--warmed_up_model <checkpoint path> \
--tags relora_250M
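The --scheduler cosine_restarts, --warmup_steps, --cycle_length, and --restart_warmup_steps options describe a cosine learning-rate schedule that is briefly re-warmed after each ReLoRA reset. The sketch below only illustrates the general shape of such a schedule; the repository's actual scheduler may differ in details such as the minimum learning rate:

import math

def lr_at(step, base_lr=1e-3, warmup=500, restart_warmup=100, cycle=5000):
    # Illustrative cosine-with-restarts schedule (not the repository's exact scheduler).
    if step < warmup:
        return base_lr * step / warmup                     # initial linear warmup
    pos = (step - warmup) % cycle                          # position inside the current cycle
    if (step - warmup) >= cycle and pos < restart_warmup:
        return base_lr * pos / restart_warmup              # short re-warmup after each reset
    return 0.5 * base_lr * (1 + math.cos(math.pi * pos / cycle))  # cosine decay within a cycle

for step in (0, 500, 5499, 5500, 5600, 10499):
    print(step, round(lr_at(step), 6))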
ReLoRA Mechanics
ReLoRA periodically merges the trained LoRA parameters back into the main network and then resets them, which lets the accumulated update reach a higher rank than a single LoRA adapter. In practice this means the optimizer state, the learning-rate schedule, and the reset frequency all need to be managed carefully; the training script exposes parameters for each of these (for example --reset_optimizer_on_relora, --scheduler cosine_restarts, and --cycle_length).
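Conceptually, each reset folds the current low-rank factors into the frozen weight and then reinitializes them so that the next cycle can learn a new low-rank direction. A minimal PyTorch sketch of that step could look like this (names, shapes, and the scaling factor are illustrative, not the project's actual code):

import math
import torch

@torch.no_grad()
def merge_and_reinit(weight, lora_A, lora_B, scaling=1.0):
    # Fold the accumulated low-rank update into the main weight: W <- W + s * B @ A
    weight += scaling * (lora_B @ lora_A)
    # Reinitialize the factors; with B = 0 the merged network's function is unchanged,
    # but subsequent gradient steps explore a fresh low-rank subspace.
    torch.nn.init.kaiming_uniform_(lora_A, a=math.sqrt(5))
    torch.nn.init.zeros_(lora_B)

# Example shapes: weight is (out, in), lora_A is (r, in), lora_B is (out, r).
W = torch.zeros(64, 64)
A, B = torch.randn(8, 64), torch.randn(64, 8)
merge_and_reinit(W, A, B)

At the same boundary the training script also resets the optimizer state, which is what the --reset_optimizer_on_relora flag in the command above controls.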
Distributed Training
The project facilitates single-node distributed training using PyTorch DDP, providing users with efficiency and ease in multi-GPU setups. For effective utilization, users should ensure all relevant options are properly set up, such as --nproc-per-node
to define the number of GPUs.
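For reference, the single-node DDP pattern that torchrun expects looks roughly like this (a generic PyTorch sketch, not the project's torchrun_main.py):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets LOCAL_RANK (and related variables) for each of the --nproc-per-node processes.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(512, 512).cuda(local_rank)   # placeholder model
model = DDP(model, device_ids=[local_rank])
# ... training loop: each process sees its own shard of the data ...
dist.destroy_process_group()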
Conclusion
ReLoRA represents a significant advancement in model training by leveraging low-rank updates to improve training efficiency. By providing a structured approach with comprehensive scripts and guidelines, ReLoRA enhances the model training process, creating new possibilities for machine learning practitioners.