Introduction to ReLoRA: Enhancing Model Training with Low-Rank Updates
ReLoRA, short for "High-Rank Training Through Low-Rank Updates," is a machine-learning project aimed at making model training more efficient. It builds on ideas from Parameter-Efficient Fine-Tuning (PEFT), specifically Low-Rank Adaptation (LoRA): instead of updating the full weight matrices directly, training proceeds through a sequence of low-rank updates that are periodically merged back into the main network.
Setup and Requirements
To get started with ReLoRA, you need Python 3.10 or newer and PyTorch 2.0 or newer; these versions are required by the type-annotation style used in the code and by the flash-attention features. All required packages are listed in a requirements.txt file, which the project keeps up to date. Install the dependencies with the following commands:
pip install -e .
pip install flash-attn
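To confirm that an environment meets these requirements, a quick check along the following lines can help (an informal sketch, not a script shipped with the repository):

import sys
import torch

# Python 3.10+ is needed for the type-annotation style used in the code.
assert sys.version_info >= (3, 10), "Python 3.10+ required"
# PyTorch 2.0+ is needed, among other things, for flash-attention support.
assert int(torch.__version__.split(".")[0]) >= 2, "PyTorch 2.0+ required"
print("CUDA available:", torch.cuda.is_available())

try:
    import flash_attn  # installed separately via `pip install flash-attn`
    print("flash-attn is installed")
except ImportError:
    print("flash-attn is not installed")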
Training with ReLoRA
Basic Training Script
ReLoRA is trained with a recommended learning rate roughly twice the one used in standard training of the same model; for larger models this may need further adjustment. ReLoRA also allows larger microbatch sizes than regular training (the low-rank updates need less optimizer and gradient memory), depending on the available GPU memory. The number of training steps in the script is calculated so that the run starts from a warmed-up checkpoint and the reset points line up with the schedule.
An example command for training a 1B-parameter model with ReLoRA:
torchrun --nproc-per-node 8 --nnodes 1 torchrun_main.py --training_config training_configs/1B_v1.0.yaml
Pre-Processing Data
Data pre-processing prepares the dataset before training: the raw text is tokenized with a chosen tokenizer and cut to a fixed sequence length. ReLoRA requires this step before training starts:
python pretokenize.py \
--save_dir preprocessed_data \
--tokenizer t5-base \
--dataset c4 \
--dataset_config en \
--text_field text \
--sequence_length 512
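Conceptually, pre-tokenization boils down to mapping a tokenizer over the text field and saving the result to disk. The snippet below is a minimal sketch of that idea using the Hugging Face datasets and transformers libraries, with a tiny in-memory stand-in for C4; it is not the project's pretokenize.py:

from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")

# Tiny in-memory stand-in for a real corpus such as C4.
raw = Dataset.from_dict({"text": [
    "ReLoRA trains networks through a sequence of low-rank updates.",
    "Pre-tokenization fixes the sequence length ahead of training.",
]})

def tokenize(batch):
    # Truncate/pad every example to the target sequence length (512 in the command above).
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])
tokenized.save_to_disk("preprocessed_data")  # later passed to training via --dataset_path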
Warm-Up and PEFT Training
Before switching to ReLoRA, perform an initial warm-up phase with regular full-rank training to prepare the network. Once this phase is complete, training continues with the ReLoRA method, which uses low-rank (LoRA) updates, a roughly 2x higher learning rate, and periodic merge-and-reset cycles:
Warm-Up Example:
torchrun --nproc-per-node <N_GPUS> torchrun_main.py \
--model_config configs/llama_250m.json \
--dataset_path <preprocessed data path> \
--batch_size 24 \
--total_batch_size 1152 \
--lr 5e-4 \
--max_length 512 \
--save_every 1000 \
--eval_every 1000 \
--num_training_steps 20000 \
--tags warm_start_250M
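As a sanity check on the batch-size arguments, note the common convention that the total (effective) batch size equals the per-device batch size times the number of GPUs times the number of gradient-accumulation steps. Assuming the script follows this convention, the numbers above work out as follows:

# Hedged sketch of the effective-batch-size arithmetic (assumed convention, not project code).
per_device_batch = 24      # --batch_size
total_batch = 1152         # --total_batch_size
n_gpus = 8                 # --nproc-per-node (8 GPUs, as in the 1B example above)

accumulation_steps = total_batch // (per_device_batch * n_gpus)
print(accumulation_steps)  # 6 gradient-accumulation steps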
PEFT with ReLoRA:
torchrun --nproc-per-node <N_GPUS> torchrun_main.py \
--model_config configs/llama_250m.json \
--batch_size 24 \
--total_batch_size 1152 \
--lr 1e-3 \
--max_length 512 \
--use_peft \
--relora 5000 \
--cycle_length 5000 \
--restart_warmup_steps 100 \
--scheduler cosine_restarts \
--warmup_steps 500 \
--reset_optimizer_on_relora True \
--num_training_steps 20000 \
--save_every 5000 \
--eval_every 5000 \
--warmed_up_model <checkpoint path> \
--tags relora_250M
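The --scheduler cosine_restarts, --warmup_steps, --cycle_length, and --restart_warmup_steps options describe a cosine learning-rate schedule that is briefly re-warmed after each ReLoRA reset. The sketch below only illustrates the general shape of such a schedule; the repository's actual scheduler may differ in details such as the minimum learning rate:

import math

def lr_at(step, base_lr=1e-3, warmup=500, restart_warmup=100, cycle=5000):
    # Illustrative cosine-with-restarts schedule (not the repository's exact scheduler).
    if step < warmup:
        return base_lr * step / warmup                     # initial linear warmup
    pos = (step - warmup) % cycle                          # position inside the current cycle
    if (step - warmup) >= cycle and pos < restart_warmup:
        return base_lr * pos / restart_warmup              # short re-warmup after each reset
    return 0.5 * base_lr * (1 + math.cos(math.pi * pos / cycle))  # cosine decay within a cycle

for step in (0, 500, 5499, 5500, 5600, 10499):
    print(step, round(lr_at(step), 6))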
ReLoRA Mechanics
ReLoRA periodically merges the trained LoRA parameters back into the main network and then resets them, which lets the accumulated update reach a higher rank than a single LoRA adapter. In practice this means the optimizer state, the learning-rate schedule, and the reset frequency all need to be managed carefully; the training script exposes parameters for each of these (for example --reset_optimizer_on_relora, --scheduler cosine_restarts, and --cycle_length).
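Conceptually, each reset folds the current low-rank factors into the frozen weight and then reinitializes them so that the next cycle can learn a new low-rank direction. A minimal PyTorch sketch of that step could look like this (names, shapes, and the scaling factor are illustrative, not the project's actual code):

import math
import torch

@torch.no_grad()
def merge_and_reinit(weight, lora_A, lora_B, scaling=1.0):
    # Fold the accumulated low-rank update into the main weight: W <- W + s * B @ A
    weight += scaling * (lora_B @ lora_A)
    # Reinitialize the factors; with B = 0 the merged network's function is unchanged,
    # but subsequent gradient steps explore a fresh low-rank subspace.
    torch.nn.init.kaiming_uniform_(lora_A, a=math.sqrt(5))
    torch.nn.init.zeros_(lora_B)

# Example shapes: weight is (out, in), lora_A is (r, in), lora_B is (out, r).
W = torch.zeros(64, 64)
A, B = torch.randn(8, 64), torch.randn(64, 8)
merge_and_reinit(W, A, B)

At the same boundary the training script also resets the optimizer state, which is what the --reset_optimizer_on_relora flag in the command above controls.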
Distributed Training
The project facilitates single-node distributed training using PyTorch DDP, providing users with efficiency and ease in multi-GPU setups. For effective utilization, users should ensure all relevant options are properly set up, such as --nproc-per-node
to define the number of GPUs.
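For reference, the single-node DDP pattern that torchrun expects looks roughly like this (a generic PyTorch sketch, not the project's torchrun_main.py):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets LOCAL_RANK (and related variables) for each of the --nproc-per-node processes.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(512, 512).cuda(local_rank)   # placeholder model
model = DDP(model, device_ids=[local_rank])
# ... training loop: each process sees its own shard of the data ...
dist.destroy_process_group()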
Conclusion
ReLoRA represents a significant advancement in model training by leveraging low-rank updates to improve training efficiency. By providing a structured approach with comprehensive scripts and guidelines, ReLoRA enhances the model training process, creating new possibilities for machine learning practitioners.