LLaMA-TRL: Fine-tuning Language Models with Advanced Techniques
LLaMA-TRL is a project for fine-tuning large language models with Proximal Policy Optimization (PPO) and Low-Rank Adaptation (LoRA). It builds on Transformer Reinforcement Learning (TRL) and Parameter-Efficient Fine-Tuning (PEFT) to produce models that are adept at following complex instructions. Here's a closer look at what LLaMA-TRL offers and how to use it:
Project Overview
- PPO with TRL: LLaMA-TRL uses PPO, implemented via the TRL library, to optimize the language model with reinforcement learning so that its generations score higher against a learned reward model.
- LoRA with PEFT: Instead of updating every weight, low-rank adapters are trained on top of the frozen base model, sharply reducing the number of trainable parameters (see the sketch after this list).
- Data Collection: The project uses instruction-following data from the GPT-4-LLM repository, exposing the model to a broad range of tasks during training.
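To make the LoRA idea concrete, here is a minimal sketch of attaching low-rank adapters with PEFT; the rank, scaling factor, and target modules below are illustrative assumptions rather than the project's exact settings:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the base model (the same checkpoint used in the commands below).
base_model = AutoModelForCausalLM.from_pretrained('decapoda-research/llama-7b-hf')

# Illustrative LoRA hyperparameters; the project's actual values may differ.
lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor applied to the adapter output
    target_modules=['q_proj', 'v_proj'],  # LLaMA attention projections to adapt
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM',
)

# Wrap the base model so that only the adapter weights receive gradients.
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()

Because only the small adapter matrices are trained, gradient and optimizer-state memory shrink dramatically compared with full fine-tuning.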
How to Get Started
Setup
To begin using LLaMA-TRL, first install the required dependencies:
pip install -r requirements.txt
Once the dependencies are installed, model development proceeds in three main stages: supervised fine-tuning, training a reward model, and tuning with PPO.
Step 1 - Supervised Fine-tuning
Supervised fine-tuning adapts the base model to the instruction-following data. The following command fine-tunes a base LLaMA model on the GPT-4-generated Alpaca data:
torchrun --nnodes 1 --nproc_per_node 8 supervised_finetuning.py \
--base_model 'decapoda-research/llama-7b-hf' \
--dataset_name './data/alpaca_gpt4_data.json' \
--streaming \
--lr_scheduler_type 'cosine' \
--learning_rate 1e-5 \
--max_steps 4000 \
--output_dir './checkpoints/supervised_llama/'
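The alpaca_gpt4_data.json file follows the Alpaca schema, with instruction, optional input, and output fields. Before tokenization, each record is typically rendered into a single training prompt roughly like the sketch below; the exact template used by supervised_finetuning.py may differ:

# Illustrative Alpaca-style prompt construction (assumed template, not the script's code).
def build_prompt(example):
    if example.get("input"):
        return (
            "Below is an instruction that describes a task, paired with an input that "
            "provides further context. Write a response that appropriately completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['output']}"
    )

The model is then trained with the standard causal language modeling loss over these prompts.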
For full-weight fine-tuning (updating all parameters rather than LoRA adapters), DeepSpeed ZeRO stage-3 offloading keeps memory requirements manageable:
pip install deepspeed
torchrun --nnodes 1 --nproc_per_node 8 supervised_finetuning_full_weight.py \
--base_model 'decapoda-research/llama-7b-hf' \
--dataset_name './data/alpaca_gpt4_data.json' \
--streaming \
--lr_scheduler_type 'cosine' \
--learning_rate 2e-5 \
...
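As a sketch of what stage-3 offloading involves, a ZeRO-3 configuration along the following lines (expressed here as a Python dict, and an assumption rather than the project's actual file) can be passed to the Hugging Face Trainer via its deepspeed argument or saved as a JSON file referenced at launch:

# Illustrative ZeRO stage-3 configuration with CPU offloading (assumed settings).
ds_config = {
    "zero_optimization": {
        "stage": 3,                                                   # partition params, grads, optimizer state
        "offload_param": {"device": "cpu", "pin_memory": True},       # page parameters to CPU memory
        "offload_optimizer": {"device": "cpu", "pin_memory": True},   # page optimizer state to CPU memory
    },
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

The "auto" values let the Hugging Face DeepSpeed integration fill in settings from the training arguments at launch time.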
Step 2 - Training Reward Model
The next step is training a reward model on comparison data. This model scores candidate responses and later provides the reward signal that guides PPO:
torchrun --nnodes 1 --nproc_per_node 8 training_reward_model.py \
--model_name 'decapoda-research/llama-7b-hf' \
--dataset_name './data/comparison_data.json' \
--output_dir './checkpoints/training_reward_model/'
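The comparison_data.json file contains pairs of responses in which one is preferred over the other. Conceptually, the reward model is a language model with a scalar head trained with a pairwise ranking loss, roughly as sketched below; the function and variable names are illustrative, not the script's actual code:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# A single-logit classification head on top of the LLaMA backbone acts as the reward head.
model = AutoModelForSequenceClassification.from_pretrained(
    "decapoda-research/llama-7b-hf", num_labels=1
)
tokenizer = AutoTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
tokenizer.pad_token = tokenizer.eos_token            # LLaMA tokenizers ship without a pad token
model.config.pad_token_id = tokenizer.pad_token_id

def pairwise_loss(chosen_texts, rejected_texts):
    chosen = tokenizer(chosen_texts, padding=True, truncation=True, return_tensors="pt")
    rejected = tokenizer(rejected_texts, padding=True, truncation=True, return_tensors="pt")
    reward_chosen = model(**chosen).logits.squeeze(-1)      # scores for preferred responses
    reward_rejected = model(**rejected).logits.squeeze(-1)  # scores for rejected responses
    # Push the preferred response to score higher than the rejected one.
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

At PPO time, this scalar score serves as the reward for each generated response.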
Step 3 - Tuning LM with PPO
The final step fine-tunes the language model with PPO, iteratively updating the policy to maximize the reward model's scores:
accelerate launch --multi_gpu --num_machines 1 --num_processes 8 \
tuning_lm_with_rl.py \
--log_with wandb \
--model_name <LLAMA_FINETUNED_MODEL> \
--reward_model_name <LLAMA_RM_MODEL> \
...
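Inside tuning_lm_with_rl.py the loop follows the standard TRL PPO pattern: generate responses with the current policy, score them with the reward model, and take a PPO step. The sketch below assumes the classic trl PPOTrainer interface (newer trl releases have changed it); the reward helper, prompt, and hyperparameter values are placeholders:

import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer, create_reference_model

policy_name = "<LLAMA_FINETUNED_MODEL>"    # supervised checkpoint from Step 1
model = AutoModelForCausalLMWithValueHead.from_pretrained(policy_name)
ref_model = create_reference_model(model)  # frozen copy used for the KL penalty
tokenizer = AutoTokenizer.from_pretrained(policy_name)
tokenizer.pad_token = tokenizer.eos_token

config = PPOConfig(model_name=policy_name, learning_rate=1.4e-5,  # assumed values
                   batch_size=1, mini_batch_size=1)
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

def reward_score(response_text):
    # Placeholder: in the real pipeline, the reward model from Step 2 scores the response.
    return torch.tensor(0.0)

prompts = ["Explain reinforcement learning in one sentence."]  # toy prompt for illustration
query_tensors = [tokenizer(p, return_tensors="pt").input_ids.squeeze(0) for p in prompts]

# Sample responses from the current policy, then score them and take one PPO step.
response_tensors = ppo_trainer.generate(query_tensors, return_prompt=False, max_new_tokens=64)
rewards = [reward_score(tokenizer.decode(r)) for r in response_tensors]
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)

Each step nudges the policy toward higher-reward responses, while the KL penalty against the reference model keeps it from drifting too far from the supervised checkpoint.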
Conclusion
LLaMA-TRL combines supervised fine-tuning, reward modeling, and PPO tuning into a single pipeline for building models that handle complex, instruction-based tasks. Whether you're training a new model or improving an existing one, it provides a practical, end-to-end framework.