LLM-RLHF-Tuning Project Overview
LLM-RLHF-Tuning implements the full three-stage RLHF (Reinforcement Learning from Human Feedback) training pipeline from scratch. Detailed documentation accompanies the code, and practitioners are invited to discuss and explore it. Here is a breakdown of what the project offers:
Key Features
- Alpaca Model Fine-Tuning: The project supports instruction-based fine-tuning for the Alpaca model, allowing users to adapt it to specific requirements.
- Reward Model Training: Users can train reward models, which score model responses and provide the reward signal used during reinforcement learning.
- PPO Algorithm for RL Models: The PPO (Proximal Policy Optimization) algorithm is supported for training reinforcement learning models, in three configurations:
  - Two base models with two LoRA (Low-Rank Adaptation) adapters, loading four models simultaneously: RM (Reward Model), SFT (Supervised Fine-Tuning), Actor, and Critic, with Accelerate distributed training.
  - One base model with two LoRA adapters, trained with Accelerate and DeepSpeed (see the sketch after this list).
  - A shared base model for Actor and Critic, so that all four models (RM, SFT, Actor, Critic) run with Accelerate and DeepSpeed training.
- DPO Algorithm Training: The project also supports training with the DPO (Direct Preference Optimization) algorithm, an alternative to PPO for preference-based fine-tuning.
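The one-base-model PPO configuration can be pictured with peft's multi-adapter support: a single frozen LLaMA backbone carries separate LoRA adapters for the Actor and the Critic. The sketch below is only an illustration of that idea, assuming peft 0.4.0; the model name, adapter names, and LoRA hyperparameters are assumptions, not the project's actual training code.

```python
# Minimal sketch: one shared base model with two LoRA adapters (Actor + Critic).
# Model name, adapter names, and LoRA hyperparameters are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
)

lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)

# Create the first adapter ("actor"), then attach a second one ("critic")
# to the same frozen base weights.
model = get_peft_model(base, lora_cfg, adapter_name="actor")
model.add_adapter("critic", lora_cfg)

# During PPO, switch adapters instead of loading two 7B models:
model.set_adapter("actor")   # policy forward pass (logits for sampled tokens)
model.set_adapter("critic")  # value forward pass (a value head would sit on top)
```

Because the base weights are frozen and shared, this setup keeps the memory footprint near a single 7B model, which is what the PPO capability table below refers to.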
Recent Updates
- As of August 23, 2023, the project now supports training for the LLaMA2 model and DPO training. Additionally, it enables PPO training based on a single base model with one or two LoRA adapters, with support for Accelerate and DeepSpeed.
- On August 13, 2023, support was added for LLaMA model training, including the two-base-model configuration with two LoRA adapters and Accelerate distributed training.
Feature Comparison with Open-Source RLHF Frameworks
The project is compared against other open-source frameworks in terms of several functionalities:
| Framework | SFT Train | RM Train | PPO Train | DPO Train |
|---|---|---|---|---|
| LLM-RLHF-Tuning | ✅ | ✅ | ✅ | ✅ |
| Deepspeed-chat | ✅ | ✅ | ✅ | |
| trl | ✅ | ✅ | ✅ | ✅ |
| MOSS-RLHF | | | ✅ | |
PPO Training Capabilities
Further comparison among frameworks focuses on specific PPO training features:
| Framework | Accelerate | DeepSpeed | Multi LoRA | Minimum Model Size (7B Example) |
|---|---|---|---|---|
| LLM-RLHF-Tuning | ✅ | ✅ | ✅ | Single model size ~ 7B |
| Deepspeed-chat | | ✅ | | sft + rm + actor + critic ~ 28B |
| trl | ✅ | | | Single model size ~ 7B |
| MOSS-RLHF | Actor, Critic models | SFT, RM models | | sft + rm + actor + critic ~ 28B |
Getting Started
Environment Setup
For running the project, the following packages are required:
```
accelerate==0.21.0
datasets==2.13.1
scikit-learn==1.3.0
sentencepiece==0.1.99
tqdm==4.65.0
transformers==4.31.0
wandb==0.15.8
peft==0.4.0
torch==2.0.1
trl==0.5.0
deepspeed==0.10.0
```
Supported Models and Training Methods
- Models: The project supports LLaMA and LLaMA2 models.
- Training: LoRA (Low-Rank Adaptation) training is supported.
Training Details
Fine-Tuning Models
An instructional guide for fine-tuning models is available to assist users in customizing models for specific tasks.
Training Guide for Fine-Tuning Models
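For orientation, the sketch below shows the standard Alpaca-style prompt template often used for instruction fine-tuning; the exact template and field names used by the project may differ, so treat this as an assumption.

```python
# Standard Alpaca-style instruction template (an assumption about the data format,
# not necessarily the project's exact template).
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def build_sft_example(instruction: str, response: str) -> str:
    # The language-modeling loss is typically computed only on the response tokens.
    return ALPACA_TEMPLATE.format(instruction=instruction) + response

print(build_sft_example(
    "Summarize RLHF in one sentence.",
    "RLHF fine-tunes a language model against a reward model learned from human preferences.",
))
```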
Reward Model Training
Users can follow a comprehensive guide to training reward models, essential for reinforcement learning assessments.
Training Guide for Reward Models
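Reward models are typically trained on pairs of responses with a ranking loss that pushes the chosen response's score above the rejected one's. The sketch below shows that standard pairwise (Bradley-Terry) loss; the function and tensor names are illustrative, not taken from the project.

```python
# Pairwise ranking loss commonly used for reward model training:
# -log(sigmoid(r_chosen - r_rejected)), averaged over the batch.
import torch
import torch.nn.functional as F

def pairwise_rm_loss(chosen_rewards: torch.Tensor,
                     rejected_rewards: torch.Tensor) -> torch.Tensor:
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example: scalar rewards for a batch of (chosen, rejected) response pairs.
loss = pairwise_rm_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.9]))
print(loss.item())
```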
PPO Training
Two detailed guides cover PPO training with the different base-model configurations described above (two base models with two LoRA adapters, or a single shared base model). A minimal sketch of the PPO policy objective follows.
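For reference, the Actor update in PPO is driven by the clipped surrogate objective. The sketch below is a generic version of that objective, not the project's exact implementation; the clip range and variable names are assumptions.

```python
# Generic PPO clipped surrogate loss for the Actor (policy) update.
import torch

def ppo_policy_loss(logprobs: torch.Tensor,      # log-probs under the current policy
                    old_logprobs: torch.Tensor,  # log-probs recorded at rollout time
                    advantages: torch.Tensor,
                    clip_range: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    # Maximizing the surrogate objective == minimizing its negation.
    return -torch.min(unclipped, clipped).mean()
```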
DPO Training
A dedicated guide walks users through DPO training. A minimal sketch of the DPO loss follows.
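DPO optimizes the policy directly on preference pairs, contrasting policy and reference-model log-probabilities without a separate reward model or PPO loop. The sketch below shows the standard DPO loss; the beta value and argument names are illustrative assumptions, not the project's code.

```python
# Standard DPO loss over sequence-level log-probabilities of chosen/rejected responses.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)), averaged over the batch.
    return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()
```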
Future Endeavors
Looking forward, the project aims to:
- Enhance PPO training stability (ppo-max).
- Introduce support for DDPO and RRHF.
- Implement RAFT and rejection-sampling RFT.
- Extend support to the BLOOM and Baichuan models, and to QLoRA training.
For further collaboration and discussion, participants are encouraged to join the discussion group on WeChat.