LLM-RLHF-Tuning Project Overview
LLM-RLHF-Tuning implements the full three-stage RLHF (Reinforcement Learning from Human Feedback) training pipeline from scratch. Detailed documentation accompanies the code, and practitioners are invited to discuss and explore it. Here is a breakdown of what the project offers:
Key Features
- Alpaca Model Fine-Tuning: The project supports instruction-based fine-tuning for the Alpaca model, allowing users to adapt it to specific requirements.
- Reward Model Training: Users can train reward models, which score model responses and provide the reward signal used during reinforcement learning.
- PPO Algorithm for RL Models: The PPO (Proximal Policy Optimization) algorithm is supported for training reinforcement learning models, in three configurations:
  - Two base models with two LoRA (Low-Rank Adaptation) adapters, loading four models simultaneously: RM (Reward Model), SFT (Supervised Fine-Tuning), Actor, and Critic, with Accelerate distributed training.
  - One base model with two LoRA adapters, trained with Accelerate and DeepSpeed (see the sketch after this list).
  - A shared base model for Actor and Critic, so that all four models (RM, SFT, Actor, Critic) run with Accelerate and DeepSpeed training.
- DPO Algorithm Training: The project also supports training with the DPO (Direct Preference Optimization) algorithm, an alternative to PPO for preference-based fine-tuning.
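The one-base-model PPO configuration can be pictured with peft's multi-adapter support: a single frozen LLaMA backbone carries separate LoRA adapters for the Actor and the Critic. The sketch below is only an illustration of that idea, assuming peft 0.4.0; the model name, adapter names, and LoRA hyperparameters are assumptions, not the project's actual training code.

```python
# Minimal sketch: one shared base model with two LoRA adapters (Actor + Critic).
# Model name, adapter names, and LoRA hyperparameters are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
)

lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)

# Create the first adapter ("actor"), then attach a second one ("critic")
# to the same frozen base weights.
model = get_peft_model(base, lora_cfg, adapter_name="actor")
model.add_adapter("critic", lora_cfg)

# During PPO, switch adapters instead of loading two 7B models:
model.set_adapter("actor")   # policy forward pass (logits for sampled tokens)
model.set_adapter("critic")  # value forward pass (a value head would sit on top)
```

Because the base weights are frozen and shared, this setup keeps the memory footprint near a single 7B model, which is what the PPO capability table below refers to.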
Recent Updates
- As of August 23, 2023, the project now supports training for the LLaMA2 model and DPO training. Additionally, it enables PPO training based on a single base model with one or two LoRA adapters, with support for Accelerate and DeepSpeed.
- On August 13, 2023, support was added for LLaMA model training, including the two-base-model configuration with two LoRA adapters and Accelerate distributed training.
Feature Comparison with Open-Source RLHF Frameworks
The project is compared against other open-source frameworks in terms of several functionalities:
| Framework | SFT Train | RM Train | PPO Train | DPO Train |
|---|---|---|---|---|
| LLM-RLHF-Tuning | ✅ | ✅ | ✅ | ✅ |
| Deepspeed-chat | ✅ | ✅ | ✅ | |
| trl | ✅ | ✅ | ✅ | ✅ |
| MOSS-RLHF | | | ✅ | |
PPO Training Capabilities
Further comparison among frameworks focuses on specific PPO training features:
| Framework | Accelerate | DeepSpeed | Multi LoRA | Minimum Model Size (7B Example) |
|---|---|---|---|---|
| LLM-RLHF-Tuning | ✅ | ✅ | ✅ | Single model size ~ 7B |
| Deepspeed-chat | | ✅ | | sft + rm + actor + critic ~ 28B |
| trl | ✅ | | | Single model size ~ 7B |
| MOSS-RLHF | Actor, Critic models | SFT, RM models | | sft + rm + actor + critic ~ 28B |
Getting Started
Environment Setup
For running the project, the following packages are required:
```
accelerate==0.21.0
datasets==2.13.1
scikit-learn==1.3.0
sentencepiece==0.1.99
tqdm==4.65.0
transformers==4.31.0
wandb==0.15.8
peft==0.4.0
torch==2.0.1
trl==0.5.0
deepspeed==0.10.0
```
Supported Models and Training Methods
- Models: The project supports LLaMA and LLaMA2 models.
- Training: LoRA (Low-Rank Adaptation) training is supported.
Training Details
Fine-Tuning Models
An instructional guide for fine-tuning models is available to assist users in customizing models for specific tasks.
Training Guide for Fine-Tuning Models
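For orientation, the sketch below shows the standard Alpaca-style prompt template often used for instruction fine-tuning; the exact template and field names used by the project may differ, so treat this as an assumption.

```python
# Standard Alpaca-style instruction template (an assumption about the data format,
# not necessarily the project's exact template).
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def build_sft_example(instruction: str, response: str) -> str:
    # The language-modeling loss is typically computed only on the response tokens.
    return ALPACA_TEMPLATE.format(instruction=instruction) + response

print(build_sft_example(
    "Summarize RLHF in one sentence.",
    "RLHF fine-tunes a language model against a reward model learned from human preferences.",
))
```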
Reward Model Training
Users can follow a comprehensive guide to training reward models, essential for reinforcement learning assessments.
Training Guide for Reward Models
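Reward models are typically trained on pairs of responses with a ranking loss that pushes the chosen response's score above the rejected one's. The sketch below shows that standard pairwise (Bradley-Terry) loss; the function and tensor names are illustrative, not taken from the project.

```python
# Pairwise ranking loss commonly used for reward model training:
# -log(sigmoid(r_chosen - r_rejected)), averaged over the batch.
import torch
import torch.nn.functional as F

def pairwise_rm_loss(chosen_rewards: torch.Tensor,
                     rejected_rewards: torch.Tensor) -> torch.Tensor:
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example: scalar rewards for a batch of (chosen, rejected) response pairs.
loss = pairwise_rm_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.9]))
print(loss.item())
```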
PPO Training
Two detailed guides cover PPO training with the different base-model configurations described above (two base models with two LoRA adapters, or a single shared base model). A minimal sketch of the PPO policy objective follows.
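For reference, the Actor update in PPO is driven by the clipped surrogate objective. The sketch below is a generic version of that objective, not the project's exact implementation; the clip range and variable names are assumptions.

```python
# Generic PPO clipped surrogate loss for the Actor (policy) update.
import torch

def ppo_policy_loss(logprobs: torch.Tensor,      # log-probs under the current policy
                    old_logprobs: torch.Tensor,  # log-probs recorded at rollout time
                    advantages: torch.Tensor,
                    clip_range: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    # Maximizing the surrogate objective == minimizing its negation.
    return -torch.min(unclipped, clipped).mean()
```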
DPO Training
A dedicated guide walks users through DPO training. A minimal sketch of the DPO loss follows.
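DPO optimizes the policy directly on preference pairs, contrasting policy and reference-model log-probabilities without a separate reward model or PPO loop. The sketch below shows the standard DPO loss; the beta value and argument names are illustrative assumptions, not the project's code.

```python
# Standard DPO loss over sequence-level log-probabilities of chosen/rejected responses.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)), averaged over the batch.
    return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()
```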
Future Endeavors
Looking forward, the project aims to:
- Enhance PPO training stability (ppo-max).
- Introduce support for DDPO and RRHF.
- Implement RAFT and rejection-sampling RFT.
- Extend support to the BLOOM and Baichuan models, and to QLoRA training.
For further collaboration and discussion, participants are encouraged to join the discussion group on WeChat.