DPO: Direct Preference Optimization
The DPO project accompanies the paper "Direct Preference Optimization: Your Language Model Is Secretly a Reward Model" and provides code for training language models from preference data. Beyond the original DPO algorithm, the repository also implements the 'conservative' DPO and IPO variants, broadening the set of preference-learning objectives available.
Understanding the Repository
This repository offers a reference implementation of the DPO algorithm for training language models from preference data. The implementation works with any causal language model from the HuggingFace Transformers library and is straightforward to adapt to new datasets.
Key Components of the DPO Pipeline
The DPO training approach involves two primary stages:
- Supervised Fine-Tuning (SFT): the base model is first fine-tuned on the dataset(s) of interest, so that the preference data used in the next stage is in-distribution for the model.
- Preference Learning: the SFT model is then trained with the DPO objective on preference data, ideally drawn from the same distribution as the SFT data (a sketch of the objective follows this list).
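To make the preference-learning stage concrete, here is a minimal sketch of the DPO objective from the paper: given per-example log-probabilities of the preferred (chosen) and dispreferred (rejected) responses under the policy and a frozen reference (SFT) model, the policy is trained to widen the log-ratio margin in favor of the preferred response. The function and argument names are illustrative, not the repository's exact code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1):
    """Sketch of the DPO objective; each input is log p(response | prompt),
    summed over response tokens, for a batch of preference pairs."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # DPO: push the chosen log-ratio above the rejected one, scaled by beta
    logits = beta * (chosen_logratio - rejected_logratio)
    loss = -F.logsigmoid(logits).mean()

    # Implicit rewards, handy for logging preference accuracy and margins
    chosen_rewards = beta * chosen_logratio.detach()
    rejected_rewards = beta * rejected_logratio.detach()
    return loss, chosen_rewards, rejected_rewards
```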
Core Files in the Repository
- train.py: the main entry point for training, covering both SFT and DPO-based runs.
- trainers.py: the trainer classes that define the training procedure, including multi-GPU support.
- utils.py: utility functions shared across the other files.
- preference_datasets.py: dataset loading and processing for SFT and DPO training; this is the file to modify in order to train on your own data.
Implementation Process
Running Supervised Fine-Tuning (SFT)
Preference learning starts from an SFT model, so begin by running SFT on your dataset(s). For example, to run SFT on the Pythia 6.9B model with the Anthropic-HH data:
python -u train.py model=pythia69 datasets=[hh] loss=sft exp_name=anthropic_dpo_pythia69 gradient_accumulation_steps=2 batch_size=64 eval_batch_size=32 trainer=FSDPTrainer sample_during_eval=false
Running Direct Preference Optimization (DPO)
Building on the SFT checkpoint, initiate DPO by adding:
loss=dpo
loss.beta=DESIRED_BETA
where beta is the temperature of the DPO loss; typical values range from 0.1 to 0.5, and higher values keep the policy closer to the reference model.
For example, to execute DPO on the Pythia 6.9B model, you would use:
python -u train.py model=pythia69 datasets=[hh] loss=dpo loss.beta=0.1 model.archive=/path/to/checkpoint/from/sft/step-XXXX/policy.pt exp_name=anthropic_dpo_pythia69 gradient_accumulation_steps=2 batch_size=32 eval_batch_size=32 trainer=FSDPTrainer sample_during_eval=false
Trainer Options
- BasicTrainer: naively partitions the model across multiple GPUs; this does not meaningfully speed up training, but it frees memory so larger models fit.
- FSDPTrainer: uses PyTorch Fully Sharded Data Parallel (FSDP) for more efficient use of GPU resources, and is generally the better choice in multi-GPU settings (see the sketch after this list).
- TensorParallelTrainer: an experimental option that shards the model across GPUs with PyTorch tensor parallelism.
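For orientation, the sketch below shows the core PyTorch mechanism the FSDPTrainer builds on: wrapping the policy in FullyShardedDataParallel so that parameters, gradients, and optimizer state are sharded across GPUs. It is an illustration of the underlying API under a torchrun launch, not the repository's own trainer code.

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

# Assumes a launch via torchrun, which sets LOCAL_RANK and the rendezvous env vars
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Any HuggingFace causal LM works; pythia-6.9b matches the commands above
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-6.9b")

# FSDP shards parameters, gradients, and optimizer state across the GPUs,
# so each rank only materializes the full weights it currently needs
model = FSDP(model, device_id=local_rank)
```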
Adding New Datasets
Adding a custom dataset requires only modest changes: extend preference_datasets.py with a loader that returns a dictionary mapping each prompt to its candidate responses and to the pairs indicating which response is preferred, as in the sketch below.
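Here is a hypothetical loader illustrating that shape. The function name get_my_dataset and the exact field names are assumptions made for the example; check the existing loaders in preference_datasets.py for the precise keys and signature the training code expects.

```python
def get_my_dataset(split: str, silent: bool = False, cache_dir=None) -> dict:
    """Hypothetical custom loader sketch for preference_datasets.py."""
    data = {
        # Each key is a full prompt; each value describes the responses to it
        "\n\nHuman: How do I brew good coffee?\n\nAssistant:": {
            # Candidate completions for this prompt
            "responses": [
                " Use freshly ground beans and water just off the boil.",
                " Instant coffee from the microwave tastes the same.",
            ],
            # (preferred_index, dispreferred_index) pairs into `responses`
            "pairs": [(0, 1)],
            # Completion to use as the target during SFT
            "sft_target": " Use freshly ground beans and water just off the boil.",
        },
    }
    return data
```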
Tips for Optimized Training
Leveraging multiple GPUs with FSDP can speed training up significantly. Choose a batch size that keeps the GPUs well utilized, and consider mixed-precision training and activation checkpointing to reduce memory use and improve throughput; a sketch of a mixed-precision policy follows.
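As one concrete example of such a strategy, PyTorch FSDP accepts a MixedPrecision policy; a bfloat16 configuration like the one sketched below is a common choice on recent GPUs. The repository exposes its own configuration flags for this, so treat the snippet as an illustration of the underlying PyTorch mechanism rather than of the repo's interface.

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

# Keep parameters, gradient reduction, and buffers in bfloat16 to cut memory use
bf16_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)

# `model` and `local_rank` as in the FSDP sketch above
# model = FSDP(model, device_id=local_rank, mixed_precision=bf16_policy)
```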
Citing DPO
The repository provides a BibTeX entry for citing the DPO paper and codebase in academic work, ensuring proper acknowledgment of the methodology and its contributions.
For researchers and developers aiming to incorporate preference optimization in language model training, the DPO repository offers robust tools and documented pathways to tailor the training process to diverse datasets and computational environments.