DPO: Direct Preference Optimization
The DPO project accompanies the paper "Direct Preference Optimization: Your Language Model Is Secretly a Reward Model" and provides code for training language models from preference data. Beyond the original DPO algorithm, the repository also implements the 'conservative' DPO and IPO variants, broadening the set of preference-learning objectives available.
Understanding the Repository
This repository offers a reference implementation of the DPO algorithm for training language models from preference data. The implementation works with any causal language model from the HuggingFace Transformers library and is straightforward to adapt to new datasets.
Key Components of the DPO Pipeline
The DPO training approach involves two primary stages:
- Supervised Fine-Tuning (SFT): the base model is first fine-tuned on the dataset(s) of interest, so that the preference data used in the next stage is in-distribution for the model.
- Preference Learning: the SFT model is then trained with the DPO objective on preference data, ideally drawn from the same distribution as the SFT data (a sketch of the objective follows this list).
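To make the preference-learning stage concrete, here is a minimal sketch of the DPO objective from the paper: given per-example log-probabilities of the preferred (chosen) and dispreferred (rejected) responses under the policy and a frozen reference (SFT) model, the policy is trained to widen the log-ratio margin in favor of the preferred response. The function and argument names are illustrative, not the repository's exact code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1):
    """Sketch of the DPO objective; each input is log p(response | prompt),
    summed over response tokens, for a batch of preference pairs."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # DPO: push the chosen log-ratio above the rejected one, scaled by beta
    logits = beta * (chosen_logratio - rejected_logratio)
    loss = -F.logsigmoid(logits).mean()

    # Implicit rewards, handy for logging preference accuracy and margins
    chosen_rewards = beta * chosen_logratio.detach()
    rejected_rewards = beta * rejected_logratio.detach()
    return loss, chosen_rewards, rejected_rewards
```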
Core Files in the Repository
- train.py: the main entry point for training, covering both SFT and DPO-based runs.
- trainers.py: the trainer classes that define the training procedure, including multi-GPU support.
- utils.py: utility functions shared across the other files.
- preference_datasets.py: dataset loading and processing for SFT and DPO training; this is the file to modify in order to train on your own data.
Implementation Process
Running Supervised Fine-Tuning (SFT)
Preference learning starts from an SFT model, so begin by running SFT on your dataset(s). For example, to run SFT on the Pythia 6.9B model with the Anthropic-HH data:
python -u train.py model=pythia69 datasets=[hh] loss=sft exp_name=anthropic_dpo_pythia69 gradient_accumulation_steps=2 batch_size=64 eval_batch_size=32 trainer=FSDPTrainer sample_during_eval=false
Running Direct Preference Optimization (DPO)
Building on the SFT checkpoint, initiate DPO by adding:
loss=dpo
loss.beta=DESIRED_BETA
where beta is the temperature of the DPO loss; typical values range from 0.1 to 0.5, and higher values keep the policy closer to the reference model.
For example, to execute DPO on the Pythia 6.9B model, you would use:
python -u train.py model=pythia69 datasets=[hh] loss=dpo loss.beta=0.1 model.archive=/path/to/checkpoint/from/sft/step-XXXX/policy.pt exp_name=anthropic_dpo_pythia69 gradient_accumulation_steps=2 batch_size=32 eval_batch_size=32 trainer=FSDPTrainer sample_during_eval=false
Trainer Options
- BasicTrainer: naively partitions the model across multiple GPUs; this does not meaningfully speed up training, but it frees memory so larger models fit.
- FSDPTrainer: uses PyTorch Fully Sharded Data Parallel (FSDP) for more efficient use of GPU resources, and is generally the better choice in multi-GPU settings (see the sketch after this list).
- TensorParallelTrainer: an experimental option that shards the model across GPUs with PyTorch tensor parallelism.
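For orientation, the sketch below shows the core PyTorch mechanism the FSDPTrainer builds on: wrapping the policy in FullyShardedDataParallel so that parameters, gradients, and optimizer state are sharded across GPUs. It is an illustration of the underlying API under a torchrun launch, not the repository's own trainer code.

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

# Assumes a launch via torchrun, which sets LOCAL_RANK and the rendezvous env vars
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Any HuggingFace causal LM works; pythia-6.9b matches the commands above
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-6.9b")

# FSDP shards parameters, gradients, and optimizer state across the GPUs,
# so each rank only materializes the full weights it currently needs
model = FSDP(model, device_id=local_rank)
```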
Adding New Datasets
Adding a custom dataset requires only modest changes: extend preference_datasets.py with a loader that returns a dictionary mapping each prompt to its candidate responses and to the pairs indicating which response is preferred, as in the sketch below.
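Here is a hypothetical loader illustrating that shape. The function name get_my_dataset and the exact field names are assumptions made for the example; check the existing loaders in preference_datasets.py for the precise keys and signature the training code expects.

```python
def get_my_dataset(split: str, silent: bool = False, cache_dir=None) -> dict:
    """Hypothetical custom loader sketch for preference_datasets.py."""
    data = {
        # Each key is a full prompt; each value describes the responses to it
        "\n\nHuman: How do I brew good coffee?\n\nAssistant:": {
            # Candidate completions for this prompt
            "responses": [
                " Use freshly ground beans and water just off the boil.",
                " Instant coffee from the microwave tastes the same.",
            ],
            # (preferred_index, dispreferred_index) pairs into `responses`
            "pairs": [(0, 1)],
            # Completion to use as the target during SFT
            "sft_target": " Use freshly ground beans and water just off the boil.",
        },
    }
    return data
```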
Tips for Optimized Training
Leveraging multiple GPUs with FSDP can speed training up significantly. Choose a batch size that keeps the GPUs well utilized, and consider mixed-precision training and activation checkpointing to reduce memory use and improve throughput; a sketch of a mixed-precision policy follows.
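As one concrete example of such a strategy, PyTorch FSDP accepts a MixedPrecision policy; a bfloat16 configuration like the one sketched below is a common choice on recent GPUs. The repository exposes its own configuration flags for this, so treat the snippet as an illustration of the underlying PyTorch mechanism rather than of the repo's interface.

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

# Keep parameters, gradient reduction, and buffers in bfloat16 to cut memory use
bf16_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)

# `model` and `local_rank` as in the FSDP sketch above
# model = FSDP(model, device_id=local_rank, mixed_precision=bf16_policy)
```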
Citing DPO
The repository provides a BibTeX entry for citing the DPO paper and codebase in academic work, ensuring proper acknowledgment of the methodology and its contributions.
For researchers and developers aiming to incorporate preference optimization in language model training, the DPO repository offers robust tools and documented pathways to tailor the training process to diverse datasets and computational environments.