Pretraining Language Models with Human Preferences
The project, "Pretraining Language Models with Human Preferences," focuses on enhancing the pretraining of language models using human feedback. The project is built upon Hugging Face Transformers' Trainer
and incorporates various pretraining objectives that incorporate human preferences. This approach aims to ensure that language models generate text that aligns better with human expectations and values.
Overview of the Approach
Language models learn semantic and syntactic patterns from large text corpora, but the text they learn to imitate does not always reflect human preferences, such as avoiding offensive content. The project addresses this by using human feedback signals to guide pretraining itself, rather than relying only on fine-tuning after the fact.
Core Components
Human Feedback Objectives
The project implements five distinct objectives for pretraining with human feedback (PHF). Each objective relies on a reward signal that scores segments of training text according to a human preference. Scoring is handled by subclasses of apo.scorers.Scorer, which quantify how well a piece of text aligns with the preference in question.
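The repository defines the concrete scorer classes; as a rough illustration of the pattern, a minimal scorer interface might look like the sketch below (apart from the Scorer name itself, the class and method names here are assumptions, not the project's actual API).

```python
# Illustrative sketch only -- the real apo.scorers.Scorer API may differ.
from abc import ABC, abstractmethod
from typing import List


class Scorer(ABC):
    """Maps text segments to scalar rewards reflecting a human preference."""

    @abstractmethod
    def score_texts(self, texts: List[str]) -> List[float]:
        """Return one reward per input text (higher = more preferred)."""
        ...


class KeywordPenaltyScorer(Scorer):
    """Toy example: penalize segments containing blocklisted words."""

    def __init__(self, blocklist: List[str]):
        self.blocklist = [w.lower() for w in blocklist]

    def score_texts(self, texts: List[str]) -> List[float]:
        return [
            -1.0 if any(w in t.lower() for w in self.blocklist) else 0.0
            for t in texts
        ]
```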
Codebase and Structure
The codebase relies heavily on the Hugging Face ecosystem and uses wandb for experiment monitoring. It is designed modularly, with components responsible for scoring text, managing datasets, and implementing the objective-specific loss functions.
Training and Evaluation
Training scripts are configured through YAML files that specify a task (the data and its scorer) and a method (the pretraining objective), making experiments straightforward to reproduce. For instance, a run on the toxicity task is launched by pointing the training script at the corresponding configuration files, and standard training parameters can be overridden directly from the command line, adding flexibility to experiment setups.
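A hypothetical invocation might look like the following; the script name, configuration paths, and override syntax are assumptions and may differ from the repository's actual command-line interface.

```bash
# Hypothetical invocation -- the script name, config paths, and override
# syntax are illustrative and may not match the repository's exact CLI.
python train.py \
  --task configs/toxicity/pretrain.yml \
  --method configs/toxicity/conditional.yml

# Overriding a standard training parameter from the command line
# (assumed syntax; consult the repository for the real convention):
python train.py \
  --task configs/toxicity/pretrain.yml \
  --method configs/toxicity/conditional.yml \
  training.per_device_train_batch_size=16
```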
Tasks Addressed
- Toxicity: estimates how likely a piece of text is to be offensive, using the DetoxifyToxicityScorer.
- Personally Identifiable Information (PII): measures how much personal data a piece of text contains, using the PIIScorer.
- Code Standards (PEP8): checks Python code for adherence to the PEP 8 style guide, using the PEP8Scorer.
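As an illustration of how such scoring can work in practice, the following toy wrapper uses the Detoxify classifier to assign a toxicity probability to each text. It is a simplified stand-in, not the project's DetoxifyToxicityScorer implementation.

```python
# Illustrative wrapper around the Detoxify classifier; the project's actual
# DetoxifyToxicityScorer may differ in interface and details.
from detoxify import Detoxify


class ToyToxicityScorer:
    def __init__(self, model_type: str = "original"):
        self.model = Detoxify(model_type)

    def score_texts(self, texts):
        """Return a toxicity probability in [0, 1] for each text
        (lower is better, i.e. less toxic)."""
        scores = self.model.predict(texts)  # dict of per-label score lists
        return scores["toxicity"]


if __name__ == "__main__":
    scorer = ToyToxicityScorer()
    print(scorer.score_texts(["Have a nice day!", "You are awful."]))
```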
Pretraining Objectives
The project explores several objectives:
- Maximum Likelihood Estimation (MLE): A basic form of pretraining using standard cross-entropy loss.
- Filtering, Conditional Training, Unlikelihood, Reward-Weighted Regression (RWR), and Advantage-Weighted Regression (AWR): these PHF objectives change what the model learns from each training segment based on its reward, for example by dropping low-reward segments (filtering), prepending reward-dependent control tokens (conditional training, sketched below), or penalizing the likelihood of undesirable tokens (unlikelihood).
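To make this concrete, here is a simplified sketch of the data-preparation step behind conditional training, where each training segment is prefixed with a control token according to its reward. The token strings and the threshold are illustrative assumptions rather than the project's exact configuration.

```python
# Simplified illustration of conditional training data preparation:
# each segment is prefixed with a control token indicating whether its
# reward (e.g. negative toxicity score) exceeds a threshold. The token
# strings and the threshold here are illustrative assumptions.
GOOD_TOKEN = "<|good|>"
BAD_TOKEN = "<|bad|>"
THRESHOLD = -0.1  # segments with reward above this count as "good"


def annotate_segments(segments, rewards, threshold=THRESHOLD):
    """Prepend a control token to each segment based on its reward."""
    annotated = []
    for text, reward in zip(segments, rewards):
        token = GOOD_TOKEN if reward >= threshold else BAD_TOKEN
        annotated.append(f"{token}{text}")
    return annotated


if __name__ == "__main__":
    segments = ["def add(a, b):\n    return a + b\n", "x=1;y=2 ;print( x+y)\n"]
    rewards = [0.0, -1.0]  # e.g. negative counts of PEP8 violations
    print(annotate_segments(segments, rewards))
```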
Pretrained Models
Pretrained checkpoints for each combination of objective and task are hosted on the Hugging Face Hub, covering objectives such as MLE, Filtering, and Unlikelihood across the toxicity, PII, and PEP8 tasks.
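Such checkpoints can be loaded with the standard transformers API. The model identifier below is a placeholder rather than a real Hub id from the project, and the control-token prompt assumes a conditional-training checkpoint.

```python
# Loading a PHF-pretrained checkpoint from the Hugging Face Hub.
# The model id below is a placeholder -- substitute a real id from the
# project's Hub page.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "organization/phf-conditional-toxicity"  # placeholder id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# For a conditional-training checkpoint, generation would typically be
# prefixed with the "good" control token (assumed token string) so the
# model is steered toward preferred text.
prompt = "<|good|>The weather today is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```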
Metrics and Evaluation
Evaluation involves generating text samples from the trained model and scoring them with the task's scorer, yielding misalignment metrics such as the average score of generated text. The KL divergence of the model's output distribution from a general-purpose reference model (GPT-3) is also estimated, as a proxy for how far preference training pushes the model away from the distribution of natural language, i.e. for its remaining fluency and diversity.
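The KL term can be estimated by Monte Carlo: sample texts from the trained model and average the log-likelihood ratio between the trained model and the reference. The sketch below illustrates this with two small public models standing in for the trained model and the GPT-3 reference, since querying GPT-3 log-probabilities requires API access.

```python
# Monte Carlo estimate of KL(p_model || p_ref) from samples x ~ p_model:
#   KL ~= mean over samples of [log p_model(x) - log p_ref(x)]
# Two small public models stand in for the trained model and the reference.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def sequence_log_prob(model, tokenizer, text):
    """Sum of token log-probabilities of `text` under `model`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_log_probs = log_probs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_log_probs.sum().item()


def estimate_kl(model, ref_model, tokenizer, samples):
    """Average log-likelihood ratio over samples (drawn from `model`)."""
    diffs = [
        sequence_log_prob(model, tokenizer, s) - sequence_log_prob(ref_model, tokenizer, s)
        for s in samples
    ]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")            # stand-in for the PHF model
    ref_model = AutoModelForCausalLM.from_pretrained("distilgpt2")  # stand-in for the reference
    # In practice the samples would be generated from `model`; a fixed
    # sentence is used here only to keep the demo short.
    samples = ["The quick brown fox jumps over the lazy dog."]
    print(estimate_kl(model, ref_model, tokenizer, samples))
```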
Future Implications
Incorporating preferences at pretraining time has the potential to make language models align with human norms more reliably than post-hoc fine-tuning alone, pushing the technology toward safer and more inclusive real-world deployment.
With its modular codebase and clearly specified objectives, the project offers a concrete starting point for further work on learning from human feedback during pretraining.