Introduction to the Safe RLHF Project
The Safe RLHF (Safe Reinforcement Learning from Human Feedback) project, also known as the Beaver framework, is an open-source initiative developed by the PKU-Alignment team at Peking University. The project is designed to advance research on the alignment of large language models (LLMs) using reinforcement learning methods, with a particular focus on safety and constraint compliance during training.
Key Features of Beaver
- Compatibility with Popular Models: Beaver supports Supervised Fine-Tuning (SFT), RLHF, and Safe RLHF training for leading pre-trained models such as LLaMA, OPT, and Baichuan.
- Extensive Dataset: It includes a large, human-labeled dataset of up to one million pairs covering both helpful and harmless preferences, crucial for reproducible RLHF research.
- Comprehensive Model Training Support: Beaver facilitates the training of both Reward Models and Cost Models and provides pre-trained model checkpoints (see the sketch after this list).
- Customization Flexibility: Offers the ability to customize parameters and datasets, enhancing the SFT and RLHF processes.
- Safety Verification Metrics: Utilizes multi-scale metrics such as BIG-bench and GPT-4 Evaluation to verify safety constraints.
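To make the Reward Model / Cost Model pairing concrete, the following is a minimal conceptual sketch (not the project's actual implementation) of a scalar-scoring model: a causal language model backbone with a linear head that assigns one score per response, trained on pairwise preferences. The `gpt2` backbone and the example prompts are stand-ins chosen so the sketch runs on modest hardware.

```python
# Minimal conceptual sketch of a scalar-scoring model of the kind Beaver trains
# as a Reward Model (helpfulness) or Cost Model (harmfulness). This is NOT the
# project's actual implementation; "gpt2" is a small stand-in backbone.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class ScoreModel(nn.Module):
    def __init__(self, backbone_name: str = "gpt2"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        # One scalar per sequence: reward (higher = more helpful) or
        # cost (higher = more harmful), depending on the training signal.
        self.score_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                       # (batch, seq_len, hidden)
        # Read the score from the last non-padding token of each sequence.
        last_idx = attention_mask.sum(dim=1) - 1  # (batch,)
        last_hidden = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.score_head(last_hidden).squeeze(-1)  # (batch,)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = ScoreModel("gpt2")

batch = tokenizer(
    ["Q: How do I stay safe online? A: Use strong, unique passwords.",
     "Q: How do I stay safe online? A: Just share your passwords freely."],
    return_tensors="pt", padding=True,
)
scores = model(batch["input_ids"], batch["attention_mask"])

# Pairwise preference loss (Bradley-Terry style): the preferred response
# should receive the higher score.
loss = -torch.nn.functional.logsigmoid(scores[0] - scores[1])
print(scores, loss.item())
```

A Cost Model is trained in the same way, but on harmlessness annotations, so its score increases with the harmfulness of a response.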
Recent Updates
- June 2024: The launch of the PKU-SafeRLHF dataset version 1.0, which includes human-AI joint annotations and explores various harm categories with severity labels.
- January 2024: The Safe RLHF method was accepted at ICLR 2024 as a Spotlight paper, indicating its significance in machine learning research.
- October 2023: The Safe RLHF paper was released, describing a new algorithm for safe alignment and its practical implementation.
Unique Position in the Ecosystem
Safe RLHF distinguishes itself as the first framework to integrate every stage of the RLHF pipeline, from SFT through preference modeling to evaluation, with a special emphasis on safety preferences. By framing alignment as a constrained optimization problem, it provides a theoretical basis for keeping the learned policy within explicit safety constraints, making it a robust choice for researchers focused on safety in AI.
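The core of that theoretical framing is a Lagrangian relaxation: maximize expected reward subject to the expected cost staying under a budget, with a multiplier λ that grows while the constraint is violated. The sketch below illustrates the idea numerically; the multiplier update rule, the (1 + λ) normalization, and the step sizes are illustrative assumptions rather than the project's exact formulation.

```python
# Minimal numeric sketch of the Lagrangian relaxation behind Safe RLHF-style
# constrained policy optimization: maximize expected reward subject to
# expected cost <= budget. Update rules and step sizes are illustrative.
import torch

cost_budget = 0.0                                 # constraint: E[cost] <= budget
log_lambda = torch.zeros(1)                       # lambda = exp(log_lambda) >= 0
lambda_lr = 0.05

def combined_advantage(reward_adv, cost_adv, lam):
    # The policy is updated with a reward/cost trade-off controlled by lambda;
    # dividing by (1 + lambda) keeps the advantage scale stable.
    return (reward_adv - lam * cost_adv) / (1.0 + lam)

for step in range(200):
    # Stand-ins for per-batch PPO statistics; in real training these come
    # from the Reward Model, the Cost Model, and the rollout buffer.
    reward_adv = torch.randn(64)
    cost_adv = torch.randn(64)
    mean_episode_cost = torch.randn(64).mean() + 0.3  # pretend the policy is unsafe

    lam = log_lambda.exp()
    adv = combined_advantage(reward_adv, cost_adv, lam)
    # ... PPO policy/value updates would use `adv` here ...

    # Dual ascent on lambda: grow it while the constraint is violated,
    # shrink it once E[cost] falls below the budget.
    log_lambda += lambda_lr * (mean_episode_cost - cost_budget)

print(f"final lambda = {log_lambda.exp().item():.3f}")
```

In the full algorithm, PPO updates the policy against this combined advantage while the dual update automatically tightens or loosens the safety pressure.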
PKU-SafeRLHF Dataset
This dataset is pivotal: it annotates harms across multiple categories, such as insults, crime, and privacy violations, enabling fine-grained value alignment. The dataset is continually updated and available on Hugging Face, ensuring reproducibility and utility for academic and research purposes.
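The preference data is published on the Hugging Face Hub under the PKU-Alignment organization and can be pulled with the `datasets` library. The split and field names may differ between dataset versions, so treat the printed keys as the source of truth rather than this sketch.

```python
# Minimal sketch of loading the PKU-SafeRLHF preference data from the
# Hugging Face Hub. Split and field names can vary between dataset
# versions, so inspect the printed keys before writing training code.
from datasets import load_dataset

dataset = load_dataset("PKU-Alignment/PKU-SafeRLHF", split="train")

example = dataset[0]
print(dataset)          # number of rows and column names
print(example.keys())   # e.g. prompt, paired responses, preference labels

# Typical usage: each record pairs two responses to one prompt, with
# annotations for which response is preferred (helpfulness) and which
# is safer (harmlessness), feeding the Reward Model and Cost Model.
```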
Why the Name "Beaver"?
The name reflects the project's goal of creating a reliable and safe ecosystem for LLMs, akin to how beavers engineer their habitats to sustain other species. It symbolizes the framework’s effort to construct a safety net for AI models using Constrained Value Alignment (CVA) technology to reduce bias and enhance AI safety.
Beaver vs. Alpaca
Beaver builds on the foundation laid by the Alpaca model, adding human preference data and Safe RLHF training, thereby improving on Alpaca's safety without sacrificing performance.
Setting Up and Training with Beaver
The project is accessible via GitHub and can be set up with conda or Docker for environment isolation. Beaver supports a seamless pipeline from SFT through preference-model training (reward and cost models) to RLHF alignment. The full pipeline assumes substantial computational resources and can accommodate different dataset requirements through modular dataset definitions, several of which are open source.
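Once training completes, or when starting from the checkpoints the team has released on the Hugging Face Hub, an aligned model can be queried with the standard transformers API. The checkpoint name and the conversational prompt template below follow the project's public listings but should be verified against the current model card; generation settings are illustrative.

```python
# Minimal inference sketch with the standard transformers API. The checkpoint
# name below is one of the Beaver releases listed by the project on the
# Hugging Face Hub; verify the exact name and license terms before use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "PKU-Alignment/beaver-7b-v1.0"   # LLaMA-based aligned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# Conversational prompt format; check the model card for the exact template.
prompt = "BEGINNING OF CONVERSATION: USER: How can I secure my home Wi-Fi? ASSISTANT:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```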
Future and Contributions
Beaver aims to contribute significantly to the development of safer AI technologies while inviting community involvement through open-source distribution and collaborative improvements. Researchers and developers are encouraged to explore its comprehensive offerings for advancing safe AI alignments.
By presenting a combined focus on safety and performance, the Beaver project is not only a framework for today but also a stepping stone for the more challenging AI alignment tasks of tomorrow.