PaLM + RLHF - Pytorch (Work in Progress)
The PaLM + RLHF - Pytorch project aims to implement Reinforcement Learning with Human Feedback (RLHF) on top of the Pathways Language Model (PaLM) architecture, with the goal of replicating functionality similar to that of ChatGPT. There are also plans to add retrieval capabilities in the spirit of RETRO, giving the model access to information from large external datasets.
Background and Objective
The primary goal of this project is to provide an open-source implementation of RLHF that developers can use to build language models comparable to ChatGPT. Reinforcement Learning with Human Feedback integrates human preference judgments into a model's training loop, improving its performance on tasks such as language understanding and generation.
Project Community and Collaboration
This project is not being developed in isolation. It connects and overlaps with other efforts, such as CarperAI and its TRLX framework, which were exploring RLHF even before ChatGPT's debut. Prominent figures such as Yannic Kilcher are also working on open-source implementations.
Contribution and Appreciation
The project has received generous support and sponsorship from forward-thinking organizations like Stability.ai, which funds cutting-edge artificial intelligence research. Hugging Face and CarperAI also contribute significantly by providing platforms for collaboration and sharing valuable insights into RLHF through educational content.
Installation and Usage
To contribute to or experiment with this project, install it via pip:
$ pip install palm-rlhf-pytorch
Training the PaLM Model
The project shows how to train PaLM just like any other autoregressive transformer in PyTorch. Here's a simple example to start with:
import torch
from palm_rlhf_pytorch import PaLM

palm = PaLM(
    num_tokens = 20000,
    dim = 512,
    depth = 12,
    flash_attn = True    # use flash attention for faster training
).cuda()

# mock batch of token ids; train with the standard autoregressive loss
seq = torch.randint(0, 20000, (1, 2048)).cuda()

loss = palm(seq, return_loss = True)
loss.backward()
Once sufficiently trained, the model can be used to generate sequences.
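As a concrete illustration, the snippet below samples from the trained model. It is a minimal sketch that assumes the PaLM class exposes a generate method taking the desired sequence length, as in the upstream repository:

# sample a sequence of 2048 tokens from the trained model
generated = palm.generate(2048)  # tensor of token ids, shape (1, 2048)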
Reward Model Training
Additionally, the project includes guidance on training a reward model using curated human feedback, which is essential for refining the output quality and ensuring alignment with human expectations.
import torch
from palm_rlhf_pytorch import PaLM, RewardModel

palm = PaLM(
    num_tokens = 20000,
    dim = 512,
    depth = 12,
    causal = False
)

reward_model = RewardModel(
    palm,
    num_binned_output = 5    # human feedback bucketed into 5 rating bins
).cuda()

# mock data: token sequence, mask marking the prompt portion, and a binned human rating
seq = torch.randint(0, 20000, (1, 1024)).cuda()
prompt_mask = torch.zeros(1, 1024).bool().cuda()
labels = torch.randint(0, 5, (1,)).cuda()

loss = reward_model(seq, prompt_mask = prompt_mask, labels = labels)
loss.backward()
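Once the reward model has been trained, it can be used to score a sequence directly. The sketch below assumes that calling the model without labels returns the predicted reward, as in the upstream repository:

# omitting labels returns the predicted reward for the sequence
reward = reward_model(seq, prompt_mask = prompt_mask)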
RLHF Trainer
The project also provides a framework to combine the trained transformer and the reward model in the RLHF trainer. This ties the pieces together into a reinforcement learning loop in which the language model is fine-tuned against the learned reward.
import torch
from palm_rlhf_pytorch import PaLM, RewardModel, RLHFTrainer

# load the pretrained language model
palm = PaLM(
    num_tokens = 20000,
    dim = 512,
    depth = 12
).cuda()

palm.load('./path/to/pretrained/palm.pt')

# load the reward model trained on human feedback
reward_model = RewardModel(
    palm,
    num_binned_output = 5
).cuda()

reward_model.load('./path/to/pretrained/reward_model.pt')

# mock prompts on which to condition the RL fine-tuning
prompts = torch.randint(0, 256, (50000, 512)).cuda()

trainer = RLHFTrainer(
    palm = palm,
    reward_model = reward_model,
    prompt_token_ids = prompts
)

trainer.train(num_episodes = 50000)
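After the RL loop finishes, the fine-tuned policy can be queried for answers. The example below is a sketch assuming the trainer's generate method takes a maximum length, a prompt, and a number of candidate samples from which the highest-reward completion is returned, as in the upstream repository:

# sample multiple candidates for a single prompt and keep the best-scoring one
answer = trainer.generate(2048, prompt = prompts[0], num_samples = 10)  # (<= 2048,)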
Future Plans and Improvements
The project has a comprehensive roadmap of future enhancements, such as incorporating Hugging Face's accelerate library, exploring state-of-the-art (SOTA) approaches to Proximal Policy Optimization (PPO), and developing a simple web interface for collecting human feedback. There is also an ongoing effort to fold in best practices and new advances from the field as they emerge.
Citations
The project acknowledges numerous academic contributions, ranging from early research on learning to summarize from human feedback to recent attention-stabilization techniques. These citations provide the scholarly backbone on which this project builds.
By applying these foundations, PaLM + RLHF - Pytorch seeks to advance the capabilities of language models while making these advancements accessible to developers and researchers globally.