OpenFedLLM: Collaborative Learning for Large Language Models
OpenFedLLM is a pioneering open-source project designed to facilitate the training of Large Language Models (LLMs) using federated learning, a decentralized approach that respects data privacy. This project provides a comprehensive toolkit for researchers and developers interested in harnessing the power of federated learning to train LLMs more effectively and ethically.
Key Features
OpenFedLLM includes several advanced features to support a range of learning and evaluation needs:
- Federated Learning Algorithms: The project supports seven federated learning algorithms, including widely used methods such as FedAvg, FedProx, and SCAFFOLD. These algorithms let many clients contribute to model training while keeping their data local (a minimal FedAvg sketch appears after this list).
- LLM Training Algorithms: Two training procedures are incorporated: instruction tuning and value alignment. Instruction tuning (i.e., SFT) teaches LLMs to understand and execute specific tasks accurately, while value alignment (i.e., DPO) steers model outputs toward human preferences and ethical standards.
- Evaluation Metrics: With over 30 evaluation metrics available, OpenFedLLM provides extensive tooling to assess model capabilities, spanning general ability, domain-specific question answering (medical and financial), code generation, and math problem solving.
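To make the federated setup concrete, here is a minimal sketch of the FedAvg aggregation step mentioned above, assuming client updates arrive as PyTorch state dicts; the function names are illustrative, not OpenFedLLM's actual API.

# Minimal FedAvg sketch (illustrative; not OpenFedLLM's actual API).
# Each client trains locally, then the server averages the returned
# parameters weighted by client dataset size -- raw data never leaves
# the client.
from collections import OrderedDict

def fedavg(client_states, client_sizes):
    """Average model state dicts, weighted by each client's dataset size."""
    total = sum(client_sizes)
    avg = OrderedDict()
    for key in client_states[0]:
        avg[key] = sum(
            state[key] * (n / total)
            for state, n in zip(client_states, client_sizes)
        )
    return avg

# Usage: three clients return updated weights after local training.
# global_state = fedavg([s1, s2, s3], client_sizes=[1200, 800, 500])
# global_model.load_state_dict(global_state)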
Recent Developments
In June 2024, OpenFedLLM introduced FedLLM-Bench, the first realistic benchmark for evaluating federated learning in LLMs. This benchmark aids researchers in comparing and improving their models.
Getting Started
To begin using OpenFedLLM, clone the repository and set up the environment, installing the necessary packages with the following commands:
git clone --recursive --shallow-submodules https://github.com/rui-ye/OpenFedLLM.git
cd OpenFedLLM
conda create -n fedllm python=3.10
conda activate fedllm
pip install -r requirements.txt
source setup.sh
Training Processes
OpenFedLLM offers detailed scripts for training models using federated learning:
- Federated Instruction Tuning: This process uses the script run_sft.sh. Key parameters such as model_name_or_path, dataset_name, and fed_alg can be customized to fit specific training scenarios and conditions; a schematic of one training round appears after this list.
- Federated Value Alignment: The script run_dpo.sh facilitates training with value alignment, steering model outputs toward human preference data so that the resulting models behave more trustworthily. A sketch of the underlying DPO objective also follows this list.
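To show how these pieces fit together, here is a schematic of one federated instruction-tuning round under FedAvg; local_sft, client.num_examples, and sample_frac are hypothetical placeholders for illustration, not options of run_sft.sh.

# Schematic of one federated instruction-tuning round (hypothetical
# names; run_sft.sh wires up the real OpenFedLLM components).
import random

def local_sft(global_state, client):
    # Placeholder: in practice, load global_state into the model and run
    # supervised fine-tuning steps on the client's private instruction data.
    return global_state

def federated_round(global_state, clients, sample_frac=0.2):
    # The server samples a subset of clients each round, collects their
    # locally fine-tuned weights, and aggregates them with FedAvg
    # (see the fedavg sketch above).
    selected = random.sample(clients, max(1, int(len(clients) * sample_frac)))
    states = [local_sft(global_state, c) for c in selected]
    sizes = [c.num_examples for c in selected]
    return fedavg(states, sizes)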
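For the value-alignment side, the following is a minimal sketch of the DPO loss from the original DPO formulation; the tensor arguments are illustrative and not tied to run_dpo.sh internals.

# Minimal DPO loss sketch (illustrative; not tied to run_dpo.sh internals).
# Inputs are summed log-probabilities of the chosen and rejected responses
# under the policy being trained and under a frozen reference model.
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: how far the policy has moved away from the
    # reference model on each response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss that widens the margin between chosen and rejected.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()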
Evaluation Tools
The evaluation tools are organized in a dedicated directory, offering scripts adapted from high-impact open-source projects for rigorous model assessment. Evaluations include open-ended tests and comparisons using benchmarks such as MT-Bench and Vicuna Bench.
Contribution and Acknowledgments
Researchers and developers who benefit from OpenFedLLM are encouraged to cite the work using the following BibTeX entry:
@article{ye2024openfedllm,
title={OpenFedLLM: Training Large Language Models on Decentralized Private Data via Federated Learning},
author={Ye, Rui and Wang, Wenhao and Chai, Jingyi and Li, Dihan and Li, Zexi and Xu, Yinda and Du, Yaxin and Wang, Yanfeng and Chen, Siheng},
journal={arXiv preprint arXiv:2402.06954},
year={2024}
}
OpenFedLLM represents a significant step in the evolution of machine learning practices, offering a way to train advanced language models while respecting data privacy and enabling collaboration without data sharing.