Introduction to build_MiniLLM_from_scratch
1. Overview
The project "build_MiniLLM_from_scratch" aims at constructing a miniature large language model (LLM) through stages such as pre-training, instruction fine-tuning, reward modeling, and reinforcement learning. The primary goal is to develop a chat model capable of performing simple conversational tasks with a controlled cost. Currently, the project has completed the first two stages.
Key Features
- Utilizes the bert4torch training framework, whose code is simple and efficient.
- Trained checkpoints integrate seamlessly with the transformers package, providing flexibility for inference.
- Optimized file-reading approach that reduces memory usage during training (a generic sketch of the idea follows this list).
- Complete training logs are provided for reproducibility and comparison.
- Includes a self-recognition dataset allowing customization of robot attributes such as name and author.
- The chat model supports multi-turn conversations.
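The project does not spell out its file-reading optimization here; a common pattern for keeping memory usage low when training on large pre-tokenized corpora is to memory-map the token file instead of loading it into RAM. The sketch below illustrates that idea with numpy.memmap and a hypothetical train.bin file of uint16 token ids; the project's actual implementation may differ.

import numpy as np
import torch
from torch.utils.data import Dataset

class MemmapTokenDataset(Dataset):
    """Serves fixed-length samples from a memory-mapped file of token ids."""
    def __init__(self, path, seq_len=1024):
        # np.memmap keeps the file on disk; only the slices actually accessed are paged in
        self.data = np.memmap(path, dtype=np.uint16, mode='r')
        self.seq_len = seq_len

    def __len__(self):
        return (len(self.data) - 1) // self.seq_len

    def __getitem__(self, idx):
        start = idx * self.seq_len
        chunk = self.data[start:start + self.seq_len + 1].astype(np.int64)
        x = torch.from_numpy(chunk[:-1])   # input tokens
        y = torch.from_numpy(chunk[1:])    # next-token targets
        return x, y

# Hypothetical usage: 'train.bin' is a flat binary file of uint16 token ids
# dataset = MemmapTokenDataset('train.bin', seq_len=1024)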
Disclaimer
The model developed through this project is currently limited to simple conversational functionality due to constraints in corpus size, model scale, and the size and quality of fine-tuning data. It is not equipped to answer complex questions.
2. Getting Started
Environment Installation
To set up the environment, use the following commands:
pip install git+https://github.com/Tongjilibo/torch4keras.git
pip install git+https://github.com/Tongjilibo/bert4torch.git@dev
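After installation, a quick sanity check from Python confirms that both packages (plus the assumed torch and transformers prerequisites) are importable; the snippet below uses importlib.metadata, which reports the installed version of any package regardless of whether it exposes a __version__ attribute.

from importlib.metadata import version, PackageNotFoundError

for pkg in ('bert4torch', 'torch4keras', 'torch', 'transformers'):
    try:
        print(f'{pkg}: {version(pkg)}')   # raises PackageNotFoundError if missing
    except PackageNotFoundError:
        print(f'{pkg}: not installed')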
Script Instructions
- Pretraining
  cd pretrain
  torchrun --standalone --nproc_per_node=4 pretrain.py
- Pretraining Inference (Command-line Chat)
  cd pretrain
  python infer.py
- Instruction Fine-Tuning
  cd sft
  python sft.py
- Instruction Fine-Tuning Inference (Command-line Chat)
  cd sft
  python infer.py
- Convert Checkpoints to Transformers Format
  cd docs
  python convert.py
3. Update History
- April 3, 2024: Introduced MiniLLM-0.2B-WithWudao-SFT, trained on 11.57 million samples, supporting multi-turn conversations.
- March 25, 2024: Introduced a new 1.1B model.
- March 16, 2024: Initial commit with pre-trained models MiniLLM-0.2B-NoWudao and MiniLLM-0.2B-WithWudao; and SFT model MiniLLM-0.2B-WithWudao-SFT_Alpaca.
4. Pre-Training
4.1 Pre-Training Corpus
The pre-training corpus includes a variety of datasets such as Chinese Wikipedia, BaiduBaike, C4_zh, WuDaoCorpora, and a portion of medical data from shibing624. The total corpus consists of 63.4 billion tokens.
4.2 Pre-Training Weights and Process
Three sets of pre-training weights have been developed, each with different configurations and dataset coverage. A summary of these configurations includes:
- MiniLLM-0.2B-NoWudao: trained on 14 billion tokens from Chinese Wikipedia, the medical data, and similar sources, using 4 A800 GPUs for about 20 hours.
- MiniLLM-0.2B-WithWudao: trained on 64 billion tokens covering a larger dataset selection, using 4 A800 GPUs over 3.79 days.
- MiniLLM-1.1B-WithWudao: a more extensive setup, using 8 A800 GPUs and completing in a single day.
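For a rough sense of training throughput, the reported figures can be converted into tokens processed per GPU-hour. This is only back-of-the-envelope arithmetic derived from the numbers above, not a measured benchmark; the 1.1B run is omitted because its token count is not stated.

# Approximate throughput implied by the reported figures (tokens per GPU-hour)
runs = {
    'MiniLLM-0.2B-NoWudao':   {'tokens': 14e9, 'gpus': 4, 'hours': 20},
    'MiniLLM-0.2B-WithWudao': {'tokens': 64e9, 'gpus': 4, 'hours': 3.79 * 24},
}

for name, r in runs.items():
    per_gpu_hour = r['tokens'] / (r['gpus'] * r['hours'])
    print(f"{name}: ~{per_gpu_hour / 1e6:.0f}M tokens per GPU-hour")

Both 0.2B runs work out to roughly 175M tokens per GPU-hour, suggesting the two pre-training runs achieved similar per-GPU throughput.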
4.3 Using the Pre-trained Model
A demonstration of loading and running the pre-trained weights with the transformers library is shown below:
from transformers import AutoTokenizer, LlamaForCausalLM
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_name = 'Tongjilibo/MiniLLM-0.2B-WithWudao'

# Load the tokenizer and the converted LLaMA-architecture checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = LlamaForCausalLM.from_pretrained(model_name).to(device)

# The pre-trained (non-SFT) model performs plain text continuation
query = '王鹏是一名'
inputs = tokenizer.encode(query, return_tensors='pt', add_special_tokens=False).to(device)
output_ids = model.generate(inputs)
response = tokenizer.decode(output_ids[0].cpu(), skip_special_tokens=True)
print(response)
5. Instruction Fine-Tuning
5.1 Instruction Fine-Tuning Data
The selection of datasets for instruction fine-tuning includes various high-quality resources such as Alpaca-ZH, BelleGroup's datasets, and others focusing on multi-turn dialogues and diverse task instructions.
5.2 Fine-Tuning Weights and Process
Instruction fine-tuning was performed on multiple weights, notably:
- MiniLLM-0.2B-WithWudao-SFT_Alpaca: trained on over 40,000 samples using a single 4090 GPU for about 45 minutes.
- MiniLLM-0.2B-WithWudao-SFT: trained on 11.57 million samples using dual A800 GPUs for approximately 4.5 days.
These processes aim to refine the model's conversational and task-following abilities, enhancing its utility in chat applications.
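Multi-turn inference with the SFT chat model follows the same transformers pattern as in section 4.3, with the dialogue history folded into the prompt. The sketch below is an assumption-laden illustration: it assumes the converted checkpoint is published under the repo id 'Tongjilibo/MiniLLM-0.2B-WithWudao-SFT' and ships a chat template usable via tokenizer.apply_chat_template. If either assumption does not hold, consult the project's sft/infer.py for the exact prompt format.

from transformers import AutoTokenizer, LlamaForCausalLM
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_name = 'Tongjilibo/MiniLLM-0.2B-WithWudao-SFT'  # assumed repo id; adjust as needed

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = LlamaForCausalLM.from_pretrained(model_name).to(device)

history = []  # accumulated turns as {'role': ..., 'content': ...} dicts
for user_input in ['你好', '介绍一下你自己']:
    history.append({'role': 'user', 'content': user_input})
    # apply_chat_template assumes the checkpoint defines a chat template
    inputs = tokenizer.apply_chat_template(history, add_generation_prompt=True,
                                           return_tensors='pt').to(device)
    output_ids = model.generate(inputs, max_new_tokens=256)
    # Decode only the newly generated tokens, not the prompt
    reply = tokenizer.decode(output_ids[0, inputs.shape[1]:].cpu(), skip_special_tokens=True)
    history.append({'role': 'assistant', 'content': reply})
    print(f'User: {user_input}\nBot: {reply}\n')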