Introduction to build_MiniLLM_from_scratch
1. Overview
The project "build_MiniLLM_from_scratch" aims at constructing a miniature large language model (LLM) through stages such as pre-training, instruction fine-tuning, reward modeling, and reinforcement learning. The primary goal is to develop a chat model capable of performing simple conversational tasks with a controlled cost. Currently, the project has completed the first two stages.
Key Features
- Utilizes the bert4torch training framework, whose code is simple and efficient.
- Trained checkpoints integrate seamlessly with the transformers package, providing flexibility for inference.
- Optimized file-reading approach that reduces memory usage during training (a generic sketch of the idea follows this list).
- Complete training logs are provided for reproducibility and comparison.
- Includes a self-recognition dataset allowing customization of robot attributes such as name and author.
- The chat model supports multi-turn conversations.
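The project does not spell out its file-reading optimization here; a common pattern for keeping memory usage low when training on large pre-tokenized corpora is to memory-map the token file instead of loading it into RAM. The sketch below illustrates that idea with numpy.memmap and a hypothetical train.bin file of uint16 token ids; the project's actual implementation may differ.

import numpy as np
import torch
from torch.utils.data import Dataset

class MemmapTokenDataset(Dataset):
    """Serves fixed-length samples from a memory-mapped file of token ids."""
    def __init__(self, path, seq_len=1024):
        # np.memmap keeps the file on disk; only the slices actually accessed are paged in
        self.data = np.memmap(path, dtype=np.uint16, mode='r')
        self.seq_len = seq_len

    def __len__(self):
        return (len(self.data) - 1) // self.seq_len

    def __getitem__(self, idx):
        start = idx * self.seq_len
        chunk = self.data[start:start + self.seq_len + 1].astype(np.int64)
        x = torch.from_numpy(chunk[:-1])   # input tokens
        y = torch.from_numpy(chunk[1:])    # next-token targets
        return x, y

# Hypothetical usage: 'train.bin' is a flat binary file of uint16 token ids
# dataset = MemmapTokenDataset('train.bin', seq_len=1024)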
Disclaimer
The model developed through this project is currently limited to simple conversational functionality due to constraints in corpus size, model scale, and the size and quality of fine-tuning data. It is not equipped to answer complex questions.
2. Getting Started
Environment Installation
To set up the environment, use the following commands:
pip install git+https://github.com/Tongjilibo/torch4keras.git
pip install git+https://github.com/Tongjilibo/bert4torch.git@dev
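After installation, a quick sanity check from Python confirms that both packages (plus the assumed torch and transformers prerequisites) are importable; the snippet below uses importlib.metadata, which reports the installed version of any package regardless of whether it exposes a __version__ attribute.

from importlib.metadata import version, PackageNotFoundError

for pkg in ('bert4torch', 'torch4keras', 'torch', 'transformers'):
    try:
        print(f'{pkg}: {version(pkg)}')   # raises PackageNotFoundError if missing
    except PackageNotFoundError:
        print(f'{pkg}: not installed')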
Script Instructions
- Pretraining
  cd pretrain
  torchrun --standalone --nproc_per_node=4 pretrain.py
- Pretraining Inference (Command-line Chat)
  cd pretrain
  python infer.py
- Instruction Fine-Tuning
  cd sft
  python sft.py
- Instruction Fine-Tuning Inference (Command-line Chat)
  cd sft
  python infer.py
- Convert Checkpoints to Transformers Format
  cd docs
  python convert.py
3. Update History
- April 3, 2024: Introduced MiniLLM-0.2B-WithWudao-SFT, trained on 11.57 million samples, supporting multi-turn conversations.
- March 25, 2024: Introduced a new 1.1B model.
- March 16, 2024: Initial commit with pre-trained models MiniLLM-0.2B-NoWudao and MiniLLM-0.2B-WithWudao; and SFT model MiniLLM-0.2B-WithWudao-SFT_Alpaca.
4. Pre-Training
4.1 Pre-Training Corpus
The pre-training corpus includes a variety of datasets such as Chinese Wikipedia, BaiduBaike, C4_zh, WuDaoCorpora, and a portion of medical data from shibing624. The total corpus consists of 63.4 billion tokens.
4.2 Pre-Training Weights and Process
Three sets of pre-training weights have been developed, each with different configurations and dataset coverage. A summary of these configurations includes:
- MiniLLM-0.2B-NoWudao: trained on 14 billion tokens from Chinese Wikipedia, the medical data, and similar sources, using 4 A800 GPUs for about 20 hours.
- MiniLLM-0.2B-WithWudao: trained on 64 billion tokens covering a larger dataset selection, using 4 A800 GPUs over 3.79 days.
- MiniLLM-1.1B-WithWudao: a more extensive setup, using 8 A800 GPUs and completing in a single day.
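For a rough sense of training throughput, the reported figures can be converted into tokens processed per GPU-hour. This is only back-of-the-envelope arithmetic derived from the numbers above, not a measured benchmark; the 1.1B run is omitted because its token count is not stated.

# Approximate throughput implied by the reported figures (tokens per GPU-hour)
runs = {
    'MiniLLM-0.2B-NoWudao':   {'tokens': 14e9, 'gpus': 4, 'hours': 20},
    'MiniLLM-0.2B-WithWudao': {'tokens': 64e9, 'gpus': 4, 'hours': 3.79 * 24},
}

for name, r in runs.items():
    per_gpu_hour = r['tokens'] / (r['gpus'] * r['hours'])
    print(f"{name}: ~{per_gpu_hour / 1e6:.0f}M tokens per GPU-hour")

Both 0.2B runs work out to roughly 175M tokens per GPU-hour, suggesting the two pre-training runs achieved similar per-GPU throughput.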
4.3 Using the Pre-trained Model
A demonstration of loading and running the pre-trained weights with the transformers library is shown below:
from transformers import AutoTokenizer, LlamaForCausalLM
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_name = 'Tongjilibo/MiniLLM-0.2B-WithWudao'

# Load the tokenizer and the converted LLaMA-architecture checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = LlamaForCausalLM.from_pretrained(model_name).to(device)

# The pre-trained (non-SFT) model performs plain text continuation
query = '王鹏是一名'
inputs = tokenizer.encode(query, return_tensors='pt', add_special_tokens=False).to(device)
output_ids = model.generate(inputs)
response = tokenizer.decode(output_ids[0].cpu(), skip_special_tokens=True)
print(response)
5. Instruction Fine-Tuning
5.1 Instruction Fine-Tuning Data
The selection of datasets for instruction fine-tuning includes various high-quality resources such as Alpaca-ZH, BelleGroup's datasets, and others focusing on multi-turn dialogues and diverse task instructions.
5.2 Fine-Tuning Weights and Process
Instruction fine-tuning was performed on multiple weights, notably:
- MiniLLM-0.2B-WithWudao-SFT_Alpaca: trained on over 40,000 samples using a single 4090 GPU for about 45 minutes.
- MiniLLM-0.2B-WithWudao-SFT: trained on 11.57 million samples using dual A800 GPUs for approximately 4.5 days.
These processes aim to refine the model's conversational and task-following abilities, enhancing its utility in chat applications.
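Multi-turn inference with the SFT chat model follows the same transformers pattern as in section 4.3, with the dialogue history folded into the prompt. The sketch below is an assumption-laden illustration: it assumes the converted checkpoint is published under the repo id 'Tongjilibo/MiniLLM-0.2B-WithWudao-SFT' and ships a chat template usable via tokenizer.apply_chat_template. If either assumption does not hold, consult the project's sft/infer.py for the exact prompt format.

from transformers import AutoTokenizer, LlamaForCausalLM
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_name = 'Tongjilibo/MiniLLM-0.2B-WithWudao-SFT'  # assumed repo id; adjust as needed

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = LlamaForCausalLM.from_pretrained(model_name).to(device)

history = []  # accumulated turns as {'role': ..., 'content': ...} dicts
for user_input in ['你好', '介绍一下你自己']:
    history.append({'role': 'user', 'content': user_input})
    # apply_chat_template assumes the checkpoint defines a chat template
    inputs = tokenizer.apply_chat_template(history, add_generation_prompt=True,
                                           return_tensors='pt').to(device)
    output_ids = model.generate(inputs, max_new_tokens=256)
    # Decode only the newly generated tokens, not the prompt
    reply = tokenizer.decode(output_ids[0, inputs.shape[1]:].cpu(), skip_special_tokens=True)
    history.append({'role': 'assistant', 'content': reply})
    print(f'User: {user_input}\nBot: {reply}\n')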