Introduction to ChatLM-mini-Chinese: A Compact Conversational Model
Overview
ChatLM-mini-Chinese is a compact conversational language model designed to run efficiently on consumer-grade hardware. Following the growing trend of large language models, this project builds a generative language model from scratch, covering data cleansing, tokenizer training, model pre-training, supervised instruction fine-tuning (SFT), and reinforcement learning from human feedback (RLHF). With only 0.2 billion parameters, ChatLM-mini-Chinese is lightweight enough to pre-train on machines with as little as 4GB of GPU memory, using batch size 1 and either `fp16` or `bf16` precision. For inference, the model requires a minimum of just 512MB of GPU memory when loaded in `float16`.
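As a minimal inference sketch, the snippet below loads the model in `float16` with the Huggingface `transformers` API. It assumes the weights are published as a standard seq2seq checkpoint; the hub id `charent/ChatLM-mini-Chinese`, the device placement, and the example prompt are illustrative assumptions rather than documented usage.

```python
# Minimal float16 inference sketch (assumption: the checkpoint is a standard
# Huggingface seq2seq model and the hub id below is correct).
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "charent/ChatLM-mini-Chinese"  # assumed hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # float16 keeps GPU memory near the 512MB floor
).to("cuda")

prompt = "你好，请介绍一下你自己。"  # "Hello, please introduce yourself."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```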
Key Features
- Transparency: The datasets used for pre-training, instruction fine-tuning, and preference optimization are publicly available.
- Tools: Built on Huggingface NLP framework components such as `transformers`, `accelerate`, `trl`, and `peft`.
- Custom Trainer: Supports pre-training and fine-tuning on single or multiple GPUs, and training can be paused and resumed at any point.
- Pre-training: End-to-end `Text-to-Text` pre-training rather than traditional `mask` prediction.
  - The data cleansing and optimization process is fully documented and public.
  - The tokenizer is trained with multi-process frequency statistics and supports both `sentencepiece` and Huggingface `tokenizers`.
  - Large datasets are streamed and buffered to minimize memory and disk usage, allowing pre-training on systems with 16GB RAM and 4GB GPU memory (see the streaming sketch after this list).
  - Pre-training logs are recorded in full.
- SFT Fine-tuning: The SFT dataset and its processing workflow are provided.
  - The custom trainer supports prompt instruction fine-tuning and can resume from any saved checkpoint.
  - Compatible with the Huggingface `Trainer` for `sequence to sequence` fine-tuning.
  - Also supports the traditional low-learning-rate approach of fine-tuning only the decoder layers.
- RLHF Preference Optimization: Performed with Direct Preference Optimization (DPO).
  - Supports `peft lora` for preference optimization, and the resulting `Lora adapter` can be merged into the base model (a merge sketch follows this list).
- Downstream Task Fine-tuning: Demonstrated with a triplet information extraction task; after fine-tuning, the model retains its conversational capabilities.
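As referenced in the pre-training bullets above, large corpora can be streamed and buffered instead of loaded wholesale. The sketch below shows one way to do this with the Huggingface `datasets` library; the file paths and parquet format are placeholders, not the project's actual data layout.

```python
# Streaming sketch (assumption: pre-training text lives in parquet files;
# the paths below are placeholders, not the project's actual data layout).
from datasets import load_dataset

stream = load_dataset(
    "parquet",
    data_files={"train": "data/pretrain_*.parquet"},  # placeholder path
    split="train",
    streaming=True,  # iterate lazily instead of loading everything into RAM
)
shuffled = stream.shuffle(seed=42, buffer_size=10_000)  # buffered shuffle

for example in shuffled.take(2):
    print(example)
```

A buffered shuffle keeps only `buffer_size` examples in memory at a time, which is what makes pre-training on a 16GB-RAM machine practical.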
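For the LoRA-based preference optimization mentioned above, a trained adapter can be folded back into the base model with `peft`. The sketch below illustrates only that generic merge step; the paths are placeholders, and the DPO training itself (e.g. with `trl`) is not shown.

```python
# LoRA merge sketch (assumption: a DPO-trained LoRA adapter was saved with peft;
# the paths below are placeholders).
from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM

base = AutoModelForSeq2SeqLM.from_pretrained("path/to/base_model")
model = PeftModel.from_pretrained(base, "path/to/dpo_lora_adapter")
merged = model.merge_and_unload()  # fold the LoRA weights into the base model
merged.save_pretrained("path/to/merged_model")
```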
Recent Updates
The project receives regular updates to its datasets, code, and documentation. Key updates have included new model versions, improved data-cleansing techniques, and tokenizer improvements.
Data Collection
The model has been trained using publicly available single-turn dialogue datasets from various sources, which have been cleaned and formatted. Major datasets used include:
- Community Q&A data from webtext2019zh with a refined dataset of 2.6 million entries.
- Baike QA entries reduced to 1.3 million entries after cleansing.
- 790,000 entries from a Chinese medical dialogue dataset.
- Zhihu Q&A data with 970,000 entries post-cleaning.
- Portions of BELLE’s open-source instruction training data, amounting to 3.38 million entries.
- Wikipedia prompts and responses constituting 1.19 million entries post-cleaning.
In total, the pre-training data contains 9.3 million sequences; the evaluation set is kept much smaller to speed up assessment.
Model Specifications
The project employs a T5 (Text-to-Text Transfer Transformer) architecture. The configuration follows T5-base, but the number of encoder and decoder layers is reduced to 10 each. The resulting model has roughly 0.2 billion parameters and a vocabulary of 29,298 tokens.
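A configuration along these lines can be expressed with `transformers`' `T5Config`. Only the layer counts and vocabulary size come from the description above; the width, feed-forward size, and head count are assumed to follow T5-base and may differ from the project's actual settings.

```python
# Configuration sketch (assumptions: d_model, d_ff, and num_heads follow T5-base;
# only the layer counts and vocabulary size are taken from the text above).
from transformers import T5Config, T5ForConditionalGeneration

config = T5Config(
    vocab_size=29298,
    num_layers=10,          # encoder layers, reduced from T5-base's 12
    num_decoder_layers=10,  # decoder layers, reduced from T5-base's 12
    d_model=768,            # assumed T5-base width
    d_ff=3072,              # assumed T5-base feed-forward size
    num_heads=12,           # assumed T5-base head count
)
model = T5ForConditionalGeneration(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")
```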
Hardware Requirements
Training and fine-tuning were carried out on machines equipped with multi-core CPUs and high-performance GPUs, while the streaming data pipeline and mixed-precision training described above keep memory requirements modest.
Conclusion
ChatLM-mini-Chinese showcases an innovative approach to developing compact and efficient language models that are accessible and functional on consumer-grade hardware. This project is particularly valuable for users interested in training or deploying cost-effective AI solutions without the need for extensive computational infrastructure.