Introduction to ChatLM-mini-Chinese: A Compact Conversational Model
Overview
ChatLM-mini-Chinese is a compact conversational language model designed to run efficiently on consumer-grade hardware. Following the growing trend of large language models, this project builds a generative language model from scratch, covering data cleansing, tokenizer training, model pre-training, supervised instruction fine-tuning (SFT), and reinforcement learning from human feedback (RLHF). With only 0.2 billion parameters, ChatLM-mini-Chinese is lightweight enough to pre-train on machines with as little as 4GB of GPU memory, using batch size 1 and either `fp16` or `bf16` precision. For inference, the model requires a minimum of just 512MB of GPU memory when loaded in `float16`.
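As a minimal inference sketch, the snippet below loads the model in `float16` with the Huggingface `transformers` API. It assumes the weights are published as a standard seq2seq checkpoint; the hub id `charent/ChatLM-mini-Chinese`, the device placement, and the example prompt are illustrative assumptions rather than documented usage.

```python
# Minimal float16 inference sketch (assumption: the checkpoint is a standard
# Huggingface seq2seq model and the hub id below is correct).
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "charent/ChatLM-mini-Chinese"  # assumed hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # float16 keeps GPU memory near the 512MB floor
).to("cuda")

prompt = "你好，请介绍一下你自己。"  # "Hello, please introduce yourself."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```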
Key Features
- Transparency: The datasets used for pre-training, instruction fine-tuning, and preference optimization are publicly available.
- Tools: Built on Huggingface NLP framework components such as `transformers`, `accelerate`, `trl`, and `peft`.
- Custom Trainer: Supports pre-training and fine-tuning on single or multiple GPUs, and training can be paused and resumed at any point.
- Pre-training: End-to-end `Text-to-Text` pre-training rather than traditional `mask` prediction.
  - The data cleansing and optimization process is fully documented and public.
  - The tokenizer is trained with multi-process frequency statistics and supports both `sentencepiece` and Huggingface `tokenizers`.
  - Large datasets are streamed and buffered to minimize memory and disk usage, allowing pre-training on systems with 16GB RAM and 4GB GPU memory (see the streaming sketch after this list).
  - Pre-training logs are recorded in full.
- SFT Fine-tuning: The SFT dataset and its processing workflow are provided.
  - The custom trainer supports prompt instruction fine-tuning and can resume from any saved checkpoint.
  - Compatible with the Huggingface `Trainer` for `sequence to sequence` fine-tuning.
  - Also supports the traditional low-learning-rate approach of fine-tuning only the decoder layers.
- RLHF Preference Optimization: Performed with Direct Preference Optimization (DPO).
  - Supports `peft lora` for preference optimization, and the resulting `Lora adapter` can be merged into the base model (a merge sketch follows this list).
- Downstream Task Fine-tuning: Demonstrated with a triplet information extraction task; after fine-tuning, the model retains its conversational capabilities.
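As referenced in the pre-training bullets above, large corpora can be streamed and buffered instead of loaded wholesale. The sketch below shows one way to do this with the Huggingface `datasets` library; the file paths and parquet format are placeholders, not the project's actual data layout.

```python
# Streaming sketch (assumption: pre-training text lives in parquet files;
# the paths below are placeholders, not the project's actual data layout).
from datasets import load_dataset

stream = load_dataset(
    "parquet",
    data_files={"train": "data/pretrain_*.parquet"},  # placeholder path
    split="train",
    streaming=True,  # iterate lazily instead of loading everything into RAM
)
shuffled = stream.shuffle(seed=42, buffer_size=10_000)  # buffered shuffle

for example in shuffled.take(2):
    print(example)
```

A buffered shuffle keeps only `buffer_size` examples in memory at a time, which is what makes pre-training on a 16GB-RAM machine practical.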
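For the LoRA-based preference optimization mentioned above, a trained adapter can be folded back into the base model with `peft`. The sketch below illustrates only that generic merge step; the paths are placeholders, and the DPO training itself (e.g. with `trl`) is not shown.

```python
# LoRA merge sketch (assumption: a DPO-trained LoRA adapter was saved with peft;
# the paths below are placeholders).
from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM

base = AutoModelForSeq2SeqLM.from_pretrained("path/to/base_model")
model = PeftModel.from_pretrained(base, "path/to/dpo_lora_adapter")
merged = model.merge_and_unload()  # fold the LoRA weights into the base model
merged.save_pretrained("path/to/merged_model")
```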
Recent Updates
The project receives regular updates to its datasets, code, and documentation. Key updates have included new model versions, improved data-cleansing techniques, and tokenizer improvements.
Data Collection
The model has been trained using publicly available single-turn dialogue datasets from various sources, which have been cleaned and formatted. Major datasets used include:
- Community Q&A data from webtext2019zh with a refined dataset of 2.6 million entries.
- Baike QA entries reduced to 1.3 million entries after cleansing.
- 790,000 entries from a Chinese medical dialogue dataset.
- Zhihu Q&A data with 970,000 entries post-cleaning.
- Portions of BELLE’s open-source instruction training data, amounting to 3.38 million entries.
- Wikipedia prompts and responses constituting 1.19 million entries post-cleaning.
In total, the pre-training data contains 9.3 million sequences; the evaluation set is kept much smaller to speed up assessment.
Model Specifications
The project employs a T5 (Text-to-Text Transfer Transformer) architecture. The configuration follows T5-base, but the number of encoder and decoder layers is reduced to 10 each. The resulting model has roughly 0.2 billion parameters and a vocabulary of 29,298 tokens.
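A configuration along these lines can be expressed with `transformers`' `T5Config`. Only the layer counts and vocabulary size come from the description above; the width, feed-forward size, and head count are assumed to follow T5-base and may differ from the project's actual settings.

```python
# Configuration sketch (assumptions: d_model, d_ff, and num_heads follow T5-base;
# only the layer counts and vocabulary size are taken from the text above).
from transformers import T5Config, T5ForConditionalGeneration

config = T5Config(
    vocab_size=29298,
    num_layers=10,          # encoder layers, reduced from T5-base's 12
    num_decoder_layers=10,  # decoder layers, reduced from T5-base's 12
    d_model=768,            # assumed T5-base width
    d_ff=3072,              # assumed T5-base feed-forward size
    num_heads=12,           # assumed T5-base head count
)
model = T5ForConditionalGeneration(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")
```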
Hardware Requirements
Training and fine-tuning were carried out on machines equipped with multi-core CPUs and high-performance GPUs, while the streaming data pipeline and mixed-precision training described above keep memory requirements modest.
Conclusion
ChatLM-mini-Chinese showcases an innovative approach to developing compact and efficient language models that are accessible and functional on consumer-grade hardware. This project is particularly valuable for users interested in training or deploying cost-effective AI solutions without the need for extensive computational infrastructure.