Introduction to the Phi2-mini-Chinese Project
The Phi2-mini-Chinese project is an experimental initiative focused on training a small-scale Chinese language model from scratch. The project open-sources both the code and the model weights. Because the data available for pretraining is limited, users who need a more capable Chinese language model may want to look at the ChatLM-mini-Chinese project as a reference instead.
Features and Considerations
- The project is experimental, with potential for major changes in training data, model structure, and file organization.
- Flash Attention 2 is supported for faster attention computation (see the sketch after this list).
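Below is a minimal sketch of enabling Flash Attention 2 when loading a causal LM with the Hugging Face transformers library; the repository id, the dtype, and the availability of the attn_implementation flag (recent transformers plus the flash-attn package) are assumptions, not the project's exact loading code.

```python
import torch
from transformers import AutoModelForCausalLM

# Sketch only: "charent/Phi2-Chinese-0.2B" is an assumed checkpoint id.
# Flash Attention 2 requires a CUDA GPU, the flash-attn package, and fp16/bf16 weights.
model = AutoModelForCausalLM.from_pretrained(
    "charent/Phi2-Chinese-0.2B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```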
Data Processing
Data cleaning is crucial for ensuring quality input. Typical steps include adding end-of-sentence punctuation where it is missing, converting traditional Chinese characters to simplified ones, removing excessive punctuation, and normalizing Unicode characters. Detailed procedures can be found in the ChatLM-mini-Chinese project repository.
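The cleaning steps above can be sketched as follows, assuming the opencc package for traditional-to-simplified conversion; the helper name and the exact rules are illustrative and not the project's actual implementation.

```python
import re
import unicodedata

from opencc import OpenCC  # assumed dependency for traditional -> simplified conversion

t2s = OpenCC("t2s")  # config name may be "t2s.json" depending on the installed opencc binding

def clean_line(text: str) -> str:
    """Illustrative cleaning: normalize Unicode, simplify characters, fix punctuation."""
    text = unicodedata.normalize("NFKC", text)        # normalize Unicode (full-width forms, etc.)
    text = t2s.convert(text)                          # traditional Chinese -> simplified
    text = re.sub(r"[。！？!?.]{2,}", "。", text)      # collapse runs of end punctuation
    text = text.strip()
    if text and text[-1] not in "。！？!?.":           # append a period when end punctuation is missing
        text += "。"
    return text
```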
Tokenizer Training
The project utilizes a "byte level" BPE tokenizer. Alongside, training codes for "char level" and other tokenizers are provided. Post-training, it is essential to verify the vocabulary, ensuring it includes special characters like \t
and \n
. Memory-intense, tokenizer training requires ample resources, typically around 32GB RAM.
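A minimal sketch of training a byte-level BPE tokenizer with the Hugging Face tokenizers library and checking that \t and \n survive a round trip; the corpus path, vocabulary size, and special-token list are assumptions.

```python
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["./data/corpus.txt"],          # placeholder corpus path
    vocab_size=32000,                     # placeholder vocabulary size
    min_frequency=2,
    special_tokens=["[PAD]", "[EOS]", "[UNK]"],
)

# Byte-level BPE covers every byte, but it is still worth confirming that
# whitespace characters such as \t and \n are preserved after decoding.
sample = "第一行\n\t缩进的第二行"
ids = tokenizer.encode(sample).ids
assert tokenizer.decode(ids) == sample

tokenizer.save_model("./my_tokenizer")
```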
CLM Pretraining
The causal language model (CLM) is pretrained in an unsupervised fashion on corpora such as the BELLE dataset. For each sample the input text and the label text are identical, and the cross-entropy loss is computed over the next-token predictions. Appending an [EOS] marker after each entry is recommended, and the same practice applies to the other data formats.
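A minimal sketch of preparing one pretraining sample as described above: the labels are a copy of the input ids, an [EOS] token is appended, and the one-position shift for next-token prediction is handled inside the transformers causal LM loss. The function name and maximum length are assumptions.

```python
def build_clm_example(text: str, tokenizer, max_length: int = 512) -> dict:
    """Illustrative: input and label are the same sequence for causal LM pretraining."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    ids = ids[: max_length - 1] + [tokenizer.eos_token_id]  # append [EOS] after each entry
    return {
        "input_ids": ids,
        # Labels equal the inputs; the model shifts them internally when
        # computing the next-token cross-entropy loss.
        "labels": list(ids),
    }
```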
SFT Instruction Tuning
The project leverages "bell open source" datasets for instruction fine-tuning (SFT). The SFT training format is designed to prioritize response generation, ignoring text before the "##回答:" marker. Adding an [EOS] token ensures proper termination during text decoding.
RLHF Optimization
The project incorporates a simplified DPO (Direct Preference Optimization) implementation for preference alignment, which also keeps memory usage modest. Two models are involved: a trainable policy model and a frozen reference model. For effective optimization, the preference dataset should be constructed carefully, keeping the 'prompt', 'chosen', and 'rejected' fields separate.
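The core of DPO and the expected dataset layout can be sketched as follows; this is the standard DPO objective computed from sequence log-probabilities, not the project's exact code, and the example texts are invented placeholders.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Illustrative DPO loss: the policy model is trainable, the reference model stays frozen."""
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    # Push the policy to prefer 'chosen' over 'rejected' more strongly than the reference does.
    return -F.logsigmoid(beta * (pi_logratios - ref_logratios)).mean()

# Each preference example keeps the three fields separate:
example = {
    "prompt": "写一首关于春天的诗。",
    "chosen": "春风拂过原野，百花悄然开放。",
    "rejected": "我不知道。",
}
```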
Application and Usage
General Conversation
The model is available on Hugging Face under the repository "Phi2-Chinese-0.2B". It can be loaded with the standard transformers API and prompted for conversational interactions.
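A minimal usage sketch with the transformers API; the full repository id, the prompt template (mirroring the SFT format), and the generation settings are assumptions, so check the model card for the exact usage.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "charent/Phi2-Chinese-0.2B"  # assumed full Hugging Face id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

prompt = "感冒了要怎么办？##回答:"  # assumed template that mirrors the SFT format
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    eos_token_id=tokenizer.eos_token_id,
)
# Print only the newly generated answer, skipping the prompt tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```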
Retrieval-Based Generation
For retrieval-based conversational tasks, guidance and examples can be found in the notebook rag_with_langchain.ipynb.
Citation
If the Phi2-mini-Chinese project has been helpful in your work, please consider citing it; a citation template (Charent Chen, 2023, GitHub) is provided in the repository.
Final Notes
The project does not assume liability for data security risks or misuse resulting from the open-source model and code. Users are advised to proceed with caution and responsibility.