Introduction to the Phi2-mini-Chinese Project
The Phi2-mini-Chinese project is an experimental initiative focused on training a small-scale Chinese language model from scratch. The project open-sources both the code and the model weights. Because the data available for pretraining is limited, users who need a more capable Chinese language model may want to look at the ChatLM-mini-Chinese project as a reference instead.
Features and Considerations
- The project is experimental, with potential for major changes in training data, model structure, and file organization.
- Flash Attention 2 is supported for faster attention computation (see the sketch after this list).
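Below is a minimal sketch of enabling Flash Attention 2 when loading a causal LM with the Hugging Face transformers library; the repository id, the dtype, and the availability of the attn_implementation flag (recent transformers plus the flash-attn package) are assumptions, not the project's exact loading code.

```python
import torch
from transformers import AutoModelForCausalLM

# Sketch only: "charent/Phi2-Chinese-0.2B" is an assumed checkpoint id.
# Flash Attention 2 requires a CUDA GPU, the flash-attn package, and fp16/bf16 weights.
model = AutoModelForCausalLM.from_pretrained(
    "charent/Phi2-Chinese-0.2B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```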
Data Processing
Data cleaning is crucial for ensuring quality input. Typical steps include adding end-of-sentence punctuation where it is missing, converting traditional Chinese characters to simplified ones, removing excessive punctuation, and normalizing Unicode characters. Detailed procedures can be found in the ChatLM-mini-Chinese project repository.
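The cleaning steps above can be sketched as follows, assuming the opencc package for traditional-to-simplified conversion; the helper name and the exact rules are illustrative and not the project's actual implementation.

```python
import re
import unicodedata

from opencc import OpenCC  # assumed dependency for traditional -> simplified conversion

t2s = OpenCC("t2s")  # config name may be "t2s.json" depending on the installed opencc binding

def clean_line(text: str) -> str:
    """Illustrative cleaning: normalize Unicode, simplify characters, fix punctuation."""
    text = unicodedata.normalize("NFKC", text)        # normalize Unicode (full-width forms, etc.)
    text = t2s.convert(text)                          # traditional Chinese -> simplified
    text = re.sub(r"[。！？!?.]{2,}", "。", text)      # collapse runs of end punctuation
    text = text.strip()
    if text and text[-1] not in "。！？!?.":           # append a period when end punctuation is missing
        text += "。"
    return text
```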
Tokenizer Training
The project utilizes a "byte level" BPE tokenizer. Alongside, training codes for "char level" and other tokenizers are provided. Post-training, it is essential to verify the vocabulary, ensuring it includes special characters like \t
and \n
. Memory-intense, tokenizer training requires ample resources, typically around 32GB RAM.
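A minimal sketch of training a byte-level BPE tokenizer with the Hugging Face tokenizers library and checking that \t and \n survive a round trip; the corpus path, vocabulary size, and special-token list are assumptions.

```python
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["./data/corpus.txt"],          # placeholder corpus path
    vocab_size=32000,                     # placeholder vocabulary size
    min_frequency=2,
    special_tokens=["[PAD]", "[EOS]", "[UNK]"],
)

# Byte-level BPE covers every byte, but it is still worth confirming that
# whitespace characters such as \t and \n are preserved after decoding.
sample = "第一行\n\t缩进的第二行"
ids = tokenizer.encode(sample).ids
assert tokenizer.decode(ids) == sample

tokenizer.save_model("./my_tokenizer")
```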
CLM Pretraining
The causal language model (CLM) is pretrained in an unsupervised fashion on corpora such as the BELLE dataset. For each sample the input text and the label text are identical, and the cross-entropy loss is computed over the next-token predictions. Appending an [EOS] marker after each entry is recommended, and the same practice applies to the other data formats.
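A minimal sketch of preparing one pretraining sample as described above: the labels are a copy of the input ids, an [EOS] token is appended, and the one-position shift for next-token prediction is handled inside the transformers causal LM loss. The function name and maximum length are assumptions.

```python
def build_clm_example(text: str, tokenizer, max_length: int = 512) -> dict:
    """Illustrative: input and label are the same sequence for causal LM pretraining."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    ids = ids[: max_length - 1] + [tokenizer.eos_token_id]  # append [EOS] after each entry
    return {
        "input_ids": ids,
        # Labels equal the inputs; the model shifts them internally when
        # computing the next-token cross-entropy loss.
        "labels": list(ids),
    }
```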
SFT Instruction Tuning
The project leverages "bell open source" datasets for instruction fine-tuning (SFT). The SFT training format is designed to prioritize response generation, ignoring text before the "##回答:" marker. Adding an [EOS] token ensures proper termination during text decoding.
RLHF Optimization
The project incorporates a simplified DPO (Direct Preference Optimization) implementation for preference alignment, which also keeps memory usage modest. Two models are involved: a trainable policy model and a frozen reference model. For effective optimization, the preference dataset should be constructed carefully, keeping the 'prompt', 'chosen', and 'rejected' fields separate.
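The core of DPO and the expected dataset layout can be sketched as follows; this is the standard DPO objective computed from sequence log-probabilities, not the project's exact code, and the example texts are invented placeholders.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Illustrative DPO loss: the policy model is trainable, the reference model stays frozen."""
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    # Push the policy to prefer 'chosen' over 'rejected' more strongly than the reference does.
    return -F.logsigmoid(beta * (pi_logratios - ref_logratios)).mean()

# Each preference example keeps the three fields separate:
example = {
    "prompt": "写一首关于春天的诗。",
    "chosen": "春风拂过原野，百花悄然开放。",
    "rejected": "我不知道。",
}
```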
Application and Usage
General Conversation
The model is available on Hugging Face under the repository "Phi2-Chinese-0.2B". It can be loaded with the standard transformers API and prompted for conversational interactions.
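A minimal usage sketch with the transformers API; the full repository id, the prompt template (mirroring the SFT format), and the generation settings are assumptions, so check the model card for the exact usage.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "charent/Phi2-Chinese-0.2B"  # assumed full Hugging Face id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

prompt = "感冒了要怎么办？##回答:"  # assumed template that mirrors the SFT format
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    eos_token_id=tokenizer.eos_token_id,
)
# Print only the newly generated answer, skipping the prompt tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```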
Retrieval-Based Generation
For retrieval-based conversational tasks, guidance and examples can be found in the notebook rag_with_langchain.ipynb.
Citation
If the Phi2-mini-Chinese project has been helpful in your work, please consider citing it; a citation template (Charent Chen, 2023, GitHub) is provided in the repository.
Final Notes
The project does not assume liability for data security risks or misuse resulting from the open-source model and code. Users are advised to proceed with caution and responsibility.