Introduction to the Chinese-Mixtral Project
Overview
The Chinese-Mixtral project is built on Mistral.ai's Mixtral model, which uses a Sparse Mixture of Experts (Sparse MoE) architecture. The project performed incremental Chinese pretraining on large-scale unlabeled Chinese data to produce the Chinese-Mixtral base model, and then applied instruction fine-tuning to create the Chinese-Mixtral-Instruct model. Notably, the model natively supports a 32K context length (tested up to 128K), enabling effective handling of long texts, and it shows clear performance improvements in areas such as mathematical reasoning and code generation. Quantized inference with llama.cpp requires a minimum of about 16 GB of RAM (or VRAM).
Main Features
- Open-Source Models: The project open-sources the Chinese-Mixtral base model, incrementally pretrained in Chinese on top of Mixtral-8x7B-v0.1, and the Chinese-Mixtral-Instruct model, which is further fine-tuned on instruction data.
- Training Scripts and Tutorials: It provides scripts for both pretraining and instruction fine-tuning, allowing users to further train or fine-tune the models as needed.
- Efficient Deployment: Tutorials are available for quickly deploying and quantizing large models on personal computers using CPU/GPU.
- Ecosystem Compatibility: Supported ecosystems include 🤗Transformers, llama.cpp, and more, making the models accessible across a range of applications (a minimal 🤗Transformers loading example follows this list).
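As a quick illustration of the 🤗Transformers route, the sketch below loads the instruction model and generates a short reply. The Hugging Face repository name is an assumption; check the project's model download page for the exact identifier.

```python
# Minimal sketch: loading Chinese-Mixtral-Instruct with 🤗Transformers.
# The model ID "hfl/chinese-mixtral-instruct" is an assumption; verify the
# actual repository name on the project's download page.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "hfl/chinese-mixtral-instruct"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # or use 4-bit loading via bitsandbytes to cut VRAM
    device_map="auto",           # spread layers across available GPUs/CPU
)

prompt = "请简要介绍稀疏混合专家模型。"  # "Briefly introduce sparse mixture-of-experts models."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```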
Sparse Mixture of Experts Model
The Mixtral model stands out with its unique Sparse Mixture of Experts structure, featuring:
- Eight distinct "experts" (feed-forward sub-networks) per FFN layer, of which only the two with the highest gating scores are activated for each token (a minimal routing sketch follows this list).
- Each token in the input sequence selects experts independently, rather than applying the same set to the entire sequence.
- Though the model has around 46.7 billion parameters in total, only about 13 billion are active per token during inference.
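To make the routing concrete, here is a minimal top-2 MoE layer in PyTorch. It is an illustrative sketch of the mechanism described above, not Mixtral's actual implementation; the dimensions and expert definitions are arbitrary.

```python
# Illustrative top-2 sparse MoE feed-forward layer (not Mixtral's actual code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts, bias=False)  # router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                     # x: (tokens, d_model)
        logits = self.gate(x)                 # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # renormalize over the chosen experts
        out = torch.zeros_like(x)
        # Each token is processed only by its own top-2 experts.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(10, 64)                  # 10 tokens, hidden size 64
print(Top2MoE()(tokens).shape)                # torch.Size([10, 64])
```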
Extended Context Support
Unlike models such as Chinese-LLaMA-Alpaca, Mixtral natively supports a 32K context length, tested up to 128K. This allows users to tackle tasks requiring varied text lengths with a single model.
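As a practical sanity check before relying on the long-context capability, one can count the tokens of a document and confirm it fits within the 32K window. The repository name and input file below are assumptions used for illustration.

```python
# Sketch: verify a long document fits within the 32K-token context window
# before sending it to the model. The repository name and file are assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-mixtral-instruct")
with open("long_report.txt", encoding="utf-8") as f:  # hypothetical input file
    document = f.read()

n_tokens = len(tokenizer(document)["input_ids"])
print(f"{n_tokens} tokens; fits in 32K window: {n_tokens <= 32 * 1024}")
```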
Model Download
The project offers models in different formats to cater to varying needs:
- Full Models: Ready for immediate use, recommended for users with sufficient network bandwidth.
- LoRA Models: Weight deltas that must be merged with the original Mixtral-8x7B-v0.1 model before use (see the merge sketch after this list); suitable for users with limited bandwidth.
- GGUF Models: Quantized models for inference and deployment, recommended for users focusing solely on deployment.
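For the LoRA route, a merge along these lines with 🤗PEFT is typically sufficient. The repository names are assumptions, and the project's own merge script should be preferred if one is provided.

```python
# Sketch: merging a Chinese-Mixtral LoRA onto the original Mixtral-8x7B-v0.1
# base with 🤗PEFT. Repository names are assumptions; prefer the project's own
# merge script if one is provided.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1", torch_dtype=torch.bfloat16, device_map="auto"
)
merged = PeftModel.from_pretrained(base, "hfl/chinese-mixtral-lora").merge_and_unload()

merged.save_pretrained("chinese-mixtral-merged")  # full merged weights
AutoTokenizer.from_pretrained("hfl/chinese-mixtral-lora").save_pretrained("chinese-mixtral-merged")
```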
Inference and Deployment
Several methods are available for model quantization, inference, and deployment, such as:
- llama.cpp for efficient local inference with various quantization options (see the GGUF sketch after this list).
- 🤗Transformers for native inference, plus an OpenAI-compatible API demo for serving the model behind an interface familiar from popular frameworks.
- text-generation-webui for deploying with a GUI for easier interaction.
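As a concrete example of the llama.cpp path, the sketch below runs a quantized GGUF file through the llama-cpp-python bindings. The GGUF filename is an assumption and should match whichever quantization you downloaded.

```python
# Sketch: local inference over a quantized GGUF model with llama-cpp-python.
# The file name below is an assumption; use the quantization you downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="chinese-mixtral-instruct.Q4_K_M.gguf",  # assumed local GGUF file
    n_ctx=32768,      # take advantage of the 32K context window
    n_gpu_layers=-1,  # offload all layers to GPU if VRAM allows; use 0 for CPU-only
)

out = llm("请用三句话介绍一下你自己。", max_tokens=128)  # "Introduce yourself in three sentences."
print(out["choices"][0]["text"])
```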
Model Performance
The project's models are evaluated through both qualitative and quantitative metrics:
- Qualitative Assessment: Model responses can be compared via an online arena platform, which provides matchups and Elo ratings.
- Quantitative Assessment: Objective benchmarks such as C-Eval evaluate the models with multiple-choice questions spanning a broad range of subjects, measuring comprehension and performance across many topics (a generic scoring sketch follows this list).
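For a sense of how multiple-choice benchmarks like C-Eval are commonly scored, the sketch below compares the model's next-token logits for the option letters. This is a generic technique, not the project's official evaluation pipeline, and the model ID and example question are assumptions.

```python
# Sketch: scoring a single multiple-choice question by comparing the model's
# next-token logits for the option letters A-D. Generic technique, not the
# project's official C-Eval pipeline; the model ID is an assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "hfl/chinese-mixtral-instruct"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

question = (
    "下列哪个选项是质数？\n"  # "Which of the following is a prime number?"
    "A. 21\nB. 27\nC. 29\nD. 33\n答案："  # "答案：" = "Answer:"
)
inputs = tokenizer(question, return_tensors="pt").to(model.device)
with torch.no_grad():
    next_logits = model(**inputs).logits[0, -1]  # logits for the next token

options = ["A", "B", "C", "D"]
option_ids = [tokenizer(o, add_special_tokens=False)["input_ids"][-1] for o in options]
pred = options[int(torch.stack([next_logits[i] for i in option_ids]).argmax())]
print("Predicted answer:", pred)  # expected: C
```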
The Chinese-Mixtral project represents a significant advancement in model adaptation for the Chinese language, providing versatile tools and models for developers and researchers. By supporting extensive context lengths and demonstrating robust performance in specialized tasks, it sets a benchmark for language-specific model development.