Introduction to Chinese-Mixtral-8x7B
Chinese-Mixtral-8x7B is a project aimed at improving a bilingual foundation model's ability to understand and generate Chinese text. It builds on Mixtral-8x7B, the sparse mixture-of-experts model released by Mistral AI, and extends its vocabulary with Chinese tokens to make encoding and decoding of Chinese more efficient. The extended model is then adapted through incremental pre-training on a large amount of publicly available Chinese text.
The project offers the following open-source resources:
- A large-scale Chinese Mixtral-8x7B model with an extended vocabulary.
- Code for incremental pre-training, used to adapt the model to the extended vocabulary.
Key News and Updates
- February 9, 2024: The fine-tuned version of Chinese-Mixtral-8x7B, named "Huozi 3.0," was released. Additionally, the instruction tuning code was made publicly available.
- January 18, 2024: The base model of Chinese-Mixtral-8x7B and the incremental pre-training code were released.
Downloading the Model
Chinese-Mixtral-8x7B can be downloaded in different forms depending on your needs. Because the model was trained with QLoRA, both the LoRA adapter weights and the merged full-model weights are available:
- Chinese-Mixtral-8x7B (88GB): Available on HuggingFace and ModelScope. This is the full merged model with the extended Chinese vocabulary.
- Chinese-Mixtral-8x7B-adapter (2.7GB): Available on HuggingFace. It contains only the LoRA adapter weights and must be merged with the original Mixtral-8x7B using the project's merge script before use; a generic sketch of such a merge follows below.
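For readers curious what such a merge involves, here is a minimal generic sketch based on peft rather than the project's actual merge script; the adapter path is a placeholder for the downloaded adapter files, and merging the full model requires substantial memory:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the original base model and the extended tokenizer that ships with Chinese-Mixtral-8x7B.
base = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("HIT-SCIR/Chinese-Mixtral-8x7B")

# The extended vocabulary adds Chinese tokens, so the embedding matrix must be resized first.
base.resize_token_embeddings(len(tokenizer))

# Apply the LoRA adapter weights (placeholder local path) and fold them into the base weights.
model = PeftModel.from_pretrained(base, "path/to/Chinese-Mixtral-8x7B-adapter")
merged = model.merge_and_unload()
merged.save_pretrained("Chinese-Mixtral-8x7B-merged")
tokenizer.save_pretrained("Chinese-Mixtral-8x7B-merged")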
Model Inference
Chinese-Mixtral-8x7B is fully compatible with the existing Mixtral-8x7B ecosystem. Inference can be accelerated with vLLM or Flash Attention 2, and the model can be quantized with bitsandbytes.
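For example, serving the model with vLLM could look roughly like the sketch below; the sampling parameters and tensor_parallel_size are illustrative placeholders, not values from the project:
from vllm import LLM, SamplingParams

# Load the model with vLLM; set tensor_parallel_size to the number of available GPUs.
llm = LLM(model="HIT-SCIR/Chinese-Mixtral-8x7B", tensor_parallel_size=4)
sampling_params = SamplingParams(temperature=0.8, max_tokens=20)

outputs = llm.generate(["我的名字是"], sampling_params)
print(outputs[0].outputs[0].text)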
Example using Flash Attention 2:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HIT-SCIR/Chinese-Mixtral-8x7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load with Flash Attention 2 (requires the flash-attn package and a supported GPU).
model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="flash_attention_2", torch_dtype=torch.bfloat16, device_map="auto")

text = "我的名字是"
inputs = tokenizer(text, return_tensors="pt").to(0)  # move inputs to the first GPU
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Example using 4-bit quantization:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HIT-SCIR/Chinese-Mixtral-8x7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the model in 4-bit precision via bitsandbytes to reduce GPU memory usage.
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, device_map="auto")

text = "我的名字是"
inputs = tokenizer(text, return_tensors="pt").to(0)  # move inputs to the first GPU
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
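On recent transformers versions, passing load_in_4bit directly may emit a deprecation warning; an explicit BitsAndBytesConfig can be used instead. The NF4/bfloat16 settings below are commonly used choices, not values prescribed by the project:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization via an explicit bitsandbytes config (NF4 quantization, bfloat16 compute).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "HIT-SCIR/Chinese-Mixtral-8x7B",
    quantization_config=bnb_config,
    device_map="auto",
)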
Model Performance
Comprehensive Abilities
Chinese-Mixtral-8x7B's performance was evaluated using datasets designed to test both Chinese and English capabilities:
- C-Eval: A comprehensive Chinese evaluation suite with questions across various subjects and difficulty levels.
- CMMLU: Assesses Chinese language models on knowledge and reasoning regarding Chinese context across multiple topics.
- MMLU: English tasks covering math, history, computer science, and more, for benchmarking large language models.
- HellaSwag: An English commonsense reasoning benchmark in which the model must choose the most plausible continuation of a given context.
Despite being trained on only a fraction of the data used by comparable models, Chinese-Mixtral-8x7B showed notable potential, retaining strong English understanding and generation capabilities alongside its improved handling of Chinese.
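The project's own evaluation scripts are not reproduced here, but one common way to run these benchmarks is EleutherAI's lm-evaluation-harness; the following is a rough sketch, assuming the task names below are available in your installed harness version:
import lm_eval

# Evaluate on the four benchmarks; exact task names depend on the harness version installed.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=HIT-SCIR/Chinese-Mixtral-8x7B,dtype=bfloat16",
    tasks=["ceval-valid", "cmmlu", "mmlu", "hellaswag"],
    batch_size=8,
)
print(results["results"])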
Generation Effectiveness and Efficiency
Chinese-Mixtral-8x7B demonstrated competitive performance in generation tasks. Its extended vocabulary also makes encoding and decoding of Chinese text more efficient: the same Chinese text is represented with fewer tokens, which speeds up inference and leaves more of the context window available for complex reasoning or long-sequence processing.
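As a quick way to see this effect, the sketch below counts how many tokens the original and the extended tokenizers need for the same Chinese sentence (the sample sentence is arbitrary):
from transformers import AutoTokenizer

original = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
extended = AutoTokenizer.from_pretrained("HIT-SCIR/Chinese-Mixtral-8x7B")

text = "自然语言处理是人工智能的一个重要研究方向。"
# The extended vocabulary should represent the same Chinese text with fewer tokens.
print("original:", len(original.encode(text)))
print("extended:", len(extended.encode(text)))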
Training Details
Vocabulary Expansion
The project extended the original vocabulary using sentencepiece, training candidate tokenizers on large Chinese corpora to determine a suitable token count and configuration. The chosen vocabulary was then used to initialize the new token embeddings in the model, as sketched below.
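A simplified sketch of this pipeline, assuming a plain-text Chinese corpus at a hypothetical path zh_corpus.txt; the vocabulary size, segmentation algorithm, and merging strategy are placeholders rather than the project's actual settings:
import sentencepiece as spm
from transformers import AutoModelForCausalLM, AutoTokenizer

# Train a candidate Chinese tokenizer on the corpus (BPE and 32000 are placeholder choices).
spm.SentencePieceTrainer.train(
    input="zh_corpus.txt",
    model_prefix="zh_bpe",
    vocab_size=32000,
    model_type="bpe",
)

# Collect the learned pieces and add those missing from the original Mixtral vocabulary.
sp = spm.SentencePieceProcessor(model_file="zh_bpe.model")
new_pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
tokenizer.add_tokens([p for p in new_pieces if p not in tokenizer.get_vocab()])

# Resize the embedding matrix so the new tokens receive embedding rows before pre-training.
model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
model.resize_token_embeddings(len(tokenizer))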
Incremental Pre-Training
To keep large-scale training manageable, QLoRA was used to sharply reduce memory requirements while maintaining performance close to that of full-parameter training. This made effective incremental pre-training possible without extensive hardware resources.
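A minimal sketch of such a QLoRA setup with peft and bitsandbytes; the model path, rank, alpha, and target modules below are illustrative placeholders rather than the project's actual hyperparameters:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the vocabulary-extended base model (placeholder path) in 4-bit precision.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "path/to/vocab-extended-mixtral",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Attach LoRA adapters to the attention projections (an illustrative choice of target modules).
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapter weights are trainable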
Overall, the Chinese-Mixtral-8x7B project is a meaningful step forward for open bilingual language models, particularly for tasks that require deep understanding and generation of Chinese content.