Firefly-LLaMA2-Chinese: An Open-Source Chinese LLaMA2 Model
Introduction
Firefly-LLaMA2-Chinese is part of the Firefly project's ongoing effort to make low-resource incremental pretraining practical. In line with Firefly's goals, the project supports incremental pretraining of native Chinese models such as Baichuan2 and Qwen, as well as expanding the Chinese vocabulary of English models such as LLaMA2 and Falcon before carrying out incremental pretraining.
The team has released Firefly-LLaMA2-Chinese, a bilingual model series built on LLaMA2 with an expanded Chinese vocabulary. They incrementally pretrained the model on 22GB of Chinese and English data and further refined it with large-scale multi-turn dialogue instruction data. Evaluation results show competitive performance compared with existing open-source projects.
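The article does not spell out the training configuration, but low-resource training of this kind is commonly wired up with a parameter-efficient (LoRA-style) setup. The sketch below is a minimal illustration under that assumption, using Hugging Face transformers, peft, and datasets; the model id, corpus path, and hyperparameters are placeholders rather than the project's actual values.

```python
# Hypothetical sketch of parameter-efficient incremental pretraining,
# NOT the project's exact training script. Paths and hyperparameters are placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "meta-llama/Llama-2-7b-hf"           # base model id (placeholder)
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token   # LLaMA2 ships without a pad token
model = AutoModelForCausalLM.from_pretrained(base)

# Attach low-rank adapters so only a small fraction of the weights is trained,
# which is what makes training on a handful of GPUs feasible.
lora = LoraConfig(r=64, lora_alpha=16, lora_dropout=0.05,
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# Plain-text bilingual corpus (placeholder path), tokenized to fixed-length blocks.
data = load_dataset("text", data_files={"train": "zh_en_corpus.txt"})["train"]
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
                remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=2,
                           gradient_accumulation_steps=16, num_train_epochs=1,
                           learning_rate=2e-4, logging_steps=50),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The same skeleton can be reused for the instruction fine-tuning stage by swapping in the dialogue dataset and its collator; the project's released code is the authoritative reference for the actual procedure.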
The Firefly-LLaMA2-Chinese models have outperformed several notable models on CMMLU and the Open LLM Leaderboard. Remarkably, the team accomplished this with minimal resources, using at most four V100 GPUs for both incremental pretraining and instruction fine-tuning, a markedly more resource-efficient setup than comparable projects.
In addition to releasing model weights, the Firefly team has shared the full training code, the training data, and a detailed description of the training process, making the entire methodology open source.
Key Contributions
- The project expanded the LLaMA2 model's Chinese vocabulary, improving encoding and decoding efficiency and reducing Chinese sequence length by about 54.11% relative to the original tokenizer, which effectively increases the model's maximum context length for Chinese text (see the tokenizer sketch after this list).
- They performed incremental pretraining using extensive bilingual data and conducted multi-round instruction fine-tuning, releasing both Base and Chat model weights for 7B and 13B versions.
- The Firefly team collected and shared training datasets, including a 22GB Chinese-English pretraining corpus and multi-turn instruction data.
- The project released the full code for incremental pretraining, instruction fine-tuning, and evaluation, with support for LLaMA2 and other mainstream open-source models.
- The Firefly models underwent both leaderboard and human evaluations, including a purpose-built human evaluation set spanning 13 tasks.
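As a rough illustration of the vocabulary-expansion idea referenced above (not the project's actual procedure), the sketch below adds the pieces of a separately trained Chinese SentencePiece model to the LLaMA2 tokenizer, resizes the embedding matrix to match, and measures how much shorter a Chinese sample becomes. The path "chinese_sp.model" is a placeholder, and in practice vocabulary merging is usually done by editing the SentencePiece protobuf directly rather than via add_tokens.

```python
# Hypothetical sketch of Chinese vocabulary expansion, not the project's exact code.
# "chinese_sp.model" is a placeholder for a SentencePiece model trained on Chinese text.
import sentencepiece as spm
from transformers import AutoModelForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Collect Chinese pieces that the original LLaMA2 vocabulary does not contain.
sp = spm.SentencePieceProcessor(model_file="chinese_sp.model")
existing = set(tokenizer.get_vocab())
new_pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())
              if sp.id_to_piece(i) not in existing]

# Add them to the tokenizer and resize the model's embedding matrix accordingly.
tokenizer.add_tokens(new_pieces)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model.resize_token_embeddings(len(tokenizer))

# Compare sequence lengths before and after on a Chinese sample:
# fewer tokens per sentence means a longer effective context for Chinese text.
text = "萤火虫是一个开源的中文大语言模型项目。"
original = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
print(len(original.tokenize(text)), "->", len(tokenizer.tokenize(text)))
```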
Model and Data Overview
The project provides both 7B and 13B models in Base and Chat versions. The Base models are derived from LLaMA2 after Chinese vocabulary expansion and incremental pretraining, and they serve as the foundation for the Chat models, which add multi-turn dialogue fine-tuning. The team also experimented with fine-tuning different base models to observe the impact on instruction fine-tuning outcomes.
Key Models:
- Firefly-LLaMA2-7B/13B-Base
- Firefly-LLaMA2-7B/13B-Chat
- Firefly-Baichuan2-13B
Training datasets draw on a diverse range of open-source corpora as well as custom-built datasets, such as ancient Chinese poetry and prose.
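For the multi-turn dialogue data used in the Chat models, training samples are typically serialized so that the loss is computed only on assistant turns. The sketch below shows one generic way to do this; the field names, dialogue content, and masking scheme are illustrative assumptions, not the project's released data format.

```python
# Hypothetical illustration of turning one multi-turn dialogue into training tokens
# with the loss masked on user turns. Field names and content are assumptions,
# not the project's released data format.
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

dialogue = [
    {"role": "user", "content": "用一句话介绍萤火虫。"},
    {"role": "assistant", "content": "萤火虫是一种会发光的昆虫。"},
    {"role": "user", "content": "翻译成英文。"},
    {"role": "assistant", "content": "Fireflies are insects that glow."},
]

input_ids, labels = [], []
for turn in dialogue:
    ids = tokenizer.encode(turn["content"], add_special_tokens=False) + [tokenizer.eos_token_id]
    input_ids += ids
    # Only assistant tokens contribute to the loss; user tokens are masked with -100.
    labels += ids if turn["role"] == "assistant" else [-100] * len(ids)

print(len(input_ids), "tokens total,", sum(l != -100 for l in labels), "supervised")
```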
Model Evaluation
The evaluation combined objective benchmarks, namely CMMLU and the Open LLM Leaderboard, covering both Chinese and English abilities. In addition, a custom human evaluation dataset was assembled to give a more comprehensive view of performance, since standard benchmarks often fail to capture a model's nuances.
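Multiple-choice benchmarks such as CMMLU are commonly scored by comparing the log-likelihood the model assigns to each candidate answer. The sketch below illustrates that general recipe only; it is not the leaderboard's official harness, the model id is a placeholder, and the question shown is made up.

```python
# Generic log-likelihood scoring of a multiple-choice question, illustrating how
# CMMLU-style benchmarks are commonly evaluated. Not the official evaluation harness.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "firefly-llama2-13b-chat"   # placeholder: path or hub id of the evaluated model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

question = "中国的首都是哪座城市？"        # made-up example item
choices = ["北京", "上海", "广州", "深圳"]

@torch.no_grad()
def choice_logprob(prompt: str, answer: str) -> float:
    """Sum of log-probabilities the model assigns to the answer tokens."""
    prompt_ids = tokenizer.encode(prompt)
    full_ids = tokenizer.encode(prompt + answer)
    answer_len = len(full_ids) - len(prompt_ids)
    logits = model(torch.tensor([full_ids])).logits[0]
    logprobs = torch.log_softmax(logits, dim=-1)
    # The token at position i is predicted by the logits at position i - 1.
    return sum(logprobs[i - 1, full_ids[i]].item()
               for i in range(len(full_ids) - answer_len, len(full_ids)))

scores = [choice_logprob(f"问题：{question}\n答案：", c) for c in choices]
print(choices[scores.index(max(scores))])
```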
Open LLM Leaderboard Performance
Firefly-LLaMA2-13B-Chat scored close to established models such as LLaMA2-13B-Chat, indicating that it largely retains the base model's strong English abilities, and the 7B variant likewise performed well against comparable models.
CMMLU Ranking
The leading Firefly-Baichuan2-13B model secured a top-eight position on the OpenCompass CMMLU leaderboard, a significant improvement over many other models. A closer look suggests that the training setup and data strategy were efficient enough to reach results comparable to far more resource-intensive projects.
Human Evaluation
The human evaluation suite, comprising tasks such as brainstorming, classification, harmful-content detection, math, and translation, provided comparative insights and showed balanced performance across diverse linguistic challenges.
Conclusion
Firefly-LLaMA2-Chinese stands as a robust, accessible model series that achieves strong performance with a fraction of the resources typically required for model development, thanks to efficient incremental pretraining and thorough use of data. The project's fully open-source approach also invites broader community engagement and contributions, paving the way for further advances in multilingual language models.
The team's work continues: a detailed technical report is in preparation, and they encourage community interaction for future improvements and implementations.