Introduction to the Linly Project
Overview
The Linly project provides the community with large-scale Chinese language models, including the Linly-ChatFlow conversation model, the Chinese-LLaMA (1 and 2) and Chinese-Falcon foundation models, and their training datasets. Built on LLaMA and Falcon as base architectures, these models are extended to Chinese through incremental pre-training on bilingual Chinese-English corpora.
Key Models and Features
Linly-70B Model
Developed in collaboration with APUS, the Linly-70B model achieves strong results on benchmarks including ARC, HellaSwag, MMLU, and C-Eval, demonstrating the project's ability to handle complex language tasks.
Chinese LLaMA and Chinese-Falcon Models
These foundation models build on LLaMA and Falcon and undergo incremental pre-training on bilingual data to strengthen their Chinese language abilities. Linly-ChatFlow, the project's conversation model, is then obtained through large-scale instruction-following training on aggregated multilingual instruction data.
Linly-OpenLLaMA Models
Linly-OpenLLaMA models, available in 3B, 7B, and 13B sizes, are trained from scratch with a tokenizer optimized for Chinese. Pre-trained on a 1 TB bilingual corpus, these models are released under the Apache 2.0 license, which permits commercial use.
Project Goals
- Full-tuning Models: The project supports full-parameter training of models like Chinese-LLaMA and Chinese-Falcon, with versions available through TencentPretrain and HuggingFace.
- Model Reproducibility: Offers transparency by openly documenting model details and releasing complete code for data preparation, model training, and evaluation.
- Deployment and Optimization: Implements multiple quantization strategies for inference on both CUDA GPUs and edge devices.
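The project's actual quantization code is more involved, but the core idea behind the int8 strategies mentioned above can be illustrated with a minimal sketch (the function names here are illustrative, not Linly's API): map each float weight onto a signed 8-bit integer grid via a single scale factor, trading a small reconstruction error for a 4x reduction in memory.

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: map floats onto integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the int8 values."""
    return [q * scale for q in quantized]

weights = [0.5, -1.2, 0.03, 1.2]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# each restored value is within half a quantization step of the original
assert all(abs(w - r) <= scale / 2 for w, r in zip(weights, restored))
```

Production schemes refine this basic recipe (per-channel scales, outlier handling, calibration), but the memory/accuracy trade-off shown here is what makes edge-device deployment feasible.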
Model Licensing and Availability
Linly's Chinese-Falcon and OpenLLaMA models are released under the Apache License 2.0, permitting commercial use. The Chinese-LLaMA models, by contrast, are available for research purposes only, under the GNU General Public License v3.0.
Model Applications and Examples
Linly's models excel in various applications, including information extraction, code generation, and question answering. A demo of Linly-ChatFlow is available online, allowing users to experience the system's capabilities interactively.
Getting Started
Beginners can access detailed guides for model downloading and deployment. The project provides resources for deploying the models locally or utilizing online platforms like HuggingFace, along with comprehensive training details for understanding model functionality and improvement.
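As a sketch of what local deployment via HuggingFace might look like, the snippet below loads a checkpoint with the `transformers` library and generates a reply. Both the model id and the prompt template are placeholders, not confirmed by this document; check the Linly repository and its HuggingFace organization for the actual checkpoint names and the exact prompt format Linly-ChatFlow expects.

```python
def build_prompt(user_message: str) -> str:
    """Wrap a user message in a simple chat template.
    NOTE: placeholder format -- consult the Linly docs for the real one."""
    return f"User: {user_message}\nBot: "

def chat(model_id: str, user_message: str, max_new_tokens: int = 128) -> str:
    """Download a checkpoint from the HuggingFace Hub and generate a reply.
    The model id passed in is illustrative; see the project's HuggingFace
    page for released checkpoints."""
    # imported lazily: transformers is a heavy dependency
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    inputs = tokenizer(build_prompt(user_message), return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Example (downloads weights and needs substantial GPU/CPU memory):
# print(chat("Linly-AI/Chinese-LLaMA-2-7B", "用Python写一个快速排序"))
```

The same pattern applies to the base and conversation models alike; only the checkpoint name and prompt template change.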
Future Developments
The Linly project is under active development, with ongoing model iterations and updates. The community can expect continued improvements in model performance and expanded capabilities as Chinese-language LLM development advances.