Introduction to YAYI 2
YAYI 2 is the latest generation of open-source large language models developed by Wenge Research. The project includes Base and Chat versions, each with 30 billion parameters. Built on the Transformer architecture, YAYI2-30B was pre-trained on a corpus of more than 2 trillion high-quality, multilingual tokens. To serve both general and domain-specific applications, the model was further fine-tuned on millions of instructions and aligned with human values through reinforcement learning from human feedback.
The YAYI2-30B Base model has been released as open source to help advance the Chinese pre-training model community, and partners are invited to collaborate and contribute to the YAYI language model ecosystem. For in-depth technical details, the project team has published a technical report on arXiv.
Dataset Access
YAYI 2 uses an extensive pre-training dataset. The details are as follows:
- Dataset Name: YAYI2 Pretrain Data
- Size: 500GB
- Hugging Face Dataset Identifier: wenge-research/yayi2_pretrain_data
- Download Links: Available on Hugging Face and ModelScope
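For programmatic access, a minimal sketch using the huggingface_hub client is shown below; the local directory name is an arbitrary choice, and keep in mind that the full snapshot is roughly 500GB.

# Minimal sketch: fetch the YAYI 2 pre-training data from the Hugging Face Hub.
# Note: the full snapshot is roughly 500GB; the local_dir path is an arbitrary choice.
from huggingface_hub import snapshot_download

data_dir = snapshot_download(
    repo_id="wenge-research/yayi2_pretrain_data",
    repo_type="dataset",  # this repository is a dataset, not a model
    local_dir="./yayi2_pretrain_data",
)
print(f"Dataset files downloaded to: {data_dir}")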
Model Access
YAYI 2 offers two models:
- YAYI2-30B: The Base model with a context length of 4096. Available on Hugging Face and ModelScope.
- YAYI2-30B-Chat: A chat-oriented model with the same context length, planned for release soon.
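For the released Base model, the weights can likewise be pre-downloaded so that later from_pretrained calls resolve from the local copy; a minimal sketch is below (the local directory name is an arbitrary choice).

# Minimal sketch: pre-download the released YAYI2-30B Base weights.
from huggingface_hub import snapshot_download

model_dir = snapshot_download(
    repo_id="wenge-research/yayi2-30b",
    local_dir="./yayi2-30b",  # arbitrary local path
)
print(f"Model weights downloaded to: {model_dir}")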
Evaluation Results
YAYI 2 has been rigorously tested on various benchmark datasets, such as C-Eval, MMLU, CMMLU, and others. The evaluations cover areas like language comprehension, subject knowledge, math reasoning, logic reasoning, and code generation. The YAYI 2 model outperforms many similarly sized open-source models.
Evaluations were carried out with the OpenCompass framework, and results for comparable open-source models such as MPT, Falcon, and LLaMA 2 are reported as baselines.
Inference
A simple example is provided to demonstrate how to perform inference with YAYI2-30B. This setup can be run on a single A100/A800 GPU.
Environment Setup
- Clone the repository:
  git clone https://github.com/wenge-research/YAYI2.git
  cd YAYI2
- Create a conda environment:
  conda create --name yayi_inference_env python=3.8
  conda activate yayi_inference_env
  Note: Python 3.8 or higher is required.
- Install dependencies:
  pip install transformers==4.33.1
  pip install torch==2.0.1
  pip install sentencepiece==0.1.99
  pip install accelerate==0.25.0
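As a quick sanity check (not part of the original setup steps), the following snippet confirms that the pinned versions are installed and that a GPU is visible:

# Quick environment check: confirm pinned library versions and GPU visibility.
import torch
import transformers
import sentencepiece
import accelerate

print("transformers:", transformers.__version__)    # expected 4.33.1
print("torch:", torch.__version__)                   # expected 2.0.1
print("sentencepiece:", sentencepiece.__version__)   # expected 0.1.99
print("accelerate:", accelerate.__version__)         # expected 0.25.0
print("CUDA available:", torch.cuda.is_available())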
Base Model Inference Code
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and the base model; weights are fetched from the Hub on first use.
tokenizer = AutoTokenizer.from_pretrained("wenge-research/yayi2-30b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("wenge-research/yayi2-30b", device_map="auto", trust_remote_code=True)

# Tokenize the prompt and move the input tensors to the GPU.
inputs = tokenizer('The winter in Beijing is', return_tensors='pt')
inputs = inputs.to('cuda')

# Sample a continuation of up to 256 new tokens.
pred = model.generate(
    **inputs,
    max_new_tokens=256,
    eos_token_id=tokenizer.eos_token_id,
    do_sample=True,
    repetition_penalty=1.2,
    temperature=0.4,
    top_k=100,
    top_p=0.8
)

# Decode and print the generated text, dropping special tokens.
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
Note that the initial access and loading of the model could take some time.
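Depending on the checkpoint's storage precision, loading the 30 billion parameters in full precision may exceed a single 80GB card. One common adjustment, not shown in the original example, is to load the weights in bfloat16; a minimal variant of the loading step is sketched below, assuming the half-precision accuracy trade-off is acceptable for your use case.

# Variant of the loading step above: load the weights in bfloat16 so that the
# 30B parameters fit comfortably within a single 80GB A100/A800.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("wenge-research/yayi2-30b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "wenge-research/yayi2-30b",
    device_map="auto",
    torch_dtype=torch.bfloat16,  # half-precision weights (assumption: acceptable accuracy trade-off)
    trust_remote_code=True,
)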
Model Fine-Tuning
The project supports instruction fine-tuning with the DeepSpeed distributed training framework, in either full-parameter or LoRA mode.
Environment Setup
- Create a conda environment:
  conda create --name yayi_train_env python=3.10
  conda activate yayi_train_env
- Install dependencies:
  pip install -r requirements.txt
- Install accelerate:
  pip install --upgrade accelerate
- Install flash-attention:
  pip install flash-attn==2.0.3 --no-build-isolation
  pip install triton==2.0.0.dev20221202 --no-deps
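As a quick check that the training environment is usable (assuming DeepSpeed is pulled in via requirements.txt), the following imports should succeed:

# Quick check that the key training dependencies import correctly.
import torch
import deepspeed
import flash_attn

print("torch:", torch.__version__)
print("deepspeed:", deepspeed.__version__)
print("flash-attn:", flash_attn.__version__)
print("CUDA available:", torch.cuda.is_available())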
Full Parameter Training
- Data Format: Refer to data/yayi_train_example.json. It is a standard JSON file in which each entry consists of a "system" field and a "conversations" field (a sketch of this format appears at the end of this section).
- Run Instructions: Full-parameter fine-tuning of the YAYI model can be launched with the following command, which targets multi-node, multi-GPU training on at least 16 x A100 (80G):
  deepspeed --hostfile config/hostfile \
      --module training.trainer_yayi2 \
      --report_to "tensorboard" \
      --data_path "./data/yayi_train_example.json" \
      --model_name_or_path "your_model_path" \
      --output_dir "./output" \
      --model_max_length 2048 \
      --num_train_epochs 1 \
      --per_device_train_batch_size 1 \
      --gradient_accumulation_steps 1 \
      --evaluation_strategy "no" \
      --save_strategy "steps" \
      --save_steps 500 \
      --save_total_limit 10 \
      --learning_rate 5e-6 \
      --warmup_steps 2000 \
      --lr_scheduler_type cosine \
      --logging_steps 1 \
      --gradient_checkpointing True \
      --deepspeed "./config/deepspeed.json" \
      --bf16 True
  Alternatively, training can be started with the provided script:
  bash scripts/start.sh
  To use the ChatML template for instruction tuning, replace --module training.trainer_yayi2 with --module training.trainer_chatml.
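The exact schema is defined by data/yayi_train_example.json; the sketch below only illustrates the general shape implied above (a "system" field plus a list of "conversations"), and the role/content key names inside each turn are assumptions, so consult the example file for the authoritative format.

# Illustrative sketch of a training record with a "system" field and a
# "conversations" field. The key names inside each turn ("role"/"content")
# are assumptions; data/yayi_train_example.json is the authoritative reference.
import json

example_record = {
    "system": "You are a helpful assistant.",
    "conversations": [
        {"role": "user", "content": "Briefly introduce Beijing."},
        {"role": "assistant", "content": "Beijing is the capital of China ..."},
    ],
}

with open("my_train_data.json", "w", encoding="utf-8") as f:
    json.dump([example_record], f, ensure_ascii=False, indent=2)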
LoRA Fine-Tuning
- Data Format: Same as above; refer to data/yayi_train_example_multi_rounds.json.
- Run Instructions: Start LoRA fine-tuning with:
  bash scripts/start_lora.sh
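scripts/start_lora.sh drives the project's own LoRA configuration. Purely as an illustration of what LoRA fine-tuning does on top of a frozen base model, here is a generic sketch using the peft library (assumed to be installed); the target module names and hyperparameters are assumptions rather than values taken from the project, and should be checked against the model's actual layer names printed by the inspection loop.

# Generic LoRA illustration using the peft library; this is NOT the project's
# actual configuration, which lives behind scripts/start_lora.sh.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained(
    "wenge-research/yayi2-30b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Inspect linear-layer names to decide which projections to adapt.
for name, module in base_model.named_modules():
    if isinstance(module, torch.nn.Linear):
        print(name)

lora_config = LoraConfig(
    r=8,                                  # low-rank dimension (assumed value)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed names; adjust to the printed layer names
    task_type="CAUSAL_LM",
)

# Only the small adapter matrices are trainable; the 30B base stays frozen.
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()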
For more extensive details on pre-training data, refer to the project's documentation.