Introduction to YAYI 2
YAYI 2 is the latest generation of open-source large language models developed by Wenge Research. The project includes Base and Chat versions, each with 30 billion parameters. Built on the Transformer architecture, YAYI2-30B was pre-trained on a corpus of more than 2 trillion high-quality, multilingual tokens. To serve both general and domain-specific applications, the model was further fine-tuned on millions of instructions and aligned with human values through reinforcement learning from human feedback.
The YAYI2-30B Base model has been released as open source to help advance the Chinese pre-training model community, and partners are invited to collaborate and contribute to the YAYI language model ecosystem. For in-depth technical details, the project team has published a technical report on arXiv.
Dataset Access
YAYI 2 uses an extensive pre-training dataset. The details are as follows:
- Dataset Name: YAYI2 Pretrain Data
- Size: 500GB
- Hugging Face Dataset Identifier: wenge-research/yayi2_pretrain_data
- Download Links: Available on Hugging Face and ModelScope
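For programmatic access, a minimal sketch using the huggingface_hub client is shown below; the local directory name is an arbitrary choice, and keep in mind that the full snapshot is roughly 500GB.

# Minimal sketch: fetch the YAYI 2 pre-training data from the Hugging Face Hub.
# Note: the full snapshot is roughly 500GB; the local_dir path is an arbitrary choice.
from huggingface_hub import snapshot_download

data_dir = snapshot_download(
    repo_id="wenge-research/yayi2_pretrain_data",
    repo_type="dataset",  # this repository is a dataset, not a model
    local_dir="./yayi2_pretrain_data",
)
print(f"Dataset files downloaded to: {data_dir}")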
Model Access
YAYI 2 offers two models:
- YAYI2-30B: The Base model with a context length of 4096. Available on Hugging Face and ModelScope.
- YAYI2-30B-Chat: A chat-oriented model with the same context length, planned for release soon.
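For the released Base model, the weights can likewise be pre-downloaded so that later from_pretrained calls resolve from the local copy; a minimal sketch is below (the local directory name is an arbitrary choice).

# Minimal sketch: pre-download the released YAYI2-30B Base weights.
from huggingface_hub import snapshot_download

model_dir = snapshot_download(
    repo_id="wenge-research/yayi2-30b",
    local_dir="./yayi2-30b",  # arbitrary local path
)
print(f"Model weights downloaded to: {model_dir}")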
Evaluation Results
YAYI 2 has been rigorously tested on various benchmark datasets, such as C-Eval, MMLU, CMMLU, and others. The evaluations cover areas like language comprehension, subject knowledge, math reasoning, logic reasoning, and code generation. The YAYI 2 model outperforms many similarly sized open-source models.
Evaluations were carried out with the OpenCompass framework, and results for comparable open-source models such as MPT, Falcon, and LLaMA 2 are reported as baselines.
Inference
A simple example is provided to demonstrate how to perform inference with YAYI2-30B. This setup can be run on a single A100/A800 GPU.
Environment Setup
- Clone the repository:
  git clone https://github.com/wenge-research/YAYI2.git
  cd YAYI2
- Create a conda environment:
  conda create --name yayi_inference_env python=3.8
  conda activate yayi_inference_env
  Note: Python 3.8 or higher is required.
- Install dependencies:
  pip install transformers==4.33.1
  pip install torch==2.0.1
  pip install sentencepiece==0.1.99
  pip install accelerate==0.25.0
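As a quick sanity check (not part of the original setup steps), the following snippet confirms that the pinned versions are installed and that a GPU is visible:

# Quick environment check: confirm pinned library versions and GPU visibility.
import torch
import transformers
import sentencepiece
import accelerate

print("transformers:", transformers.__version__)    # expected 4.33.1
print("torch:", torch.__version__)                   # expected 2.0.1
print("sentencepiece:", sentencepiece.__version__)   # expected 0.1.99
print("accelerate:", accelerate.__version__)         # expected 0.25.0
print("CUDA available:", torch.cuda.is_available())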
Base Model Inference Code
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and the base model; weights are fetched from the Hub on first use.
tokenizer = AutoTokenizer.from_pretrained("wenge-research/yayi2-30b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("wenge-research/yayi2-30b", device_map="auto", trust_remote_code=True)

# Tokenize the prompt and move the input tensors to the GPU.
inputs = tokenizer('The winter in Beijing is', return_tensors='pt')
inputs = inputs.to('cuda')

# Sample a continuation of up to 256 new tokens.
pred = model.generate(
    **inputs,
    max_new_tokens=256,
    eos_token_id=tokenizer.eos_token_id,
    do_sample=True,
    repetition_penalty=1.2,
    temperature=0.4,
    top_k=100,
    top_p=0.8
)

# Decode and print the generated text, dropping special tokens.
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
Note that the initial access and loading of the model could take some time.
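Depending on the checkpoint's storage precision, loading the 30 billion parameters in full precision may exceed a single 80GB card. One common adjustment, not shown in the original example, is to load the weights in bfloat16; a minimal variant of the loading step is sketched below, assuming the half-precision accuracy trade-off is acceptable for your use case.

# Variant of the loading step above: load the weights in bfloat16 so that the
# 30B parameters fit comfortably within a single 80GB A100/A800.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("wenge-research/yayi2-30b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "wenge-research/yayi2-30b",
    device_map="auto",
    torch_dtype=torch.bfloat16,  # half-precision weights (assumption: acceptable accuracy trade-off)
    trust_remote_code=True,
)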
Model Fine-Tuning
The project supports instruction fine-tuning with the DeepSpeed distributed training framework, in either full-parameter or LoRA mode.
Environment Setup
- Create a conda environment:
  conda create --name yayi_train_env python=3.10
  conda activate yayi_train_env
- Install dependencies:
  pip install -r requirements.txt
- Install accelerate:
  pip install --upgrade accelerate
- Install flash-attention:
  pip install flash-attn==2.0.3 --no-build-isolation
  pip install triton==2.0.0.dev20221202 --no-deps
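As a quick check that the training environment is usable (assuming DeepSpeed is pulled in via requirements.txt), the following imports should succeed:

# Quick check that the key training dependencies import correctly.
import torch
import deepspeed
import flash_attn

print("torch:", torch.__version__)
print("deepspeed:", deepspeed.__version__)
print("flash-attn:", flash_attn.__version__)
print("CUDA available:", torch.cuda.is_available())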
Full Parameter Training
- Data Format: Refer to data/yayi_train_example.json. It is a standard JSON file in which each entry consists of a "system" field and a "conversations" field (a sketch of this format appears at the end of this section).
- Run Instructions: Full-parameter fine-tuning of the YAYI model can be launched with the following command, which targets multi-node, multi-GPU training on at least 16 x A100 (80G):
  deepspeed --hostfile config/hostfile \
      --module training.trainer_yayi2 \
      --report_to "tensorboard" \
      --data_path "./data/yayi_train_example.json" \
      --model_name_or_path "your_model_path" \
      --output_dir "./output" \
      --model_max_length 2048 \
      --num_train_epochs 1 \
      --per_device_train_batch_size 1 \
      --gradient_accumulation_steps 1 \
      --evaluation_strategy "no" \
      --save_strategy "steps" \
      --save_steps 500 \
      --save_total_limit 10 \
      --learning_rate 5e-6 \
      --warmup_steps 2000 \
      --lr_scheduler_type cosine \
      --logging_steps 1 \
      --gradient_checkpointing True \
      --deepspeed "./config/deepspeed.json" \
      --bf16 True
  Alternatively, training can be started with the provided script:
  bash scripts/start.sh
  To use the ChatML template for instruction tuning, replace --module training.trainer_yayi2 with --module training.trainer_chatml.
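The exact schema is defined by data/yayi_train_example.json; the sketch below only illustrates the general shape implied above (a "system" field plus a list of "conversations"), and the role/content key names inside each turn are assumptions, so consult the example file for the authoritative format.

# Illustrative sketch of a training record with a "system" field and a
# "conversations" field. The key names inside each turn ("role"/"content")
# are assumptions; data/yayi_train_example.json is the authoritative reference.
import json

example_record = {
    "system": "You are a helpful assistant.",
    "conversations": [
        {"role": "user", "content": "Briefly introduce Beijing."},
        {"role": "assistant", "content": "Beijing is the capital of China ..."},
    ],
}

with open("my_train_data.json", "w", encoding="utf-8") as f:
    json.dump([example_record], f, ensure_ascii=False, indent=2)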
LoRA Fine-Tuning
- Data Format: Same as above; refer to data/yayi_train_example_multi_rounds.json.
- Run Instructions: Start LoRA fine-tuning with:
  bash scripts/start_lora.sh
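scripts/start_lora.sh drives the project's own LoRA configuration. Purely as an illustration of what LoRA fine-tuning does on top of a frozen base model, here is a generic sketch using the peft library (assumed to be installed); the target module names and hyperparameters are assumptions rather than values taken from the project, and should be checked against the model's actual layer names printed by the inspection loop.

# Generic LoRA illustration using the peft library; this is NOT the project's
# actual configuration, which lives behind scripts/start_lora.sh.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained(
    "wenge-research/yayi2-30b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Inspect linear-layer names to decide which projections to adapt.
for name, module in base_model.named_modules():
    if isinstance(module, torch.nn.Linear):
        print(name)

lora_config = LoraConfig(
    r=8,                                  # low-rank dimension (assumed value)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed names; adjust to the printed layer names
    task_type="CAUSAL_LM",
)

# Only the small adapter matrices are trainable; the 30B base stays frozen.
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()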
For more extensive details on pre-training data, refer to the project's documentation.