Introduction to the YAYI Large Model
YAYI, developed by Wenge Research, is an advanced large-scale model fine-tuned on millions of high-quality, structured domain data samples spanning fields such as media outreach, public opinion analysis, public safety, financial risk control, and urban governance. Through comprehensive pre-training and subsequent iteration, YAYI's foundational and analytical capabilities in Chinese have been progressively strengthened. The model also supports multi-turn dialogue and plugin capabilities, and continuous feedback from hundreds of users during internal testing has further improved its performance and safety.
The open-source release of the YAYI model seeks to bolster the open-source community within the Chinese pre-training model landscape, encouraging collaboration and ecosystem growth among its partners.
Model Release
Available Models
- YAYI-7B
  - Model Identifier: wenge-research/yayi-7b
  - Download
- YAYI-7B-Llama2
  - Model Identifier: wenge-research/yayi-7b-llama2
  - Download
- YAYI-13B-Llama2
  - Model Identifier: wenge-research/yayi-13b-llama2
  - Download
Model Deployment
Environment Setup
- Clone the repository to your server:
  git clone https://github.com/wenge-research/YAYI.git
  cd YAYI
- Create a conda environment:
  conda create --name yayi python=3.8
  conda activate yayi
- Install dependencies:
  pip install -r requirements.txt
Inference
The model weights for the yayi-7b version are available in the Hugging Face model repository. The following code snippet runs inference on a single GPU such as an A100, A800, or 3090 and uses roughly 20 GB of GPU memory at half precision (the snippet loads the weights in bfloat16):
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
import torch
yayi_7b_path = "wenge-research/yayi-7b"
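# Load the tokenizer and model; device_map="auto" places the weights on the available GPU(s)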
tokenizer = AutoTokenizer.from_pretrained(yayi_7b_path)
model = AutoModelForCausalLM.from_pretrained(yayi_7b_path, device_map="auto", torch_dtype=torch.bfloat16)
prompt = "Hello"
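# Wrap the user prompt in YAYI's chat template: a system message followed by human and assistant turns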
formatted_prompt = f"<|System|>:\nA chat between a human and an AI assistant named YaYi.\nYaYi is a helpful and harmless language model developed by Beijing Wenge Technology Co.,Ltd.\n\n<|Human|>:\n{prompt}\n\n<|YaYi|>:"
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)
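# The <|End|> token marks the end of a YAYI response; its id is used to stop generation and as the pad token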
eos_token_id = tokenizer("<|End|>").input_ids[0]
generation_config = GenerationConfig(
    eos_token_id=eos_token_id,
    pad_token_id=eos_token_id,
    do_sample=True,
    max_new_tokens=100,
    temperature=0.3,
    repetition_penalty=1.1,
    no_repeat_ngram_size=0
)
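# Generate a response with sampling and decode it back to text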
response = model.generate(**inputs, generation_config=generation_config)
print(tokenizer.decode(response[0]))
Fine-Tuning
The project leverages the deepspeed framework for model training. Once the environment is configured, the corresponding script can be executed to start training; a hedged example launch command is sketched after the list below. Three modes are supported: full-parameter fine-tuning on instruction data, LoRA fine-tuning, and fine-tuning on multi-round dialogue data.
- Instruction Data Full-Parameter Fine-Tuning: uses JSON-formatted data with "instruction", "input", and "output" fields. Running the specified command starts fine-tuning, and multi-GPU configurations are supported; a hedged sketch of the data format follows this list.
- LoRA Fine-Tuning: a resource-efficient approach that runs on a single GPU and makes training larger models practical by adjusting the lora-dim and lora-module-name settings; see the LoRA launch sketch after this list.
- Multi-Round Dialogue Data Fine-Tuning: also organized in JSON format, this mode targets multi-turn conversation data and supports efficient multi-GPU training; an illustrative format sketch follows this list.
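As a hedged sketch of the instruction data format, one record might look like the following; the three field names come from the description above, while the surrounding layout (a JSON array of records) and the example content are assumptions rather than the repository's exact specification:

[
  {
    "instruction": "Summarize the main risk points in the following announcement.",
    "input": "Example text of a financial announcement to be analyzed.",
    "output": "An example summary of the announcement's main risk points."
  }
]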
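For the launch command itself, this section only states that training runs through deepspeed; the script name, data path, and argument names below are hypothetical placeholders meant to show the general shape of a multi-GPU full-parameter run, not the repository's actual entry point:

# Hypothetical full-parameter fine-tuning launch; the script and flag names are illustrative placeholders.
deepspeed --num_gpus=8 train.py \
    --model_name_or_path wenge-research/yayi-7b \
    --data_path data/instruction_data.json \
    --output_dir output/yayi-7b-sft \
    --deepspeed ds_config.json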
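For LoRA fine-tuning, only the lora-dim and lora-module-name settings are named above; the flag spellings and values below are assumptions that mirror those names, shown as a single-GPU variant of the same hypothetical launch:

# Hypothetical single-GPU LoRA launch; --lora_dim and --lora_module_name mirror the settings named above
# but are not verified against the repository, and query_key_value is only an example target module.
deepspeed --num_gpus=1 train.py \
    --model_name_or_path wenge-research/yayi-7b \
    --data_path data/instruction_data.json \
    --output_dir output/yayi-7b-lora \
    --deepspeed ds_config.json \
    --lora_dim 16 \
    --lora_module_name query_key_value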
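The multi-round dialogue data is also JSON, but its schema is not spelled out here; the structure below is purely illustrative, and every field name in it ("system", "conversations", "role", "content") is an assumption:

[
  {
    "system": "A chat between a human and the AI assistant YaYi.",
    "conversations": [
      {"role": "human", "content": "Example first user turn."},
      {"role": "yayi", "content": "Example first assistant reply."},
      {"role": "human", "content": "Example follow-up question."},
      {"role": "yayi", "content": "Example second assistant reply."}
    ]
  }
]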
Training Dataset
YAYI is trained on a dataset of hundreds of thousands of domain-specific instructions, primarily covering finance, security, public opinion, and media, and including safety-enhancement and plugin-capability data.
Licensing & Limitations
Limitations
Current limitations of YAYI include possible inaccuracies when responding to fact-based instructions, inadequate identification of harmful instructions, and weaker performance in logical reasoning, code generation, and scientific computation.
Disclaimer
The YAYI model is open-sourced for research purposes only and must not be used for commercial purposes or any activities that might harm society. The code, data, and models associated with the YAYI project are freely available under the Apache-2.0 license (code), CC BY-NC 4.0 (data), and a dedicated model license (model weights), respectively.
Acknowledgments
The YAYI initiative utilizes components from BigScience's bloomz-7b1-mt and Meta's Llama 2, as well as training codebases such as Databricks' dolly and Hugging Face transformers, alongside distributed training tools like Microsoft's DeepSpeed.