GPT2-Chinese Project Overview
Introduction
The GPT2-Chinese project provides Chinese training code for GPT-2, using either a BERT tokenizer or a BPE tokenizer. It is inspired by the HuggingFace team's Transformers repository. The project can be used to write poems, news articles, and novels, or to train general language models, and it supports character-level, word-level, and BPE-level tokenization, making it suitable for large Chinese training corpora.
Project Updates
Update 04.11.2024
The project's creator notes that, following the release of ChatGPT, there has been renewed interest in GPT2-Chinese. Initially a project for self-learning PyTorch, it was not intended for long-term maintenance. However, the creator remains open to discussion with large language model (LLM) enthusiasts via email or the project's issue threads.
Update 02.06.2021
A significant update added several pretrained models: a general Chinese GPT-2 model, a smaller version, a Chinese lyrics model, and a Classical Chinese model. These models, trained using the UER-py project, are available on the Huggingface Model Hub. Users are instructed to prepend input text with a start symbol for correct text generation.
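As an illustration, the snippet below loads one of these checkpoints with the Transformers library; the model id uer/gpt2-chinese-cluecorpussmall and the use of [CLS] as the start symbol are assumptions based on the Huggingface Model Hub listings rather than code from this repository.

# Minimal sketch: generating text with one of the UER-trained checkpoints.
# The model id "uer/gpt2-chinese-cluecorpussmall" and the "[CLS]" start symbol
# are assumptions; swap in the lyrics or Classical Chinese checkpoint as needed.
from transformers import BertTokenizer, GPT2LMHeadModel, TextGenerationPipeline

model_id = "uer/gpt2-chinese-cluecorpussmall"
tokenizer = BertTokenizer.from_pretrained(model_id)
model = GPT2LMHeadModel.from_pretrained(model_id)
generator = TextGenerationPipeline(model, tokenizer)

# Prepend the start symbol to the prompt, as the project's notes recommend.
print(generator("[CLS]这是很久之前的事情了", max_length=100, do_sample=True))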
Update 11.03.2020
Pretrained models for Chinese poetry and couplets were introduced. These models, also available on the Huggingface Model Hub, require specific input formatting for text generation: the poetry model expects its prompt to begin with a start token, and the couplet model takes the first line of a couplet in a fixed format.
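A hypothetical example of these prompt formats is sketched below; the model ids (uer/gpt2-chinese-poem and uer/gpt2-chinese-couplet) and the exact prompt strings are assumptions drawn from their Huggingface model cards, not something this overview specifies.

# Hypothetical prompts for the poetry and couplet models; the model ids and the
# exact formatting (start token, space-separated characters, trailing "-")
# are assumptions based on the corresponding Huggingface model cards.
from transformers import BertTokenizer, GPT2LMHeadModel, TextGenerationPipeline

def load_generator(model_id):
    # Build a generation pipeline for a UER-trained GPT-2 checkpoint.
    tokenizer = BertTokenizer.from_pretrained(model_id)
    model = GPT2LMHeadModel.from_pretrained(model_id)
    return TextGenerationPipeline(model, tokenizer)

# Poetry: the prompt starts with the start token, characters separated by spaces.
poem = load_generator("uer/gpt2-chinese-poem")
print(poem("[CLS]梅 山 如 积 翠 ，", max_length=50, do_sample=True))

# Couplet: the first line ends with "-" so the model completes the matching line.
couplet = load_generator("uer/gpt2-chinese-couplet")
print(couplet("[CLS]丹 枫 江 冷 人 初 去 -", max_length=25, do_sample=True))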
Project Evolution and News
Back in November 2019, resources for GPT-2 in Chinese were scarce, but the situation has improved significantly. Although development on the project has ceased, the codebase remains stable. The code was initially written to practice using PyTorch and, despite potential rough edges, serves educational purposes.
Usage Instructions
To use GPT2-Chinese, users should create a data folder in the root directory and place their training corpus, named train.json, inside it. This file should contain a JSON list in which each element is the text of one article to be used for training. Running train.py with the --raw option preprocesses the data and then starts training automatically.
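As an illustration of this format, the sketch below builds a two-article train.json (with placeholder texts) and shows the corresponding training command:

# Sketch: write a minimal train.json in the expected format, i.e. a JSON list
# in which each element is the full text of one training article.
# The article strings here are placeholders.
import json
from pathlib import Path

articles = [
    "第一篇文章的正文……",
    "第二篇文章的正文……",
]

Path("data").mkdir(exist_ok=True)
Path("data/train.json").write_text(
    json.dumps(articles, ensure_ascii=False), encoding="utf-8"
)

# Then, from the repository root, preprocess and start training in one step:
#   python train.py --raw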
Text Generation
Users can generate text samples using a command such as:
python ./generate.py --length=50 --nsamples=4 --prefix=xxx --fast_pattern --save_samples --save_samples_path=/mnt/xx
Here, --prefix sets the opening text, --fast_pattern switches to a faster generation routine, and --save_samples together with --save_samples_path writes the generated samples to the specified directory instead of only printing them to the console.
Structure and Components
Key scripts in the project include generate.py for text generation and train.py for the training process. Other scripts such as eval.py evaluate model performance, while train.json serves as an example of the training sample format. The directory also contains tokenizers and auxiliary tools for vocabulary construction and corpus handling.
Considerations and Tips
The project adopts BERT's tokenizer to process Chinese characters. Depending on the tokenizer version, some preprocessing may be required before tokenization. Users with ample memory or a smaller corpus can modify the relevant scripts to simplify the preprocessing step.
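To see why this amounts to character-level handling of Chinese text, the following sketch tokenizes a short sentence with a BERT tokenizer; the bert-base-chinese vocabulary is used here as a stand-in, since the project ships its own vocabulary files.

# Illustration: a BERT-style tokenizer splits Chinese text one character per token.
# "bert-base-chinese" is used as a stand-in vocabulary; the project ships its own.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
print(tokenizer.tokenize("今天天气不错"))
# Expected output: ['今', '天', '天', '气', '不', '错'] -- one token per character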
Additional Resources
Corpora for training can be sourced from various public repositories, and the project supports FP16 training and gradient accumulation when the required libraries (such as NVIDIA's apex) are installed. For enthusiasts, several external projects and models build on GPT2-Chinese, providing platforms for generating dialogue or visualizing attention mechanisms.
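For readers unfamiliar with gradient accumulation, the generic PyTorch sketch below shows the idea (summing gradients over several small batches before one optimizer step); it is not the project's actual training loop, and the model, data, and hyperparameters are placeholders.

# Generic gradient-accumulation sketch (not the project's training loop):
# gradients from several small batches are accumulated before a single
# optimizer step, emulating a larger effective batch size on limited memory.
import torch

model = torch.nn.Linear(10, 2)                        # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()
accumulation_steps = 4

optimizer.zero_grad()
for step in range(100):
    x = torch.randn(8, 10)                            # placeholder batch
    y = torch.randint(0, 2, (8,))
    loss = loss_fn(model(x), y) / accumulation_steps  # scale so gradients average out
    loss.backward()                                   # accumulates into .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()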
Citing the Project
For academic work, the citation entry provided in the repository should be used so that the project's contributors and development are properly acknowledged.
Model Sharing
An array of trained models is available, covering general Chinese language, ancient texts, poetry, couplets, and lyrics, each trained on specific datasets. These models are made accessible for further experimentation and development.
Demo and Examples
A dedicated demo page showcases examples of generated texts, illustrating the project's capabilities in poetry and literary text generation. Enthusiastic users have contributed to these demos, offering insights into the model's application in artistic and literary contexts.