CDial-GPT Project Introduction
CDial-GPT provides a large-scale Chinese conversation dataset together with dialogue models pre-trained on it, specifically tailored for Chinese conversation. The project builds on several existing open-source frameworks and libraries.
Background
The project builds upon the TransferTransfo code base from HuggingFace and uses the PyTorch version of the HuggingFace Transformers library, which supports both pre-training and fine-tuning of the models.
Dataset Overview
CDial-GPT provides the LCCC (Large-scale Cleaned Chinese Conversation) dataset. This dataset is divided into two main categories:
- LCCC-base: A rigorously filtered subset built from Weibo conversations, covering both single-turn and multi-turn dialogues.
- LCCC-large: An extension of LCCC-base that incorporates additional open-source dialogue datasets to broaden the variety of conversations.
For both datasets, detailed statistics on dialogues, total characters, vocabulary size, and more are provided. Notably, the dialogue cleaning process in LCCC-base is more stringent than in LCCC-large, resulting in a smaller but higher quality subset.
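As an illustration of the data layout, the released LCCC files are JSON, with each dialogue stored as a list of utterance strings whose tokens are separated by spaces. The sample dialogues below are invented for illustration, and the exact released schema should be checked against the project's documentation:

```python
import json

# Hypothetical sample in the layout described above: a list of dialogues,
# each a list of utterance strings with space-separated tokens.
sample = json.dumps([
    ["你 今天 吃饭 了 吗", "吃 了 , 你 呢"],
    ["周末 去 哪里 玩", "打算 去 爬山", "带 上 我 一个"],
])

dialogs = json.loads(sample)
num_dialogs = len(dialogs)
num_utterances = sum(len(d) for d in dialogs)
num_tokens = sum(len(u.split()) for d in dialogs for u in d)
print(num_dialogs, num_utterances, num_tokens)
```

Statistics like the ones reported for LCCC (dialogue count, utterance count, token count) can be computed by exactly this kind of pass over the JSON.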
Pre-trained Models
The CDial-GPT project also features several pre-trained Chinese dialogue models:
- GPT_Novel: Pre-trained on a Chinese novel dataset; it serves as the starting point for the subsequent models.
- CDial-GPT_LCCC-base and CDial-GPT2_LCCC-base: GPT and GPT2 variants further trained on the LCCC-base dataset.
- CDial-GPT_LCCC-large: Further trained on the larger LCCC-large dataset.
The models are built by first pre-training on a substantial corpus of Chinese novels and then continuing training on the LCCC datasets. This two-stage approach helps the models handle the nuances of Chinese dialogue.
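Dialogue models in the TransferTransfo lineage typically flatten a multi-turn history into a single token sequence with alternating speaker tokens. The sketch below illustrates that idea; the token names ([CLS], [speaker1], [speaker2]) follow the TransferTransfo convention and are assumptions here, not taken verbatim from the CDial-GPT code:

```python
def build_input(history):
    """Flatten a multi-turn dialogue into one token sequence,
    alternating a speaker token before each turn (TransferTransfo-style
    sketch; token names are assumptions, not the project's exact vocab)."""
    tokens = ["[CLS]"]
    for i, utterance in enumerate(history):
        speaker = "[speaker1]" if i % 2 == 0 else "[speaker2]"
        tokens.append(speaker)
        tokens.extend(utterance.split())
    return tokens

seq = build_input(["你 好", "你 好 , 很 高兴 认识 你"])
print(seq)
```

In the real models, the speaker identity is additionally encoded through token-type embeddings, so the model can distinguish the two sides of the conversation even inside one flat sequence.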
Getting Started
The project provides instructions for installation from source, along with steps for preparing the datasets and models for fine-tuning. Users can download the required datasets and start training with the provided scripts on either a single GPU or a distributed multi-GPU setup.
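Loading one of the released checkpoints might look like the following sketch. It assumes the models are published on the HuggingFace model hub under an identifier like the one shown, and that the GPT-style model pairs with a BERT-style (character-level Chinese) tokenizer; verify both against the project README before use:

```python
def load_cdial_gpt(name="thu-coai/CDial-GPT_LCCC-base"):
    """Load a pre-trained checkpoint. The model identifier above is an
    assumption; check the project README for the released names."""
    # Imported lazily so the sketch reads standalone without the
    # transformers package installed.
    from transformers import OpenAIGPTLMHeadModel, BertTokenizer

    # The model is a GPT-style language model, while the vocabulary is
    # BERT-style character-level Chinese, hence the mixed pairing.
    tokenizer = BertTokenizer.from_pretrained(name)
    model = OpenAIGPTLMHeadModel.from_pretrained(name)
    return tokenizer, model
```

After loading, the tokenizer encodes a flattened dialogue history and the model generates the response autoregressively, as driven by the project's training and interaction scripts.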
Interaction is supported through both a command-line interface and provided scripts, which let users generate responses interactively or on specified test sets.
Evaluation
Evaluation is a key facet of the CDial-GPT project, leveraging automatic metrics to gauge model performance. Metrics such as perplexity (PPL), BLEU, and distinct n-gram ratios (Dist-n) are used alongside embedding-based measures such as Greedy Matching and Embedding Average to assess model outputs.
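Two of these metrics are simple enough to state in a few lines. Dist-n is the ratio of unique n-grams to total n-grams across the generated responses (higher means more diverse output), and perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below implements both from first principles, independent of the project's own evaluation scripts:

```python
import math

def distinct_n(texts, n):
    """Dist-n: unique n-grams divided by total n-grams over tokenised outputs."""
    ngrams = [tuple(t[i:i + n]) for t in texts for i in range(len(t) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

def perplexity(nlls):
    """PPL = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(nlls) / len(nlls))

# Two toy generated responses, already tokenised.
outputs = [["我", "也", "是"], ["我", "也", "喜欢", "你"]]
print(distinct_n(outputs, 1))  # 5 unique unigrams out of 7 total
print(perplexity([2.0, 2.0, 2.0]))  # exp(2.0)
```

BLEU and the embedding-based metrics (Greedy Matching, Embedding Average) additionally require references and pre-trained word vectors, so they are usually computed with dedicated evaluation tooling rather than inline like this.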
In summary, CDial-GPT is a sophisticated and well-rounded platform for those interested in working with Chinese dialogue systems. It combines large, meticulously cleaned datasets with robust, pre-trained models, offering an all-in-one framework for Chinese language dialogue model development and evaluation.