Introduction to the CLUE Project
The CLUE (Chinese Language Understanding Evaluation) benchmark is a comprehensive toolkit for evaluating how well models understand Chinese. It bundles a collection of datasets, baseline (pre-trained) models, corpora, and a leaderboard, all tailored to gauging a model's performance at processing and understanding the Chinese language.
What is CLUE?
At its core, the CLUE benchmark pairs a series of representative tasks with specific datasets, selected to cover a range of task types, data volumes, and degrees of difficulty. Together they form a testing benchmark intended as a standard for evaluating Chinese natural language processing (NLP) capabilities.
Updates and Resources
For those keeping up with the latest Chinese large models, "The Langya List" is a dedicated competition space showcasing leading models, and SuperCLUEAI publishes a ranking of the most current Chinese large models.
SuperCLUE itself is a comprehensive evaluation benchmark for general-purpose Chinese models. CLUE is also supported by PaddleNLP, the core NLP library of PaddlePaddle, a prominent domestic deep learning framework.
Notably, the CLUE project has achieved academic recognition: its paper was accepted with high review scores at the International Conference on Computational Linguistics (COLING 2020).
The Leaderboard and Benchmark Tasks
The CLUE leaderboard is updated regularly, sourced from CLUE's official website. It features two main categories: classification tasks and reading comprehension tasks.
Classification Tasks
These tasks cover text classification problems such as semantic matching, news categorization, and natural language inference:
- AFQMC: Ant Financial Question Matching Corpus, measuring semantic similarity.
- TNEWS: Short text news classification.
- IFLYTEK: Classification of long text into various application-related categories.
- CMNLI: Chinese Natural Language Inference.
- CLUEWSC2020: Winograd Schema Challenge in Chinese.
- CSL: Chinese Scientific Literature dataset; given a paper abstract and candidate keywords, judge whether the keywords are genuine.
For each of these tasks, the leaderboard reports results from multiple baseline models such as BERT, ERNIE, and RoBERTa, and averages them into an overall score. A sketch of the shared data format follows.
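Each classification dataset is distributed as json-lines files (train/dev/test splits), one JSON object per line. The snippet below reads one AFQMC example; the field names follow the official release, while the local path is an assumption for illustration.

```python
import json

# Read one AFQMC example from the dev split (hypothetical local path).
with open("afqmc_public/dev.json", encoding="utf-8") as f:
    example = json.loads(f.readline())

# Sentence-pair semantic similarity: label "1" means the two questions
# match, label "0" means they do not.
print(example["sentence1"])
print(example["sentence2"])
print(example["label"])
```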
Reading Comprehension Tasks
These tasks evaluate a model's understanding of text, such as:
- CMRC2018: Extractive reading comprehension.
- CHID: Cloze-style idiom prediction.
- C3: Chinese multiple-choice reading comprehension.
Here again, a variety of pre-trained models are evaluated to give a comprehensive picture of each task.
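To make "extractive" concrete: CMRC2018 follows the SQuAD-style nested JSON layout, and each gold answer is a span of the passage located by character offset. The file path below is an assumption for illustration.

```python
import json

# Load the CMRC2018 training file (hypothetical local path).
with open("cmrc2018/train.json", encoding="utf-8") as f:
    dataset = json.load(f)

# Walk the SQuAD-style nesting: articles -> paragraphs -> question/answer pairs.
article = dataset["data"][0]
paragraph = article["paragraphs"][0]
context = paragraph["context"]
qa = paragraph["qas"][0]
answer = qa["answers"][0]

# The answer is a span of the context, addressed by character offset;
# an extractive model must predict that span rather than generate free text.
start = answer["answer_start"]
print(qa["question"])
print(context[start:start + len(answer["text"])])  # should equal answer["text"]
```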
Using CLUE Benchmark
To use the CLUE benchmark, clone the project from GitHub and run the provided baseline scripts for the classification or reading comprehension tasks. Both GPU and TPU training are supported.
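The repository's run scripts wrap this flow end to end, but as an independent sketch, a classification baseline can also be fine-tuned with standard libraries. The checkpoint name, hyperparameters, and data path below are assumptions for illustration, not part of the CLUE project.

```python
import json

import torch
from torch.utils.data import DataLoader
from transformers import BertForSequenceClassification, BertTokenizerFast

# Any Chinese BERT-style checkpoint works here; bert-base-chinese is one choice.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)

# AFQMC json-lines file, one example per line (hypothetical local path).
with open("afqmc_public/train.json", encoding="utf-8") as f:
    train = [json.loads(line) for line in f]

def collate(batch):
    # Encode each sentence pair jointly, as BERT expects.
    enc = tokenizer(
        [ex["sentence1"] for ex in batch],
        [ex["sentence2"] for ex in batch],
        truncation=True, padding=True, max_length=128, return_tensors="pt",
    )
    enc["labels"] = torch.tensor([int(ex["label"]) for ex in batch])
    return enc

loader = DataLoader(train, batch_size=32, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for batch in loader:  # one epoch, for brevity
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```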
Additional Tools and Resources
The project also offers PyCLUE, a companion toolkit supporting 10 tasks across 9 models, designed for easy out-of-the-box use and custom task creation.
Corpus for Pre-training and Language Modeling
The CLUE corpus, known as CLUECorpus2020, is a rich textual resource for tasks like language modeling, pre-training, and text generation. It contains over 14 GB of data, roughly 5 billion characters spread across thousands of well-defined plain-text files. The corpus includes diverse sub-corpora, such as news and community interactions, all formatted for pre-training use.
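As a rough sketch of working with a corpus this size, the text files can be streamed line by line rather than loaded whole. The directory name and .txt suffix are assumptions based on the description above.

```python
from pathlib import Path

def iter_lines(corpus_dir):
    """Yield non-empty lines from every .txt file under corpus_dir."""
    for path in sorted(Path(corpus_dir).rglob("*.txt")):
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    yield line

# Example: count characters across the corpus without holding 14 GB in memory.
total = sum(len(line) for line in iter_lines("CLUECorpus2020"))
print(f"{total:,} characters")
```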
Vision and Goals
The overarching vision for the CLUE benchmark is to serve Chinese language understanding and industrial needs by complementing general language model evaluations. Through its infrastructure, CLUE aims to stimulate further development and advancement in Chinese language models.
Datasets Overview
CLUE provides a range of datasets which are foundational to understanding the depth and variety of Chinese linguistic challenges. Some key datasets include:
- AFQMC: Ant Financial Question Matching Corpus, focusing on semantic similarity in financial customer service queries.
- TNEWS: A dataset for categorizing Chinese news articles based on short text classification.
- IFLYTEK: Long-text classification of app descriptions spanning a wide range of topic categories.
- OCNLI: Original Chinese Natural Language Inference, providing non-translated Chinese data for logical relation inference.
Each dataset targets a specific facet of Chinese language understanding, supporting comprehensive model evaluation and development; illustrative record layouts follow.
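To make the differences concrete, the records below are illustrative placeholders (not drawn from the real files) but follow the field layout of each release: TNEWS is single-sentence classification, while OCNLI labels sentence pairs with an inference relation.

```python
# Illustrative TNEWS record: one short news title, one category label.
tnews_example = {
    "sentence": "...",            # news headline (single sentence)
    "label": "102",               # numeric category id
    "label_desc": "news_entertainment",
    "keywords": "...",
}

# Illustrative OCNLI record: a sentence pair plus an inference label.
ocnli_example = {
    "sentence1": "...",           # premise
    "sentence2": "...",           # hypothesis
    "label": "entailment",        # entailment / neutral / contradiction
}
```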
In summary, the CLUE benchmark offers a structured and versatile platform for the assessment and development of Chinese language models, fostering progress in the understanding and application of Chinese NLP tasks.