A Comprehensive Overview of the Vocab-Coverage Project
Introduction
The vocab-coverage project aims to assess how well various language models understand Chinese. The investigation covers several core areas: Chinese character recognition, the distribution of input-side word vectors, and the distribution of output-side word vectors. Given the complexity of large-scale language models, a multi-faceted approach is needed to gain a thorough understanding. This analysis provides one perspective on the inner workings of language models and is intended as a reference for future model evaluation.
Chinese Character Recognition Analysis
To evaluate the models' Chinese character recognition, three commonly used character sets are employed: the Table of General Standard Chinese Characters, the Chart of Standard Forms of Common National Characters, and Unicode's CJK Unified Ideographs, totaling 21,267 characters.
Character Sets
- The Table of General Standard Chinese Characters includes 8,105 characters categorized into three levels of frequency, derived from extensive consultation and finalized in 2013 by China's Ministry of Education.
- The Chart of Standard Forms of Common National Characters from Taiwan contains 4,808 frequently used characters and an additional selection of less common characters.
- The Unicode CJK Unified Ideographs initially included 20,902 characters as per the 1993 ISO/Unicode standards, with modern updates increasing this count significantly.
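For orientation, the basic CJK Unified Ideographs block can be enumerated directly from its Unicode code-point range. The sketch below covers only the original 1993 allocation and omits the extension blocks added in later Unicode versions.

```python
# Basic CJK Unified Ideographs as originally allocated in 1993:
# U+4E00 through U+9FA5, i.e. 20,902 code points.
cjk_basic = [chr(cp) for cp in range(0x4E00, 0x9FA5 + 1)]
print(len(cjk_basic))  # 20902
```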
Literacy Judgment
The literacy ability of a model is judged based on how it tokenizes Chinese characters:
- Tokenizer Encoding: Tokenizers such as WordPiece and BBPE (Byte-level BPE) encode characters differently. A model is considered to recognize a character if its tokenizer encodes it directly as a single token; when a character is split into multiple tokens, recognition is considered imperfect (see the sketch after this list).
- Color-Coding for Recognition: In the visualizations, different colors mark characters from the various subsets, indicating the level of recognition and tokenization achieved.
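As a rough illustration of the single-token criterion, the following sketch uses the Hugging Face transformers tokenizer API. The model name and category labels are illustrative, not the project's actual implementation.

```python
from transformers import AutoTokenizer

def check_recognition(model_name: str, chars: list[str]) -> dict[str, str]:
    """Classify each character by how the model's tokenizer encodes it."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    results = {}
    for ch in chars:
        ids = tokenizer.encode(ch, add_special_tokens=False)
        if len(ids) == 1 and ids[0] != tokenizer.unk_token_id:
            results[ch] = "single token"                    # recognized directly
        elif len(ids) > 1:
            results[ch] = f"split into {len(ids)} tokens"   # imperfect recognition
        else:
            results[ch] = "unknown token"                   # mapped to [UNK]
    return results

# Illustrative usage with a multilingual BERT checkpoint
print(check_recognition("bert-base-multilingual-cased", ["汉", "你", "龘"]))
```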
Word Vector Distribution Analysis
Examining the spatial distribution of word vectors provides insight into a model's semantic understanding. Merely including characters in the vocabulary does not imply semantic comprehension; the vectors must also be trained so that semantically related tokens lie close together.
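One way to inspect that distribution is to pull the input-side embedding matrix from a model and project selected token vectors into two dimensions. The sketch below uses transformers together with scikit-learn's t-SNE as an assumed projection method, not necessarily the one the project itself uses.

```python
import numpy as np
from sklearn.manifold import TSNE
from transformers import AutoModel, AutoTokenizer

def embed_tokens_2d(model_name: str, tokens: list[str]) -> np.ndarray:
    """Project input-side embeddings of the given tokens into 2D for plotting."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    # Input-side word vectors live in the model's embedding matrix
    matrix = model.get_input_embeddings().weight.detach().numpy()
    ids = tokenizer.convert_tokens_to_ids(tokens)
    vectors = matrix[ids]
    # Perplexity must stay below the number of samples
    tsne = TSNE(n_components=2, perplexity=min(30, len(tokens) - 1))
    return tsne.fit_transform(vectors)
```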
Classification of Characters
Tokens are categorized by language: Chinese, English, Japanese, Korean, numbers, or other symbols, with Chinese tokens further divided into common and rare characters for more granular analysis.
Language Categorization
- Common Chinese characters are those listed in the frequently used tiers of major dictionaries and standard character sets.
- Rare characters are less frequent and often omitted from regular educational resources.
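A first-pass language bucket for each character can be derived from Unicode ranges. This is only an approximation of the categories above; the common/rare split for Chinese would additionally require a dictionary or charset lookup.

```python
def classify_char(ch: str) -> str:
    """Rough language bucket for a single character, based on Unicode ranges."""
    cp = ord(ch)
    if 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF:
        return "chinese"    # CJK Unified Ideographs and Extension A
    if 0x3040 <= cp <= 0x30FF:
        return "japanese"   # Hiragana and Katakana
    if 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF:
        return "korean"     # Hangul syllables and jamo
    if ch.isascii() and ch.isalpha():
        return "english"
    if ch.isdigit():
        return "number"
    return "other"

print([classify_char(c) for c in "汉aあ한7!"])
# ['chinese', 'english', 'japanese', 'korean', 'number', 'other']
```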
Analysis and Observations
The project provides a comprehensive comparison across different models, such as BERT, ERNIE, various multilingual models, and ones developed by OpenAI. It offers insights into each model's strengths and limitations concerning Chinese text.
Usage of the Vocab-Coverage Command-Line Tool
The project includes a command-line tool, vocab-coverage, which facilitates character and vocabulary analysis:
Installation
Installation instructions are provided for setting up the tool.
Usage
The tool offers several subcommands for charset exploration, coverage calculation, and word-vector embedding analysis, which help in evaluating a model's comprehension capabilities. A conceptual sketch of the coverage calculation follows.
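This standalone sketch is not the CLI's actual implementation, and none of the tool's subcommand names are assumed here; it only illustrates the single-token criterion applied across a whole character set.

```python
from transformers import AutoTokenizer

def charset_coverage(model_name: str, charset: list[str]) -> float:
    """Fraction of characters the tokenizer encodes as one known token."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    hits = 0
    for ch in charset:
        ids = tokenizer.encode(ch, add_special_tokens=False)
        if len(ids) == 1 and ids[0] != tokenizer.unk_token_id:
            hits += 1
    return hits / len(charset)
```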
By using color codes and spatial analysis, the project visually represents and analyzes the nuanced understanding that language models have of the different character sets. This evaluative approach is a useful step toward more advanced NLP model development and evaluation.