Pretrained Language Model
The Pretrained Language Model repository, developed by Huawei Noah’s Ark Lab, provides cutting-edge pretrained language models along with their respective optimization techniques. This collection particularly emphasizes models suitable for Chinese language processing tasks.
Overview of the Directory
PanGu-α
PanGu-α is a large-scale autoregressive Chinese language model with up to 200 billion parameters. The model is developed under MindSpore and trained on a cluster of Ascend 910 AI processors.
NEZHA
NEZHA-TensorFlow is a pretrained Chinese language model that achieves state-of-the-art results on several Chinese natural language processing tasks. NEZHA-PyTorch is its PyTorch counterpart.
NEZHA-Gen
NEZHA-Gen-TensorFlow provides two GPT models: Yuefu (乐府), which generates Chinese classical poetry, and a general-purpose Chinese GPT model.
TinyBERT and its Variants
TinyBERT is a BERT model compressed by knowledge distillation; it is 7.5 times smaller and 9.4 times faster at inference, making it well suited for deployment. TinyBERT-MindSpore is the MindSpore version of the model.
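As a rough illustration of the transformer distillation that TinyBERT is trained with, the sketch below computes the kinds of loss terms used in layer-wise distillation: an MSE on (projected) hidden states, an MSE on attention matrices, and a soft cross-entropy on the prediction logits. The tensor names, the single-layer formulation, and the projection layer are illustrative assumptions, not the repository's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def tinybert_distillation_loss(student_hidden, teacher_hidden,
                               student_attn, teacher_attn,
                               student_logits, teacher_logits,
                               hidden_proj, temperature=1.0):
    """One layer-wise distillation step in the spirit of TinyBERT.
    hidden_proj maps the (smaller) student hidden size to the teacher's."""
    # Hidden-state loss: MSE between projected student states and teacher states.
    hidden_loss = F.mse_loss(hidden_proj(student_hidden), teacher_hidden)
    # Attention loss: MSE between student and teacher attention matrices.
    attn_loss = F.mse_loss(student_attn, teacher_attn)
    # Prediction loss: soft cross-entropy between student and teacher logits.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    pred_loss = -(soft_targets *
                  F.log_softmax(student_logits / temperature, dim=-1)).sum(-1).mean()
    return hidden_loss + attn_loss + pred_loss

# Toy usage with random tensors (batch=2, seq=8, 4 heads, student dim 312, teacher dim 768).
proj = nn.Linear(312, 768)
loss = tinybert_distillation_loss(
    torch.randn(2, 8, 312), torch.randn(2, 8, 768),
    torch.rand(2, 4, 8, 8), torch.rand(2, 4, 8, 8),
    torch.randn(2, 8, 2), torch.randn(2, 8, 2),
    hidden_proj=proj)
```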
DynaBERT
DynaBERT is a BERT variant that can run at adaptive width and depth, so a single trained model can serve different latency and memory budgets.
BBPE
BBPE provides a byte-level vocabulary building tool and its corresponding tokenizer.
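To illustrate the general technique behind BBPE, here is a minimal sketch of byte-level BPE using the Hugging Face tokenizers library. BBPE's own vocabulary builder and tokenizer have their own interface, so treat this only as an illustration of the approach, not as the repository's tool.

```python
# Byte-level BPE illustrated with the Hugging Face `tokenizers` library.
# Merges are learned over UTF-8 bytes, so any Unicode text (e.g. Chinese)
# is covered without out-of-vocabulary tokens.
from tokenizers import ByteLevelBPETokenizer

corpus = [
    "预训练语言模型可以处理中文文本。",
    "Byte-level BPE operates on UTF-8 bytes rather than characters.",
]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(corpus, vocab_size=500, min_frequency=1)

encoding = tokenizer.encode("预训练模型")
print(encoding.tokens)  # byte-level subword pieces
print(encoding.ids)
```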
PMLM
PMLM is a probabilistically masked language model. Trained without the complex two-stream self-attention, it can be viewed as a simple approximation of XLNet.
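The core idea of probabilistic masking is that the masking ratio is itself drawn from a prior distribution rather than fixed at a constant rate. The sketch below mocks this up in PyTorch with a uniform prior; the exact prior and masking details used by PMLM may differ, so this is only an illustrative assumption.

```python
import torch

def probabilistic_mask(input_ids, mask_token_id, pad_token_id, ignore_index=-100):
    """Mask each sequence with a ratio sampled from a uniform prior.
    Illustrates the probabilistic-masking idea; not PMLM's exact recipe."""
    masked = input_ids.clone()
    labels = torch.full_like(input_ids, ignore_index)
    for i in range(input_ids.size(0)):
        positions = (input_ids[i] != pad_token_id).nonzero(as_tuple=True)[0]
        ratio = torch.rand(()).item()                    # masking ratio ~ U(0, 1)
        n_mask = max(1, int(ratio * positions.numel()))
        chosen = positions[torch.randperm(positions.numel())[:n_mask]]
        labels[i, chosen] = input_ids[i, chosen]         # predict only masked tokens
        masked[i, chosen] = mask_token_id
    return masked, labels

# Toy usage: batch of 2 sequences, pad id 0, mask id 103 (BERT-style).
ids = torch.tensor([[5, 6, 7, 8, 0, 0], [9, 10, 11, 12, 13, 14]])
masked, labels = probabilistic_mask(ids, mask_token_id=103, pad_token_id=0)
```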
TernaryBERT
TernaryBERT applies a weight ternarization method to BERT and is developed under PyTorch; TernaryBERT-MindSpore is the MindSpore version.
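For intuition, the sketch below shows threshold-based weight ternarization in the style of ternary weight networks (TWN), one of the approximation methods this line of work builds on. The 0.7 threshold constant and the per-tensor granularity are common choices used here for illustration, not necessarily what the repository implements.

```python
import torch

def ternarize_twn(weight):
    """Threshold-based ternarization in the TWN style: weights whose
    magnitude falls below delta become 0, the rest become +/- alpha."""
    delta = 0.7 * weight.abs().mean()                                # threshold
    mask = (weight.abs() > delta).float()                            # entries kept nonzero
    alpha = (weight.abs() * mask).sum() / mask.sum().clamp(min=1.0)  # scaling factor
    return alpha * torch.sign(weight) * mask                         # values in {-alpha, 0, +alpha}

w = torch.randn(768, 768)
w_ternary = ternarize_twn(w)
print(torch.unique(w_ternary).numel())  # at most 3 distinct values
```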
HyperText
HyperText is an efficient text classification model built on hyperbolic geometry.
BinaryBERT
BinaryBERT binarizes the weights of a BERT model via ternary weight splitting, and is developed under PyTorch.
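Ternary weight splitting rests on the observation that any ternary tensor can be rewritten as the sum of two binary tensors with half the scale, so a ternarized model can be converted into a binary one without changing its outputs at the point of conversion. The toy sketch below shows only that equivalence; BinaryBERT's actual construction additionally preserves the latent full-precision weights, which this version does not.

```python
import torch

def split_ternary(ternary, alpha):
    """Rewrite a ternary tensor (entries in {-alpha, 0, +alpha}) as the sum
    of two binary tensors with entries in {-alpha/2, +alpha/2}. Toy version
    of the equivalence that ternary weight splitting relies on."""
    half = torch.full_like(ternary, alpha / 2)
    b1 = torch.where(ternary >= 0, half, -half)   # +half for {0, +alpha}, -half for -alpha
    b2 = torch.where(ternary > 0, half, -half)    # +half only for +alpha
    assert torch.allclose(b1 + b2, ternary)
    return b1, b2

alpha = 0.05
ternary = alpha * torch.randint(-1, 2, (4, 4)).float()
b1, b2 = split_ternary(ternary, alpha)
```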
AutoTinyBERT
AutoTinyBERT provides a model zoo of efficient pretrained models that can meet different latency requirements.
PanGu-Bot
PanGu-Bot is a Chinese pretrained open-domain dialogue model built on the GPU implementation of PanGu-α.
CeMAT
CeMAT is a universal sequence-to-sequence multilingual pretrained model for both autoregressive and non-autoregressive neural machine translation tasks.
Noah_WuKong
Noah_WuKong is a large-scale Chinese vision-language dataset together with a group of benchmark models trained on it; a MindSpore version of the models is also available.
CAME
CAME is a Confidence-guided Adaptive Memory Efficient optimizer that reduces the memory footprint of optimizer states during large-model training.
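In its reference implementation, CAME follows the standard PyTorch optimizer interface, so in principle it can be swapped in where Adam or AdamW would be used. The import path and constructor arguments below are assumptions for illustration only; check the CAME directory for the exact package name and recommended hyper-parameters.

```python
import torch

# Hypothetical import path -- the actual module/package name is defined in the
# CAME directory of this repository; adjust accordingly.
from came_pytorch import CAME  # assumption, not a verified import

model = torch.nn.Linear(768, 768)
# lr and weight_decay are generic optimizer options; consult the CAME
# code and paper for its recommended settings.
optimizer = CAME(model.parameters(), lr=2e-4, weight_decay=1e-2)

loss = model(torch.randn(8, 768)).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```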
This repository is a comprehensive toolkit for researchers and developers focusing on Chinese natural language processing, offering diverse models and tools to suit different needs and platforms.