Introduction to the Uniem Project
The Uniem project aims to create the best general-purpose text embedding model for the Chinese language. This ambitious initiative involves the training, fine-tuning, and evaluation of models, with both the models and datasets being made available to the open-source community on HuggingFace.
Key Updates
- Release 0.3.0 (July 11, 2023): The FineTuner now supports not only M3E but also sentence_transformers and text2vec models. It introduces methods such as SGPT for training GPT-series models, as well as Prefix Tuning. Note that the FineTuner API has changed slightly and is not backward compatible with version 0.2.0.
- Release 0.2.1 (June 17, 2023): Implemented FineTuner for native model fine-tuning, allowing seamless adaptation with just a few lines of code.
- Official Release of MTEB-zh (June 17, 2023): Supports automated evaluation of six major categories of embedding models across four task categories, using nine datasets.
- Launch of M3E Models (June 8, 2023): These models outperform openai text-embedding-ada-002 on Chinese text classification and retrieval tasks. Refer to the M3E models README for more details.
Using M3E Models
The M3E series is fully compatible with sentence-transformers, so the models can be dropped into any project that supports it, such as chroma, guidance, and semantic-kernel.
To install the necessary tools, use the following command:
pip install sentence-transformers
To use the M3E model:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("moka-ai/m3e-base")
embeddings = model.encode(['Hello World!', '你好,世界!'])
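Each input sentence maps to a single dense vector, so comparing sentences reduces to comparing vectors. The snippet below is a minimal sketch of that step, assuming a recent sentence-transformers release that provides the util.cos_sim helper; the similarity step itself is illustrative and not part of the M3E model.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('moka-ai/m3e-base')
# One embedding vector per input sentence.
embeddings = model.encode(['Hello World!', '你好,世界!'])
# Cosine similarity between the two vectors (returns a 1x1 score tensor).
print(util.cos_sim(embeddings[0], embeddings[1]))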
Fine-Tuning Models
Uniem provides an easy-to-use interface for fine-tuning models. A simple example is provided below:
from datasets import load_dataset
from uniem.finetuner import FineTuner
dataset = load_dataset('shibing624/nli_zh', 'STS-B')
finetuner = FineTuner.from_pretrained('moka-ai/m3e-small', dataset=dataset)
finetuner.run(epochs=3)
For detailed information on fine-tuning, refer to the Uniem Fine-tuning Tutorial or open it directly in Colab.
To run this locally, prepare your environment with:
conda create -n uniem python=3.10
conda activate uniem
pip install uniem
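Fine-tuning does not require a dataset hosted on HuggingFace; an in-memory list of records also works. The sketch below assumes a pair-record format with text / text_pos keys (an assumption about the expected schema; see the fine-tuning tutorial for the exact record formats), and the example pairs are placeholders.

from uniem.finetuner import FineTuner

# Assumed pair-record format: each item pairs a sentence with a semantically
# similar positive example. Check the fine-tuning tutorial for the exact schema.
dataset = [
    {'text': 'How do I reset my password?', 'text_pos': 'Steps to recover a forgotten password'},
    {'text': 'Best noodle shops in Shanghai', 'text_pos': 'Recommended places to eat noodles in Shanghai'},
]

finetuner = FineTuner.from_pretrained('moka-ai/m3e-small', dataset=dataset)
finetuner.run(epochs=3)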
MTEB-zh Evaluation
The lack of a unified evaluation standard for Chinese embedding models led to the development of MTEB-zh, inspired by MTEB. It evaluates six model types across various datasets. For more details, refer to the MTEB-zh documentation.
Text Classification
This category evaluates models on six publicly available HuggingFace datasets covering news, e-commerce reviews, stock comments, long documents, and more. Accuracy is reported following the MTEB methodology.
Retrieval and Ranking
Using the T2Ranking dataset, models are evaluated on metrics such as map@1, map@10, mrr@1, mrr@10, ndcg@1, and ndcg@10.
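Since MTEB-zh builds on the MTEB framework, an evaluation run follows the standard MTEB pattern of wrapping tasks and passing in any model with an encode() method. The sketch below is illustrative only: the task name 'TNews' and the output folder are assumptions, and in practice the task definitions shipped with MTEB-zh should be used.

from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Any model exposing an encode() method can be scored; here an M3E checkpoint.
model = SentenceTransformer('moka-ai/m3e-base')

# Task name is a placeholder; replace it with a task defined by MTEB-zh.
evaluation = MTEB(tasks=['TNews'])
evaluation.run(model, output_folder='results/m3e-base')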
Contributing
Contributions are welcome! If you wish to add datasets or models to MTEB-zh, feel free to open an issue or a PR. The community looks forward to your valuable input!
License
Uniem is licensed under the Apache-2.0 License. More details can be found in the LICENSE file.
Citation
To cite this model, use the following format:
@software{Moka_Massive_Mixed_Embedding,
  author = {Wang Yuxin and Sun Qingxuan and He Sicheng},
  title = {M3E: Moka Massive Mixed Embedding Model},
  year = {2023}
}
Uniem is a cutting-edge project that significantly contributes to the advancement of text embeddings in the Chinese language, blending innovation with accessibility for researchers and developers alike.