Introduction to the Uniem Project
The Uniem project aims to create the best general-purpose text embedding model for the Chinese language. This ambitious initiative involves the training, fine-tuning, and evaluation of models, with both the models and datasets being made available to the open-source community on HuggingFace.
Key Updates
- Release 0.3.0 (July 11, 2023): The FineTuner now supports not only M3E but also sentence_transformers and text2vec models. It introduces methods such as SGPT for training GPT-series models, as well as Prefix Tuning. Note that the FineTuner API has changed slightly and is not backward compatible with version 0.2.0.
- Release 0.2.1 (June 17, 2023): Implemented FineTuner for native model fine-tuning, allowing seamless adaptation with just a few lines of code.
- Official Release of MTEB-zh (June 17, 2023): Supports automated evaluation of six major categories of embedding models across four task categories, using nine datasets.
- Launch of M3E Models (June 8, 2023): These models outperform openai text-embedding-ada-002 on Chinese text classification and retrieval tasks. Refer to the M3E models README for more details.
Using M3E Models
The M3E series is fully compatible with sentence-transformers, so the models can be dropped into any project that supports it, such as chroma, guidance, and semantic-kernel.
To install the necessary tools, use the following command:
pip install sentence-transformers
To use the M3E model:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("moka-ai/m3e-base")
embeddings = model.encode(['Hello World!', '你好,世界!'])
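Each input sentence maps to a single dense vector, so comparing sentences reduces to comparing vectors. The snippet below is a minimal sketch of that step, assuming a recent sentence-transformers release that provides the util.cos_sim helper; the similarity step itself is illustrative and not part of the M3E model.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('moka-ai/m3e-base')
# One embedding vector per input sentence.
embeddings = model.encode(['Hello World!', '你好,世界!'])
# Cosine similarity between the two vectors (returns a 1x1 score tensor).
print(util.cos_sim(embeddings[0], embeddings[1]))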
Fine-Tuning Models
Uniem provides an easy-to-use interface for fine-tuning models. A simple example is provided below:
from datasets import load_dataset
from uniem.finetuner import FineTuner
dataset = load_dataset('shibing624/nli_zh', 'STS-B')
finetuner = FineTuner.from_pretrained('moka-ai/m3e-small', dataset=dataset)
finetuner.run(epochs=3)
For detailed information on fine-tuning, refer to the Uniem Fine-tuning Tutorial or open it directly in Colab.
To run this locally, prepare your environment with:
conda create -n uniem python=3.10
conda activate uniem
pip install uniem
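Fine-tuning does not require a dataset hosted on HuggingFace; an in-memory list of records also works. The sketch below assumes a pair-record format with text / text_pos keys (an assumption about the expected schema; see the fine-tuning tutorial for the exact record formats), and the example pairs are placeholders.

from uniem.finetuner import FineTuner

# Assumed pair-record format: each item pairs a sentence with a semantically
# similar positive example. Check the fine-tuning tutorial for the exact schema.
dataset = [
    {'text': 'How do I reset my password?', 'text_pos': 'Steps to recover a forgotten password'},
    {'text': 'Best noodle shops in Shanghai', 'text_pos': 'Recommended places to eat noodles in Shanghai'},
]

finetuner = FineTuner.from_pretrained('moka-ai/m3e-small', dataset=dataset)
finetuner.run(epochs=3)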
MTEB-zh Evaluation
The lack of a unified evaluation standard for Chinese embedding models led to the development of MTEB-zh, inspired by MTEB. It evaluates six model types across various datasets. For more details, refer to the MTEB-zh documentation.
Text Classification
This category evaluates models on six publicly available HuggingFace datasets covering news, e-commerce reviews, stock comments, long documents, and more. Accuracy is reported following the MTEB methodology.
Retrieval and Ranking
Using the T2Ranking dataset, models are evaluated on metrics such as map@1, map@10, mrr@1, mrr@10, ndcg@1, and ndcg@10.
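Since MTEB-zh builds on the MTEB framework, an evaluation run follows the standard MTEB pattern of wrapping tasks and passing in any model with an encode() method. The sketch below is illustrative only: the task name 'TNews' and the output folder are assumptions, and in practice the task definitions shipped with MTEB-zh should be used.

from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Any model exposing an encode() method can be scored; here an M3E checkpoint.
model = SentenceTransformer('moka-ai/m3e-base')

# Task name is a placeholder; replace it with a task defined by MTEB-zh.
evaluation = MTEB(tasks=['TNews'])
evaluation.run(model, output_folder='results/m3e-base')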
Contributing
Contributions are welcome! If you wish to add datasets or models to MTEB-zh, feel free to open an issue or a PR. The community looks forward to your valuable input!
License
Uniem is licensed under the Apache-2.0 License. More details can be found in the LICENSE file.
Citation
To cite this model, use the following format:
@software{Moka_Massive_Mixed_Embedding,
  author = {Wang Yuxin and Sun Qingxuan and He Sicheng},
  title = {M3E: Moka Massive Mixed Embedding Model},
  year = {2023}
}
Uniem is a cutting-edge project that significantly contributes to the advancement of text embeddings in the Chinese language, blending innovation with accessibility for researchers and developers alike.