Instructor-Embedding Project Introduction
Overview
The instructor-embedding project presents a highly adaptable text embedding model named Instructor👨🏫. This innovative model is specially designed to handle a diverse range of tasks such as classification, retrieval, clustering, and text evaluation across various domains like science and finance. What sets Instructor apart is its ability to generate task-specific text embeddings simply by receiving a task instruction, without any need for additional fine-tuning. This model achieves state-of-the-art performance on 70 different embedding tasks.
Key Improvements
To address some outdated elements of the original repository, a forked version of the Instructor model introduces the following enhancements:
- Compatibility with sentence-transformers library versions above 2.2.2.
- Model downloading from Hugging Face via the newer snapshot_download API.
- A "cache_dir" parameter for specifying a custom download location.
Installation and Setup
Getting started with Instructor is straightforward:
- Set up an environment using Conda:
conda create -n instructor python=3.7
- Activate your environment:
conda activate instructor
- Install from source:
git clone https://github.com/HKUNLP/instructor-embedding
cd instructor-embedding
pip install -r requirements.txt
- Alternatively, install the InstructorEmbedding package from PyPI:
pip install InstructorEmbedding
Using the Model
After setting up, you can immediately start using a pretrained model:
from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-large')
Simply provide a sentence and a custom task-specific instruction to generate embeddings:
text_instruction_pairs = [
    ["Represent the Science title:", "3D ActionSLAM: wearable person tracking in multi-floor environments"]
]
customized_embeddings = model.encode(text_instruction_pairs)
The model output is a numpy array with one embedding row per input pair. This simple process highlights the model's adaptability to various tasks by leveraging task-specific instructions.
Use Cases
The Instructor model can be used in numerous scenarios:
- Custom Text Embeddings: Generate embeddings tailored to a specific task by writing an instruction that follows the template "Represent the (domain) (text type) for (task):".
- Text Similarity: Calculate similarity scores between custom-generated embeddings of different texts.
- Information Retrieval: Retrieve relevant documents by matching query embeddings with document embeddings.
- Clustering: Group texts into clusters based on the similarity of their custom embeddings.
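The similarity and retrieval use cases above all reduce to comparing embedding vectors, typically with cosine similarity. Below is a minimal sketch using stand-in numpy vectors; in practice the vectors would come from model.encode, and the cosine_sim helper is ours, not part of the library.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity: dot product of the L2-normalized vectors.
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))

# Stand-in embeddings; in practice use model.encode([[instruction, text], ...]).
emb_query = np.array([0.1, 0.9, 0.2])
emb_doc_a = np.array([0.1, 0.8, 0.3])
emb_doc_b = np.array([0.9, 0.1, 0.0])

print(cosine_sim(emb_query, emb_doc_a))  # high: similar direction
print(cosine_sim(emb_query, emb_doc_b))  # low: dissimilar direction
```

The same comparison underlies clustering: texts whose embeddings have high pairwise similarity end up in the same cluster.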
Training
To train the Instructor model with new data, the project provides a dataset called Multitask Embeddings Data with Instructions (MEDI). It includes over 330 datasets from various sources mentioned in the project documentation, formatted for compatibility with the model. Training involves using this data with specified arguments to fine-tune the model for additional tasks.
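Fine-tuning on MEDI uses a contrastive objective: the query embedding is pulled toward its positive pair and pushed away from negatives. The sketch below is an illustrative InfoNCE-style loss over mock numpy vectors, not the project's actual training code (the function name and temperature value are our assumptions).

```python
import numpy as np

def info_nce_loss(query, positive, negatives, temperature=0.05):
    """Illustrative InfoNCE-style contrastive loss: pull the positive
    close to the query, push negatives away."""
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    q, p, n = norm(query), norm(positive), norm(negatives)
    # Similarity of the query to the positive and to each negative.
    logits = np.concatenate([[q @ p], n @ q]) / temperature
    # Cross-entropy with the positive at index 0.
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(-np.log(probs[0]))

q = np.array([1.0, 0.0])
pos = np.array([0.9, 0.1])
negs = np.array([[0.0, 1.0], [-1.0, 0.2]])
print(info_nce_loss(q, pos, negs))  # small loss: positive is closest
```

When the positive pair is far from the query while a negative is close, the loss grows, which is exactly the gradient signal that shapes the embedding space during fine-tuning.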
Evaluation
Instructor is tested extensively across:
- MTEB: A benchmark suite for holistic embedding model evaluation.
- Billboard: Comparison of cosine similarity-based text generation evaluations with human judgments.
- Prompt Retrieval: Retrieves similar examples using embeddings, enhancing prompt-based learning.
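In the prompt-retrieval setting, the core operation is ranking candidate in-context examples by their embedding similarity to a test query. A minimal sketch with stand-in vectors follows (real vectors would come from model.encode; the top_k_examples helper is our own illustration):

```python
import numpy as np

def top_k_examples(query_emb, example_embs, k=2):
    """Return the indices of the k examples most similar to the
    query, ranked by cosine similarity (highest first)."""
    q = query_emb / np.linalg.norm(query_emb)
    e = example_embs / np.linalg.norm(example_embs, axis=1, keepdims=True)
    sims = e @ q
    return np.argsort(-sims)[:k].tolist()

query_emb = np.array([0.2, 0.9, 0.1])
example_embs = np.array([
    [0.9, 0.1, 0.0],   # dissimilar example
    [0.2, 0.8, 0.2],   # very similar example
    [0.1, 0.9, 0.0],   # similar example
])
print(top_k_examples(query_emb, example_embs, k=2))  # → [1, 2]
```

The retrieved examples are then placed in the prompt, so better embeddings translate directly into better in-context demonstrations.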
Through these evaluations, Instructor demonstrates superior performance across various text embedding tasks, proving its versatility and effectiveness in real-world applications.
Conclusion
The instructor-embedding project represents a significant advance in natural language processing. Its flexibility for task-specific applications, without requiring additional fine-tuning, makes it an exceptionally powerful tool for anyone working with text embeddings across diverse challenges and domains.