Sentence Transformers: Multilingual Sentence, Paragraph, and Image Embeddings
Overview
Sentence Transformers is a versatile framework for computing dense vector representations of sentences, paragraphs, and images using transformer networks like BERT, RoBERTa, and XLM-RoBERTa. Texts are embedded in a vector space where similar texts lie close together, so they can be matched efficiently with techniques like cosine similarity; the models achieve remarkable performance across a wide range of tasks.
The framework provides a broad array of pre-trained models, fine-tuned for more than 100 languages, catering to diverse use-cases. Additionally, Sentence Transformers facilitates the fine-tuning of custom embedding models, allowing users to tailor solutions for specific tasks.
Installation
To get started with Sentence Transformers, the recommended configuration includes Python 3.8+, PyTorch 1.11.0+, and transformers v4.34.0+. There are several ways to install it:
- Using pip:

    pip install -U sentence-transformers
- Using conda:

    conda install -c conda-forge sentence-transformers
- From source: clone the latest version from the repository and install it directly:

    pip install -e .
If GPU support is desired, ensure PyTorch is installed with the appropriate CUDA version.
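To verify that PyTorch can actually see the GPU, a quick sanity check like the following helps (this is plain PyTorch, not specific to Sentence Transformers):

    import torch

    # True if this PyTorch build has CUDA support and a GPU is visible
    print(torch.cuda.is_available())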
Getting Started
To quickly begin using the pre-trained models, follow these simple steps:
- Import the SentenceTransformer class and load a pre-trained model:

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
- Provide sentences to the model for embedding:

    sentences = [
        "The weather is lovely today.",
        "It's so sunny outside!",
        "He drove to the stadium.",
    ]
    embeddings = model.encode(sentences)
    print(embeddings.shape)
    # (3, 384): one 384-dimensional vector per sentence
- Use the embeddings for similarity computations (cosine similarity by default):

    similarities = model.similarity(embeddings, embeddings)
    print(similarities)
    # A 3x3 matrix of pairwise similarity scores
Pre-Trained Models
The framework offers an extensive collection of pre-trained models covering a wide range of languages. Users load a model by specifying its name, choosing between general-purpose models and models tailored to specific use cases.
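For example, a multilingual model is loaded the same way as an English one; paraphrase-multilingual-MiniLM-L12-v2 below is one such model from the pre-trained listing, supporting 50+ languages:

    from sentence_transformers import SentenceTransformer

    # A multilingual model from the pre-trained listing (50+ languages)
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    embeddings = model.encode(["Hallo Welt", "Hello world", "Bonjour le monde"])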
Training Custom Models
Sentence Transformers supports fine-tuning your own sentence embedding models, yielding task-specific embeddings. Training works with various transformer networks and offers features such as multilingual learning, a choice of loss functions, and evaluation during training to select the best model.
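As a rough sketch, fine-tuning with the SentenceTransformerTrainer API (available since v3.0) can look like the following; the anchor/positive pairs here are a toy placeholder, and a real run needs substantially more data:

    from datasets import Dataset
    from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

    # Start from an existing pre-trained checkpoint
    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Toy (anchor, positive) pairs for illustration only
    train_dataset = Dataset.from_dict({
        "anchor": ["What is the capital of France?", "How do planes fly?"],
        "positive": ["Paris is the capital of France.", "Wings generate lift as air flows over them."],
    })

    # One of several available losses: other in-batch examples act as negatives
    loss = losses.MultipleNegativesRankingLoss(model)

    trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
    trainer.train()

    model.save("models/my-finetuned-model")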
Application Examples
The framework is highly versatile and can be applied in numerous scenarios, including:
- Computing Sentence Embeddings
- Semantic Textual Similarity
- Semantic Search
- Retrieve & Re-Rank
- Clustering
- Paraphrase Mining
- Translated Sentence Mining
- Multilingual Image Search, Clustering & Duplicate Detection
For more application examples and detailed guides, visit the examples section.
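As a taste of the Semantic Search use case, here is a minimal sketch built on the library's util.semantic_search helper; the corpus and query are illustrative:

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # A tiny illustrative corpus; a real application would embed many documents
    corpus = [
        "A man is eating food.",
        "A man is riding a horse.",
        "A woman is playing the violin.",
    ]
    corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
    query_embedding = model.encode("Someone riding an animal", convert_to_tensor=True)

    # For each query, return the top-k closest corpus entries by cosine similarity
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
    for hit in hits[0]:
        print(corpus[hit["corpus_id"]], round(hit["score"], 3))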
Development Setup
For those interested in contributing, after cloning the repository, use the following commands in a virtual environment:
python -m pip install -e ".[dev]"
pre-commit install
Test changes with:
pytest
Citing the Project
If the framework is helpful in research or development projects, users are encouraged to cite the related publications, in particular "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks" (Reimers & Gurevych, EMNLP 2019).
Maintainers: Tom Aarsen and the Hugging Face team.
For detailed information, visit the official documentation.
If any issues arise or further questions come up, users are welcome to open an issue in the repository's issue tracker.