Sentence Transformers: Multilingual Sentence, Paragraph, and Image Embeddings
Overview
Sentence Transformers is a versatile framework for computing dense vector representations of sentences, paragraphs, and images using transformer networks like BERT, RoBERTa, and XLM-RoBERTa. Texts are embedded in a vector space where similar texts lie close together, so they can be matched efficiently with techniques like cosine similarity; the models achieve remarkable performance across a wide range of tasks.
The framework provides a broad array of pre-trained models, fine-tuned for more than 100 languages, catering to diverse use-cases. Additionally, Sentence Transformers facilitates the fine-tuning of custom embedding models, allowing users to tailor solutions for specific tasks.
Installation
To get started with Sentence Transformers, the recommended configuration includes Python 3.8+, PyTorch 1.11.0+, and transformers v4.34.0+. There are several ways to install it:
- Using pip:

    pip install -U sentence-transformers
- Using conda:

    conda install -c conda-forge sentence-transformers
- From source: clone the latest version from the repository and install it directly:

    pip install -e .
If GPU support is desired, ensure PyTorch is installed with the appropriate CUDA version.
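To verify that PyTorch can actually see the GPU, a quick sanity check like the following helps (this is plain PyTorch, not specific to Sentence Transformers):

    import torch

    # True if this PyTorch build has CUDA support and a GPU is visible
    print(torch.cuda.is_available())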
Getting Started
To quickly begin using the pre-trained models, follow these simple steps:
- Import the SentenceTransformer class and load a pre-trained model:

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
- Provide sentences to the model for embedding:

    sentences = [
        "The weather is lovely today.",
        "It's so sunny outside!",
        "He drove to the stadium.",
    ]
    embeddings = model.encode(sentences)
    print(embeddings.shape)
    # (3, 384): one 384-dimensional vector per sentence
- Use the embeddings for similarity computations (cosine similarity by default):

    similarities = model.similarity(embeddings, embeddings)
    print(similarities)
    # A 3x3 matrix of pairwise similarity scores
Pre-Trained Models
The framework offers an extensive collection of pre-trained models covering a wide range of languages. Users load a model by specifying its name, choosing between general-purpose models and models tailored to specific use cases.
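For example, a multilingual model is loaded the same way as an English one; paraphrase-multilingual-MiniLM-L12-v2 below is one such model from the pre-trained listing, supporting 50+ languages:

    from sentence_transformers import SentenceTransformer

    # A multilingual model from the pre-trained listing (50+ languages)
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    embeddings = model.encode(["Hallo Welt", "Hello world", "Bonjour le monde"])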
Training Custom Models
Sentence Transformers supports fine-tuning your own sentence embedding models, yielding task-specific embeddings. Training works with various transformer networks and offers features such as multilingual learning, a choice of loss functions, and evaluation during training to select the best model.
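As a rough sketch, fine-tuning with the SentenceTransformerTrainer API (available since v3.0) can look like the following; the anchor/positive pairs here are a toy placeholder, and a real run needs substantially more data:

    from datasets import Dataset
    from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

    # Start from an existing pre-trained checkpoint
    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Toy (anchor, positive) pairs for illustration only
    train_dataset = Dataset.from_dict({
        "anchor": ["What is the capital of France?", "How do planes fly?"],
        "positive": ["Paris is the capital of France.", "Wings generate lift as air flows over them."],
    })

    # One of several available losses: other in-batch examples act as negatives
    loss = losses.MultipleNegativesRankingLoss(model)

    trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
    trainer.train()

    model.save("models/my-finetuned-model")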
Application Examples
The framework is highly versatile and can be applied in numerous scenarios, including:
- Computing Sentence Embeddings
- Semantic Textual Similarity
- Semantic Search
- Retrieve & Re-Rank
- Clustering
- Paraphrase Mining
- Translated Sentence Mining
- Multilingual Image Search, Clustering & Duplicate Detection
For more application examples and detailed guides, visit the examples section.
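As a taste of the Semantic Search use case, here is a minimal sketch built on the library's util.semantic_search helper; the corpus and query are illustrative:

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # A tiny illustrative corpus; a real application would embed many documents
    corpus = [
        "A man is eating food.",
        "A man is riding a horse.",
        "A woman is playing the violin.",
    ]
    corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
    query_embedding = model.encode("Someone riding an animal", convert_to_tensor=True)

    # For each query, return the top-k closest corpus entries by cosine similarity
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
    for hit in hits[0]:
        print(corpus[hit["corpus_id"]], round(hit["score"], 3))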
Development Setup
For those interested in contributing, after cloning the repository, use the following commands in a virtual environment:
python -m pip install -e ".[dev]"
pre-commit install
Test changes with:
pytest
Citing the Project
If the framework is helpful in research or development projects, users are encouraged to cite the related publications, in particular "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks" (Reimers & Gurevych, EMNLP 2019).
Maintainers: Tom Aarsen and the Hugging Face team.
For detailed information, visit the official documentation.
If any issues arise or further questions come up, users are welcome to open an issue in the repository's issue tracker.