SimCSE: Simple Contrastive Learning of Sentence Embeddings
SimCSE is a contrastive learning framework for producing sentence embeddings. This repository contains the code and pre-trained models described in the paper "SimCSE: Simple Contrastive Learning of Sentence Embeddings".
Overview
SimCSE introduces a straightforward approach to generating sentence embeddings using contrastive learning. It is versatile enough to handle both labeled and unlabeled data.
- Unsupervised SimCSE takes an input sentence and predicts itself in a contrastive objective, using only standard dropout as noise.
- Supervised SimCSE incorporates annotated pairs from Natural Language Inference (NLI) datasets, using 'entailment' pairs as positives and 'contradiction' pairs as hard negatives.
The resulting encoders map sentences to dense vector representations, enabling operations such as semantic similarity computation and sentence retrieval.
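The core idea can be illustrated with a minimal PyTorch sketch of the unsupervised objective (illustrative only, not the repository's training code; the model name, pooling choice, and batch here are placeholders, while the 0.05 temperature follows the paper): each sentence is encoded twice, dropout makes the two forward passes differ, and the two views of the same sentence form a positive pair while other sentences in the batch serve as negatives.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.train()  # keep dropout active so the two forward passes differ
sentences = ["A woman is reading.", "A man is playing guitar.", "Two dogs are running."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
def embed(batch):
    # use the [CLS] token representation as the sentence embedding (one common pooling choice)
    return encoder(**batch).last_hidden_state[:, 0]
z1, z2 = embed(batch), embed(batch)  # two dropout-perturbed views of the same sentences
sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / 0.05  # temperature 0.05
labels = torch.arange(sim.size(0))  # the i-th sentence's positive is its own second view
loss = F.cross_entropy(sim, labels)
loss.backward()
The supervised variant swaps the second dropout view for the entailment hypothesis and adds the contradiction hypothesis as a hard negative; a corresponding sketch appears in the training section below.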
Getting Started
SimCSE offers an easy-to-use sentence embedding tool. Users can install it via PyPI:
pip install simcse
Or directly from the source:
python setup.py install
For GPU acceleration, install a PyTorch build that matches your CUDA version. Once installed, using SimCSE is straightforward. Here's a quick example:
from simcse import SimCSE
model = SimCSE("princeton-nlp/sup-simcse-bert-base-uncased")
embeddings = model.encode("A woman is reading.")
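Encoding a batch works the same way; the snippet below assumes that encode also accepts a list of sentences and returns one embedding per sentence.
sentences = ["A woman is reading.", "A man is playing guitar."]
embeddings = model.encode(sentences)  # assumed: one embedding per input sentence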
Functionality
SimCSE provides functionality like encoding sentences into embeddings, computing cosine similarities, and sentence search:
- Encoding: Converts sentences into dense numerical vectors (embeddings).
- Cosine Similarity: Computes pairwise cosine similarities between two groups of sentences.
- Search: Builds an index over a sentence collection so that the closest matches for a query can be retrieved quickly.
For larger datasets, SimCSE can be integrated with faiss, a similarity search library, though there are some compatibility considerations with newer Nvidia GPUs.
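Here is a short sketch of these operations using the simcse tool's similarity, build_index, and search methods:
from simcse import SimCSE
model = SimCSE("princeton-nlp/sup-simcse-bert-base-uncased")
# Pairwise cosine similarities between two groups of sentences
sentences_a = ["A woman is reading.", "A man is playing a guitar."]
sentences_b = ["He plays guitar.", "A woman is taking a picture."]
similarities = model.similarity(sentences_a, sentences_b)
# Build an index over a sentence collection, then retrieve the closest matches for a query
sentences = ["A woman is reading.", "A man is playing a guitar."]
model.build_index(sentences)
results = model.search("He plays guitar.")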
Models
SimCSE provides several pre-trained models that can be loaded through the simcse package or through HuggingFace Transformers. They target different use cases:
- Unsupervised Models: Trained on unlabeled sentences from English Wikipedia.
- Supervised Models: Trained on NLI datasets.
The supervised models generally score higher on standard semantic textual similarity (STS) benchmarks, while the unsupervised models require no labeled data, so the right choice depends on the task and the data available.
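Switching between checkpoints is just a matter of passing a different model name; for example, the repository's model table lists an unsupervised BERT-base checkpoint and a supervised RoBERTa-large checkpoint:
from simcse import SimCSE
unsup_model = SimCSE("princeton-nlp/unsup-simcse-bert-base-uncased")  # unsupervised, Wikipedia-trained
sup_model = SimCSE("princeton-nlp/sup-simcse-roberta-large")  # supervised, NLI-trained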
Using SimCSE with HuggingFace
In addition to the simcse package, the pre-trained models can be loaded directly with the HuggingFace transformers library:
import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("princeton-nlp/sup-simcse-bert-base-uncased")
model = AutoModel.from_pretrained("princeton-nlp/sup-simcse-bert-base-uncased")
# Tokenize the input sentences and get the pooled embeddings
inputs = tokenizer(["A woman is reading.", "A woman is making a sauce."], padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs, output_hidden_states=True, return_dict=True).pooler_output
# Cosine similarity between the two sentences (higher means more similar)
similarity = 1 - cosine(embeddings[0], embeddings[1])
Training Your Own Models
SimCSE provides scripts and guidance for users who wish to train their own models, in either the unsupervised or the supervised setting. Training choices such as the data used and the model hyperparameters are crucial for good results.
- Unsupervised: Typically uses a sampled set of Wikipedia sentences.
- Supervised: Utilizes SNLI and MNLI datasets, leveraging entailment and contradiction dynamics.
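The supervised objective can be sketched as follows (a minimal illustration, not the repository's training code; the function name and signature are hypothetical): each premise is paired with its entailment hypothesis as the positive and its contradiction hypothesis as a hard negative, and both are contrasted against the rest of the batch.
import torch
import torch.nn.functional as F
def supervised_simcse_loss(anchor, positive, hard_negative, temperature=0.05):
    # anchor, positive, hard_negative: [batch, hidden] sentence embeddings
    pos_sim = F.cosine_similarity(anchor.unsqueeze(1), positive.unsqueeze(0), dim=-1)
    neg_sim = F.cosine_similarity(anchor.unsqueeze(1), hard_negative.unsqueeze(0), dim=-1)
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature  # [batch, 2 * batch]
    labels = torch.arange(anchor.size(0))  # the matching entailment pair sits on the diagonal
    return F.cross_entropy(logits, labels)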
Bugs or Questions?
For issues or inquiries, open an issue on the repository or contact the main contributors.
Citation
If you use SimCSE in academic work, please cite the paper:
@inproceedings{gao2021simcse,
title={{SimCSE}: Simple Contrastive Learning of Sentence Embeddings},
author={Gao, Tianyu and Yao, Xingcheng and Chen, Danqi},
booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
year={2021}
}
SimCSE in the Community
The SimCSE framework has been adopted and extended by various contributors, including adaptations for other languages such as Chinese, and integration into platforms like HuggingFace Spaces. This collaboration highlights the project's flexibility and broad relevance in NLP circles.