Introduction to Neural-Cherche
Neural-Cherche is a library for fine-tuning neural search models such as Splade, ColBERT, and SparseEmbed on a specific dataset. It also provides classes for running efficient inference with the fine-tuned retrievers and rankers. The primary goal of Neural-Cherche is to offer an accessible yet powerful way to refine and deploy neural search models, whether offline or online. In addition, it lets users save all computed embeddings so they never need to be recalculated.
Neural-Cherche is flexible and runs on CPU, GPU, and MPS devices. ColBERT can be fine-tuned from any pre-trained Sentence Transformer checkpoint, whereas Splade and SparseEmbed must start from a model pre-trained with a masked language modeling (MLM) objective.
Getting Started with Neural-Cherche
Installation
Installing Neural-Cherche is straightforward. Users can set it up using the following command:
pip install neural-cherche
For those who wish to evaluate their model during training, the library can be installed with an evaluation package:
pip install "neural-cherche[eval]"
Documentation
Comprehensive documentation is available online and provides detailed guidance on using Neural-Cherche's features.
Quick Start Guide
Training with Neural-Cherche involves using a dataset of triples (anchor, positive, negative), where the "anchor" is a query, the "positive" is a document relevant to the anchor, and the "negative" is a document that is not relevant to the anchor.
For example:
X = [
    ("anchor 1", "positive 1", "negative 1"),
    ("anchor 2", "positive 2", "negative 2"),
    ("anchor 3", "positive 3", "negative 3"),
]
Here's a simple way to fine-tune ColBERT using a pre-trained Sentence Transformer checkpoint with Neural-Cherche:
import torch

from neural_cherche import models, utils, train

model = models.ColBERT(
    model_name_or_path="raphaelsty/neural-cherche-colbert",
    device="cuda" if torch.cuda.is_available() else "cpu",  # or "mps"
)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-6)

X = [
    ("query", "positive document", "negative document"),
    ("query", "positive document", "negative document"),
    ("query", "positive document", "negative document"),
]

for step, (anchor, positive, negative) in enumerate(utils.iter(
    X,
    epochs=1,  # specify number of epochs
    batch_size=8,  # set triples per batch
    shuffle=True,
)):
    loss = train.train_colbert(
        model=model,
        optimizer=optimizer,
        anchor=anchor,
        positive=positive,
        negative=negative,
        step=step,
        gradient_accumulation_steps=50,
    )

    if (step + 1) % 1000 == 0:
        # Save the model every 1000 steps
        model.save_pretrained("checkpoint")
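Once training completes, the saved checkpoint can be reloaded for inference. Here is a minimal sketch, assuming the directory written by save_pretrained can be passed as model_name_or_path like a regular Hugging Face checkpoint:
import torch

from neural_cherche import models

# Reload the fine-tuned weights saved above in the "checkpoint" directory.
# Assumption: the directory written by save_pretrained can be loaded via
# model_name_or_path, like any Hugging Face checkpoint.
model = models.ColBERT(
    model_name_or_path="checkpoint",
    device="cuda" if torch.cuda.is_available() else "cpu",
)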
Document Retrieval
Once the model is fine-tuned, it can be used for document retrieval and re-ranking. Here's how to retrieve candidates with BM25 and re-rank them with a fine-tuned ColBERT model:
import torch
from lenlp import sparse

from neural_cherche import models, rank, retrieve

documents = [
    {"id": "doc1", "title": "Paris", "text": "Paris is the capital of France."},
    {"id": "doc2", "title": "Montreal", "text": "Montreal is the largest city in Quebec."},
    {"id": "doc3", "title": "Bordeaux", "text": "Bordeaux is in Southwestern France."},
]

retriever = retrieve.BM25(
    key="id",
    on=["title", "text"],
    count_vectorizer=sparse.CountVectorizer(
        normalize=True, ngram_range=(3, 5), analyzer="char_wb", stop_words=[]
    ),
    k1=1.5,
    b=0.75,
    epsilon=0.0,
)

model = models.ColBERT(
    model_name_or_path="raphaelsty/neural-cherche-colbert",
    device="cuda" if torch.cuda.is_available() else "cpu",  # or "mps"
)

ranker = rank.ColBERT(
    key="id",
    on=["title", "text"],
    model=model,
)

documents_embeddings = retriever.encode_documents(
    documents=documents,
)

retriever.add(
    documents_embeddings=documents_embeddings,
)

queries = ["Paris", "Montreal", "Bordeaux"]

queries_embeddings = retriever.encode_queries(
    queries=queries,
)

ranker_queries_embeddings = ranker.encode_queries(
    queries=queries,
)

candidates = retriever(
    queries_embeddings=queries_embeddings,
    batch_size=32,
    k=100,  # documents to retrieve
)

ranker_documents_embeddings = ranker.encode_candidates_documents(
    candidates=candidates,
    documents=documents,
    batch_size=32,
)

scores = ranker(
    queries_embeddings=ranker_queries_embeddings,
    documents_embeddings=ranker_documents_embeddings,
    documents=candidates,
    batch_size=32,
)

scores
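The ranker returns one list of matches per query, each match carrying the document key and a relevance score. As a rough post-processing sketch (the "similarity" field name is an assumption; inspect scores to confirm the exact structure your Neural-Cherche version returns), the ranked ids can be mapped back to the original documents:
# Map ranked document ids back to the full documents.
# Assumption: each match exposes the document key under "id" and a score
# under "similarity"; check the actual output before relying on these names.
documents_index = {document["id"]: document for document in documents}

for query, query_scores in zip(queries, scores):
    print(query)
    for match in query_scores:
        print(documents_index[match["id"]], match.get("similarity"))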
Here, Neural-Cherche's capabilities shine through: its SparseEmbed, Splade, TfIdf, and BM25 retrievers can be combined with a ColBERT ranker to refine the retrieval process.
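The retrievers share a common workflow: encode documents, add them to the index, encode queries, and call the retriever. As a minimal sketch, assuming retrieve.TfIdf accepts the same key and on arguments as the BM25 retriever above, a lexical TfIdf retriever can be swapped in as follows (reusing the documents and queries defined earlier):
from neural_cherche import retrieve

# Minimal sketch: a TfIdf retriever as a drop-in replacement for BM25.
# Assumption: retrieve.TfIdf accepts the same key / on arguments; check the
# documentation of your Neural-Cherche version.
retriever = retrieve.TfIdf(
    key="id",
    on=["title", "text"],
)

documents_embeddings = retriever.encode_documents(documents=documents)
retriever.add(documents_embeddings=documents_embeddings)

queries_embeddings = retriever.encode_queries(queries=queries)
candidates = retriever(
    queries_embeddings=queries_embeddings,
    k=100,
)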
Pre-trained Models
Neural-Cherche offers pre-trained checkpoints optimized for further fine-tuning. Examples include raphaelsty/neural-cherche-sparse-embed and raphaelsty/neural-cherche-colbert. These have been fine-tuned on subsets of the MS-MARCO dataset and could benefit from additional fine-tuning on user-specific datasets.
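Loading one of these checkpoints mirrors the ColBERT example above. A minimal sketch, assuming models.SparseEmbed takes the same constructor arguments as models.ColBERT:
import torch

from neural_cherche import models

# Minimal sketch: load the pre-trained SparseEmbed checkpoint for further
# fine-tuning. Assumption: models.SparseEmbed shares the constructor
# arguments shown for models.ColBERT above.
model = models.SparseEmbed(
    model_name_or_path="raphaelsty/neural-cherche-sparse-embed",
    device="cuda" if torch.cuda.is_available() else "cpu",
)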
Performance Metrics
Neural-Cherche has been evaluated on the SciFact dataset, with performance reported as ndcg@10, hits@10, and hits@1 for the different retriever and ranker combinations.
Community Contributions
Notably, contributions from developers like Benjamin Clavié and Arthur Satouf have been instrumental in the evolution of Neural-Cherche.
License and References
Neural-Cherche is available under the MIT open-source license, although some components, such as Splade, are limited to non-commercial use. Others, such as SparseEmbed and ColBERT, are fully open-source and permit commercial applications. For further reading on the underlying models, detailed references to scholarly articles are available.