Introduction to Neural-Cherche
Neural-Cherche is a library for fine-tuning neural search models such as Splade, ColBERT, and SparseEmbed on a specific dataset. It also provides classes for running efficient inference with the fine-tuned retrievers and rankers. The primary goal of Neural-Cherche is to offer an accessible yet powerful way to refine and deploy neural search models, whether offline or online. In addition, it lets users save all computed embeddings so they never need to be recalculated.
Neural-Cherche is flexible and runs on CPU, GPU, and MPS devices. ColBERT can be fine-tuned from any pre-trained Sentence Transformer checkpoint, whereas Splade and SparseEmbed must start from a model pre-trained with a masked language modeling (MLM) objective.
Getting Started with Neural-Cherche
Installation
Installing Neural-Cherche is straightforward. Users can set it up using the following command:
pip install neural-cherche
For those who wish to evaluate their model during training, the library can be installed with an evaluation package:
pip install "neural-cherche[eval]"
Documentation
Comprehensive documentation is available online and provides detailed guidance on using Neural-Cherche's features.
Quick Start Guide
Training with Neural-Cherche involves using a dataset of triples (anchor, positive, negative), where the "anchor" is a query, the "positive" is a document relevant to the anchor, and the "negative" is a document that is not relevant to the anchor.
For example:
X = [
    ("anchor 1", "positive 1", "negative 1"),
    ("anchor 2", "positive 2", "negative 2"),
    ("anchor 3", "positive 3", "negative 3"),
]
Here's a simple way to fine-tune ColBERT using a pre-trained Sentence Transformer checkpoint with Neural-Cherche:
import torch

from neural_cherche import models, utils, train

model = models.ColBERT(
    model_name_or_path="raphaelsty/neural-cherche-colbert",
    device="cuda" if torch.cuda.is_available() else "cpu",  # or "mps"
)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-6)

X = [
    ("query", "positive document", "negative document"),
    ("query", "positive document", "negative document"),
    ("query", "positive document", "negative document"),
]

for step, (anchor, positive, negative) in enumerate(utils.iter(
    X,
    epochs=1,  # specify number of epochs
    batch_size=8,  # set triples per batch
    shuffle=True,
)):
    loss = train.train_colbert(
        model=model,
        optimizer=optimizer,
        anchor=anchor,
        positive=positive,
        negative=negative,
        step=step,
        gradient_accumulation_steps=50,
    )

    if (step + 1) % 1000 == 0:
        # Save the model every 1000 steps
        model.save_pretrained("checkpoint")
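Once training completes, the saved checkpoint can be reloaded for inference. Here is a minimal sketch, assuming the directory written by save_pretrained can be passed as model_name_or_path like a regular Hugging Face checkpoint:
import torch

from neural_cherche import models

# Reload the fine-tuned weights saved above in the "checkpoint" directory.
# Assumption: the directory written by save_pretrained can be loaded via
# model_name_or_path, like any Hugging Face checkpoint.
model = models.ColBERT(
    model_name_or_path="checkpoint",
    device="cuda" if torch.cuda.is_available() else "cpu",
)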
Document Retrieval
Once the model is fine-tuned, it can be used for document retrieval and re-ranking. Here's how to retrieve candidates with BM25 and re-rank them with a fine-tuned ColBERT model:
import torch
from lenlp import sparse

from neural_cherche import models, rank, retrieve

documents = [
    {"id": "doc1", "title": "Paris", "text": "Paris is the capital of France."},
    {"id": "doc2", "title": "Montreal", "text": "Montreal is the largest city in Quebec."},
    {"id": "doc3", "title": "Bordeaux", "text": "Bordeaux is in Southwestern France."},
]

retriever = retrieve.BM25(
    key="id",
    on=["title", "text"],
    count_vectorizer=sparse.CountVectorizer(
        normalize=True, ngram_range=(3, 5), analyzer="char_wb", stop_words=[]
    ),
    k1=1.5,
    b=0.75,
    epsilon=0.0,
)

model = models.ColBERT(
    model_name_or_path="raphaelsty/neural-cherche-colbert",
    device="cuda" if torch.cuda.is_available() else "cpu",  # or "mps"
)

ranker = rank.ColBERT(
    key="id",
    on=["title", "text"],
    model=model,
)

documents_embeddings = retriever.encode_documents(
    documents=documents,
)

retriever.add(
    documents_embeddings=documents_embeddings,
)

queries = ["Paris", "Montreal", "Bordeaux"]

queries_embeddings = retriever.encode_queries(
    queries=queries,
)

ranker_queries_embeddings = ranker.encode_queries(
    queries=queries,
)

candidates = retriever(
    queries_embeddings=queries_embeddings,
    batch_size=32,
    k=100,  # documents to retrieve
)

ranker_documents_embeddings = ranker.encode_candidates_documents(
    candidates=candidates,
    documents=documents,
    batch_size=32,
)

scores = ranker(
    queries_embeddings=ranker_queries_embeddings,
    documents_embeddings=ranker_documents_embeddings,
    documents=candidates,
    batch_size=32,
)

scores
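The ranker returns one list of matches per query, each match carrying the document key and a relevance score. As a rough post-processing sketch (the "similarity" field name is an assumption; inspect scores to confirm the exact structure your Neural-Cherche version returns), the ranked ids can be mapped back to the original documents:
# Map ranked document ids back to the full documents.
# Assumption: each match exposes the document key under "id" and a score
# under "similarity"; check the actual output before relying on these names.
documents_index = {document["id"]: document for document in documents}

for query, query_scores in zip(queries, scores):
    print(query)
    for match in query_scores:
        print(documents_index[match["id"]], match.get("similarity"))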
Here, Neural-Cherche's capabilities shine through: its SparseEmbed, Splade, TfIdf, and BM25 retrievers can be combined with a ColBERT ranker to refine the retrieval process.
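The retrievers share a common workflow: encode documents, add them to the index, encode queries, and call the retriever. As a minimal sketch, assuming retrieve.TfIdf accepts the same key and on arguments as the BM25 retriever above, a lexical TfIdf retriever can be swapped in as follows (reusing the documents and queries defined earlier):
from neural_cherche import retrieve

# Minimal sketch: a TfIdf retriever as a drop-in replacement for BM25.
# Assumption: retrieve.TfIdf accepts the same key / on arguments; check the
# documentation of your Neural-Cherche version.
retriever = retrieve.TfIdf(
    key="id",
    on=["title", "text"],
)

documents_embeddings = retriever.encode_documents(documents=documents)
retriever.add(documents_embeddings=documents_embeddings)

queries_embeddings = retriever.encode_queries(queries=queries)
candidates = retriever(
    queries_embeddings=queries_embeddings,
    k=100,
)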
Pre-trained Models
Neural-Cherche offers pre-trained checkpoints optimized for further fine-tuning. Examples include raphaelsty/neural-cherche-sparse-embed and raphaelsty/neural-cherche-colbert. These have been fine-tuned on subsets of the MS-MARCO dataset and could benefit from additional fine-tuning on user-specific datasets.
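Loading one of these checkpoints mirrors the ColBERT example above. A minimal sketch, assuming models.SparseEmbed takes the same constructor arguments as models.ColBERT:
import torch

from neural_cherche import models

# Minimal sketch: load the pre-trained SparseEmbed checkpoint for further
# fine-tuning. Assumption: models.SparseEmbed shares the constructor
# arguments shown for models.ColBERT above.
model = models.SparseEmbed(
    model_name_or_path="raphaelsty/neural-cherche-sparse-embed",
    device="cuda" if torch.cuda.is_available() else "cpu",
)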
Performance Metrics
Neural-Cherche has been evaluated on the SciFact dataset, with performance reported as ndcg@10, hits@10, and hits@1 for the different retriever and ranker combinations.
Community Contributions
Notably, contributions from developers like Benjamin Clavié and Arthur Satouf have been instrumental in the evolution of Neural-Cherche.
License and References
Neural-Cherche is available under the MIT open-source license, although some components, such as Splade, are limited to non-commercial use. Others, such as SparseEmbed and ColBERT, are fully open-source and permit commercial applications. For further reading on the underlying models, detailed references to scholarly articles are available.