Introduction to BM25S: Ultra-Fast Document Ranking
BM25S is a library that implements the BM25 ranking algorithm in pure Python, backed by Scipy sparse matrices. It speeds up lexical text retrieval by precomputing scores when the corpus is indexed, so ranking documents for a query becomes a fast sparse-matrix operation. BM25 is a cornerstone of search engines and underlies systems such as Elasticsearch.
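For context, BM25 scores a document D against a query Q by summing, over the query terms t, an inverse-document-frequency weight times a saturated term-frequency factor; in the classic Robertson form:

score(Q, D) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot \frac{f(t, D)\,(k_1 + 1)}{f(t, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}

where f(t, D) is the frequency of term t in document D, |D| is the document length, avgdl is the average document length in the corpus, and k_1 and b are tunable parameters.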
Key Features
- Speed: BM25S is extremely fast because it stores precomputed scores in Scipy sparse matrices, so scoring at query time reduces to slicing and summing those matrices (see the sketch after this list). This yields throughput that is orders of magnitude higher than popular pure-Python implementations such as rank-bm25.
- Simplicity: The library is straightforward to install and integrate, with no dependency on heavier frameworks or runtimes such as Java. The only required dependencies are Scipy and Numpy, with optional lightweight extras (such as PyStemmer) for stemming.
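The idea behind the speed claim is worth spelling out: rather than walking an inverted index at query time, BM25S computes token-level scores when the corpus is indexed and stores them in a sparse matrix, so answering a query amounts to summing a few rows. The snippet below is a minimal sketch of that idea written directly against Scipy; it is not BM25S's internal code, and the score values are illustrative.

import numpy as np
from scipy.sparse import csr_matrix

# Illustration only (not BM25S internals): precompute a (vocabulary x documents)
# matrix of per-token scores once, then answer a query by summing its tokens' rows.
vocab = {"cat": 0, "dog": 1, "purr": 2, "play": 3}
score_matrix = csr_matrix(np.array([
    [1.2, 0.0],   # "cat"  contributes only to document 0
    [0.0, 1.1],   # "dog"  contributes only to document 1
    [0.9, 0.0],   # "purr" contributes only to document 0
    [0.0, 0.8],   # "play" contributes only to document 1
]))  # values stand in for precomputed BM25 term scores

query_ids = [vocab["cat"], vocab["purr"]]
doc_scores = np.asarray(score_matrix[query_ids].sum(axis=0)).ravel()
print(doc_scores)                  # [2.1, 0.0]
print(doc_scores.argsort()[::-1])  # document ids ranked best-first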
Performance Comparison
BM25S demonstrates significant speed advantages over other popular tools such as Elasticsearch and rank-bm25 (a widely used pure-Python BM25 implementation). In the project's benchmarks, throughput (queries per second) improves substantially across a range of datasets.
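If you want a rough throughput number on your own data, a simple micro-benchmark is enough. The sketch below uses a toy corpus and a repeated query as placeholders; substitute a real corpus and query set for a meaningful measurement.

import time
import bm25s

# Placeholder data: replace with your own corpus and queries.
corpus = [
    "a cat is a feline and likes to purr",
    "a fish is a creature that lives in water and swims",
]
queries = ["a cat that likes to purr"] * 1000

retriever = bm25s.BM25()
retriever.index(bm25s.tokenize(corpus))

query_tokens = bm25s.tokenize(queries)
start = time.perf_counter()
retriever.retrieve(query_tokens, k=2)
elapsed = time.perf_counter() - start
print(f"{len(queries) / elapsed:.0f} queries per second")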
Installation and Usage
Installing BM25S is easy with pip:
pip install bm25s
For better performance and optional features such as stemming, additional dependencies can be installed:
pip install bm25s[full]   # pulls in the optional extras in one go
pip install PyStemmer     # enables stemming during tokenization
pip install jax[cpu]      # enables faster top-k selection at retrieval time
Quickstart Guide
Below is a simple example to get started with BM25S:
import bm25s
import Stemmer # Optional for stemming
# Define a corpus
corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]
# Optional: Setup a stemmer
stemmer = Stemmer.Stemmer("english")
# Tokenize the corpus, optionally using stopwords and stemming
corpus_tokens = bm25s.tokenize(corpus, stopwords="en", stemmer=stemmer)
# Create the BM25 model and index the corpus
retriever = bm25s.BM25()
retriever.index(corpus_tokens)
# Example query
query = "does the fish purr like a cat?"
query_tokens = bm25s.tokenize(query, stemmer=stemmer)
# Retrieve the top-k documents; results and scores both have shape (n_queries, k)
results, scores = retriever.retrieve(query_tokens, corpus=corpus, k=2)
for i in range(results.shape[1]):
    doc, score = results[0, i], scores[0, i]
    print(f"Rank {i+1} (score: {score:.2f}): {doc}")
Flexibility and Customization
BM25S provides a flexible API allowing customization of the BM25 model and the tokenization process, accommodating specific needs in text processing and ranking. Users can define custom stopwords, integrate custom tokenization methods, and adjust model parameters to suit different data types and retrieval requirements.
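For example, you can pass your own stopword list to the tokenizer and set the BM25 hyperparameters on the model. The sketch below assumes the k1 and b constructor arguments of the standard BM25 formulation; the library's exact defaults may differ.

import bm25s

corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
]

# Custom stopword list instead of the built-in English one.
corpus_tokens = bm25s.tokenize(corpus, stopwords=["a", "is", "the", "and", "to"])

# Tune term-frequency saturation (k1) and length normalization (b).
retriever = bm25s.BM25(k1=1.2, b=0.65)
retriever.index(corpus_tokens)

results, scores = retriever.retrieve(bm25s.tokenize("cat purr"), corpus=corpus, k=1)
print(results[0, 0], scores[0, 0])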
Memory Usage Efficiency
To handle large datasets efficiently, BM25S provides memory-mapping capabilities. This allows BM25 indices to be loaded without overwhelming system memory, an essential feature when dealing with massive corpora.
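In code, this looks roughly as follows (a sketch; the directory name is arbitrary, and the mmap flag is used as described in the project documentation):

import bm25s

corpus = [
    "a cat is a feline and likes to purr",
    "a fish is a creature that lives in water and swims",
]

# Build and save the index once.
retriever = bm25s.BM25()
retriever.index(bm25s.tokenize(corpus))
retriever.save("animal_index_bm25", corpus=corpus)

# Later (e.g. in another process), load it memory-mapped so the score arrays
# stay on disk and are paged in only as queries need them.
retriever = bm25s.BM25.load("animal_index_bm25", mmap=True)
results, scores = retriever.retrieve(bm25s.tokenize("a cat that likes to purr"), k=2)
print(results[0])  # document indices, since the corpus was not reloaded here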
Variants of BM25
BM25S implements several variants of the BM25 algorithm, including the original Robertson formulation, ATIRE, BM25L, BM25+, and Lucene's BM25, letting users choose the scoring method best suited to their application.
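Choosing a variant is a constructor argument (a sketch; the method names follow those listed in the project's documentation):

import bm25s

corpus_tokens = bm25s.tokenize([
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
])

# Pick the scoring variant when constructing the model,
# e.g. "robertson", "atire", "bm25l", "bm25+", or "lucene".
retriever = bm25s.BM25(method="bm25+")
retriever.index(corpus_tokens)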
Integration with Hugging Face
BM25S is integrated with Hugging Face's platform, allowing models and indices to be shared and accessed via the model hub. This makes it easy to publish an index once and build on community-contributed resources.
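Sharing an index through the Hub looks roughly like this. The sketch is based on the BM25HF helper described in the project docs; "your-username/bm25s-animals" is a placeholder repository name, and authentication is assumed to be configured via huggingface_hub.

import bm25s
from bm25s.hf import BM25HF

corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
]

retriever = BM25HF()
retriever.index(bm25s.tokenize(corpus))

# Push the index (and, optionally, the corpus) to a Hub repository you control.
retriever.save_to_hub("your-username/bm25s-animals", corpus=corpus)

# Anyone can then load it back and query it.
retriever = BM25HF.load_from_hub("your-username/bm25s-animals")
results, scores = retriever.retrieve(bm25s.tokenize("a cat likes to purr"), k=1)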
Benchmarks and Performance Highlights
BM25S outperforms various BM25 implementations in query throughput and memory efficiency. Its streamlined design and use of sparse matrices deliver significant improvements over existing solutions in both speed and disk usage.
Conclusion
BM25S offers a fast, simple, and flexible solution for document ranking and retrieval, with advanced features that cater to modern needs in information retrieval. Its integration capability and efficient resource usage make it an ideal choice for developers and data scientists seeking powerful text retrieval tools.
For more information, updates, and details on contributing or using BM25S, explore the project’s GitHub repository, technical reports, and community threads linked in the project documentation.