Introduction to SPLADE
SPLADE (SParse Lexical AnD Expansion) is a neural retrieval model developed by Naver to improve information retrieval through sparse representations. Built on the BERT architecture, it expands queries and documents into sparse vectors over the vocabulary. This allows for efficient indexing and retrieval, providing several advantages over dense models, such as interpretable lexical matching and improved generalization across a range of datasets.
Recent Updates
The SPLADE project is actively maintained, and several recent updates have improved its functionality:
- November 2023: Enhanced training code for SPLADE and for rerankers such as cross-encoders and RankT5 was introduced, with more models anticipated soon.
- July 2023: Introduced static pruning for SPLADE indexes to help reproduce results from studies on sparse neural retrievers.
- May 2023: A new branch based on Hugging Face Trainer for training with several negative samples was added.
- April 2023: Model weights were moved to Hugging Face for better accessibility.
Overview of SPLADE
SPLADE models leverage the strength of sparse representations for text retrieval tasks, providing an effective balance between performance and computational efficiency. This is achieved through:
- Sparse Regularization: Encourages sparsity in the output vectors, yielding a manageable inverted index that is crucial for high-speed retrieval and interpretability.
- Integration with BERT: By utilizing BERT's masked language model head, SPLADE benefits from advanced natural language processing capabilities.
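The core representation described above can be sketched in a few lines: each token's MLM logits are passed through log(1 + ReLU(.)) and max-pooled over token positions into a single sparse vector over the vocabulary. The function name and toy logits below are illustrative, not part of the repository's API, and a real model would produce the logits with a BERT MLM head:

```python
import numpy as np

def splade_pooling(logits, attention_mask):
    # SPLADE-max pooling: w_j = max_i log(1 + ReLU(logit_ij)),
    # collapsing per-token MLM logits into one sparse vocabulary vector.
    acts = np.log1p(np.maximum(logits, 0.0))   # log(1 + ReLU(.)), zero for negative logits
    acts = acts * attention_mask[:, None]      # ignore padded token positions
    return acts.max(axis=0)                    # max over token positions

# Toy example: 3 token positions over a vocabulary of 5 terms.
logits = np.array([[ 2.0, -1.0, 0.0, 0.5, -3.0],
                   [-0.5,  3.0, 0.0, 0.0,  1.0],
                   [ 9.9,  9.9, 9.9, 9.9,  9.9]])
mask = np.array([1.0, 1.0, 0.0])               # third position is padding
rep = splade_pooling(logits, mask)             # sparse: untriggered terms stay 0
```

Because ReLU zeroes out negative logits before the log, most vocabulary entries end up exactly zero, which is what makes the inverted-index representation practical.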
SPLADE Versions
- SPLADE v1: The original model, presented in a SIGIR 2021 short paper, introduced the basic architecture, which uses sparse lexical expansions for ranking.
- SPLADE v2: This version incorporates improvements like hard-negative mining, distillation, and enhanced model initialization, as discussed in an arXiv paper published in 2021.
- SPLADE++ (v2bis): An extension of SPLADE v2, this iteration focuses on effectiveness through advanced distillation techniques and hard-negative sampling, achieving success in both in-domain and zero-shot settings.
- Efficient SPLADE: This model version aims to match the latency of traditional retrieval systems like BM25 without significant performance loss, as seen in SIGIR 2022 research.
Getting Started with SPLADE
Requirements
To work with SPLADE, it is recommended to create a new environment using Conda and install the necessary packages:
conda create -n splade_env python=3.9
conda activate splade_env
conda env create -f conda_splade_env.yml
Using SPLADE
- Inference with Models: SPLADE provides several pre-trained models on Hugging Face, such as naver/splade_v2_max and naver/splade_v2_distil.
- Training and Evaluation: The project supports training with data from sources like MS MARCO and utilizes datasets for distillation and negative samples.
- High-Level Code Structure: Functions for training, indexing, and retrieval are organized within the repository. Experiments are managed using the Hydra framework.
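Because SPLADE vectors are sparse, retrieval reduces to a classic inverted-index traversal: only terms that a document activates are stored, and only the query's nonzero terms are scored. The following minimal sketch illustrates that idea with invented function names and toy weights; it is not the repository's indexing code:

```python
from collections import defaultdict

def build_inverted_index(doc_vectors):
    # Map term -> list of (doc_id, weight), keeping only nonzero weights.
    index = defaultdict(list)
    for doc_id, vec in doc_vectors.items():
        for term, weight in vec.items():
            if weight > 0:
                index[term].append((doc_id, weight))
    return index

def retrieve(index, query_vec, k=10):
    # Sparse dot product: traverse only the posting lists of query terms.
    scores = defaultdict(float)
    for term, q_weight in query_vec.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda item: -item[1])[:k]

# Toy documents with hypothetical SPLADE term weights.
docs = {
    "d1": {"retrieval": 1.2, "sparse": 0.8},
    "d2": {"dense": 1.0, "retrieval": 0.3},
}
index = build_inverted_index(docs)
results = retrieve(index, {"retrieval": 1.0, "sparse": 0.5})
```

Here "d1" scores 1.2 + 0.4 = 1.6 and "d2" scores 0.3; documents sharing no terms with the query are never touched, which is the efficiency argument for sparse representations.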
Data Utilized
SPLADE works with standardized datasets such as MS MARCO for training and supports additional data configurations for different training settings. These include datasets for distillation and negative mining, which are crucial for enhancing model accuracy.
Example Usage
To execute all necessary steps, SPLADE provides a quick-start command which handles model training, indexing, and retrieval:
conda activate splade_env
export PYTHONPATH=$PYTHONPATH:$(pwd)
export SPLADE_CONFIG_NAME="config_default.yaml"
python3 -m splade.all \
config.checkpoint_dir=experiments/debug/checkpoint \
config.index_dir=experiments/debug/index \
config.out_dir=experiments/debug/out
Evaluating Performance
SPLADE models can be evaluated on benchmarks such as BEIR, and the repository provides tooling for evaluation with PISA or Anserini.
Conclusion
SPLADE stands out as a versatile and effective tool for information retrieval tasks, harnessing the strengths of sparse lexical expansions and advanced neural techniques. Its adaptability across various settings and continual development by Naver Corp. make it a powerful option for researchers and developers in the field of information retrieval.