STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases
STaRK is an innovative project from Stanford University designed to benchmark how well large language models (LLMs) retrieve information from textual and relational knowledge bases. Aimed at applications such as product search, academic paper retrieval, and biomedical question answering, STaRK sets a new standard in this domain with diverse, context-specific queries that mimic real-world scenarios.
Why STaRK?
Novel Task
STaRK tackles a unique challenge: how effectively LLMs can retrieve answers when queries depend on both free-form text and relational structure. This is especially important given the increasing complexity and variety of information retrieval needs in digital spaces.
Large-scale and Diverse Knowledge Bases
To support this evaluation, STaRK includes three expansive knowledge bases sourced from publicly available data. These comprehensive datasets enable extensive testing across different domains and applications.
Natural and Practical Queries
The hallmark of the STaRK benchmark is its set of queries. These are crafted to reflect realistic questions users might ask, combining complex relational and textual elements. This ensures that retrieval systems are tested under practical conditions.
Accessing STaRK
Getting started with STaRK is straightforward. It is distributed as a pip package (stark-qa) and supports Python 3.8 through 3.11. The data is also available on the Hugging Face platform, which simplifies integration into existing setups.
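As a minimal setup sketch, assuming a standard Python environment (consult the project page for the authoritative commands):

```python
# Install the published package first (Python 3.8 through 3.11 are supported):
#   pip install stark-qa

# The pip package exposes the stark_qa module used in the examples below.
import stark_qa
```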
How to Use STaRK
- Environment Setup: Install the STaRK package via pip, as sketched above, or set it up from source.
- Data Loading and Integration: Use the stark_qa module to load datasets for a specific domain, such as Amazon product data. This covers both the retrieval datasets and the semi-structured knowledge bases (see the loading sketch after this list).
- Benchmark Evaluation: Install additional packages such as llm2vec, gritlm, and bm25 to evaluate retrieval models on the datasets. STaRK provides scripts for downloading and generating embeddings, enabling immediate experimentation and evaluation (a metric sketch also follows this list).
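To make the loading step concrete, here is a minimal sketch using the load_qa and load_skb helpers exposed by the stark_qa module; treat the exact signatures and the field order of dataset items as assumptions to verify against the project's documentation:

```python
from stark_qa import load_qa, load_skb

dataset_name = "amazon"  # one of the three knowledge bases; see the project docs for the others

# Retrieval dataset: natural-language queries paired with ground-truth answer ids.
qa_dataset = load_qa(dataset_name)

# Semi-structured knowledge base: the entity graph plus its textual attributes.
skb = load_skb(dataset_name, download_processed=True)

# Each item bundles the query text, its id, the gold answer ids, and metadata
# (field order assumed from the project's examples).
query, query_id, answer_ids, meta_info = qa_dataset[0]
print(query, answer_ids)
```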
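And to illustrate what the evaluation step measures, below is a library-free sketch of Hit@k, one of the standard retrieval metrics; the repository's own scripts compute such metrics with retrievers like llm2vec, gritlm, or bm25, so this is only a reference implementation of the metric, not the project's harness:

```python
from typing import Dict, List, Set

def hit_at_k(ranked: Dict[int, List[int]], gold: Dict[int, Set[int]], k: int = 1) -> float:
    """Fraction of queries whose top-k retrieved ids contain a gold answer.

    ranked: query id -> candidate ids sorted by decreasing retrieval score.
    gold:   query id -> set of ground-truth answer ids (e.g. from a STaRK QA split).
    """
    hits = sum(1 for qid, cands in ranked.items() if gold[qid] & set(cands[:k]))
    return hits / max(len(ranked), 1)

# Toy usage with made-up ids:
ranked = {0: [42, 7, 13], 1: [99, 3]}
gold = {0: {7}, 1: {5}}
print(hit_at_k(ranked, gold, k=2))  # 0.5: query 0 hits at rank 2, query 1 misses
```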
Contributions to Research
STaRK is not only a tool for current applications but also a pathway to future advances in information retrieval, providing researchers with robust datasets against which to develop and assess new retrieval models. Its acceptance to the NeurIPS 2024 Datasets and Benchmarks Track highlights its significance.
For Researchers
Researchers looking to cite STaRK can reference the publication in the NeurIPS 2024 Datasets and Benchmarks Track, which outlines the framework and findings behind STaRK's development and application.
```bibtex
@inproceedings{wu24stark,
  title     = {STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases},
  author    = {Shirley Wu and Shiyu Zhao and Michihiro Yasunaga and Kexin Huang and Kaidi Cao and Qian Huang and Vassilis N. Ioannidis and Karthik Subbian and James Zou and Jure Leskovec},
  booktitle = {NeurIPS Datasets and Benchmarks Track},
  year      = {2024}
}
```
For more details, readers can explore STaRK's website, which offers a comprehensive overview and access to the project's tools and resources.