Needle In A Haystack - Pressure Testing LLMs
The "Needle In A Haystack" project is an intriguing endeavor aimed at evaluating the in-context retrieval abilities of long-context Language Model Models (LLMs). This project provides a straightforward but effective way to test how well various AI models can locate a small, specific piece of information (the "needle") hidden within a large amount of text (the "haystack"). Supported models are from renowned providers such as OpenAI, Anthropic, and Cohere.
The Test Explained
The core idea behind this test is to embed a random fact or statement, referred to as the "needle," somewhere within a large block of text, known as the "haystack." The task for the LLM is to correctly identify and retrieve this needle from many pages of unrelated context. By varying how deeply the needle is buried and how long the surrounding context is, researchers can measure how retrieval performance changes.
The original tests that kicked off this project are preserved in a dedicated /original_results folder, but the testing script has evolved considerably since then.
Getting Started with the Project
Setting Up the Environment
To ensure that the Python dependencies are isolated and do not interfere with other system-wide configurations, it's best to use a virtual environment. Here's how to set it up:
python3 -m venv venv
source venv/bin/activate
Environment Variables and Installing the Package
API keys are required to call the models, so set environment variables such as NIAH_MODEL_API_KEY and NIAH_EVALUATOR_API_KEY, depending on which evaluation strategy you use. After setting these variables, install the package from PyPI with:
pip install needlehaystack
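For reference, a minimal shell setup might look like the following, where the key values are placeholders you replace with your own credentials:
export NIAH_MODEL_API_KEY="<your model provider API key>"
export NIAH_EVALUATOR_API_KEY="<your evaluator API key>"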
To run a test, execute the needlehaystack.run_test command from the command line, specifying your chosen model provider among other arguments. Here's an example for an OpenAI model:
needlehaystack.run_test --provider openai --model_name "gpt-3.5-turbo-0125" --document_depth_percents "[50]" --context_lengths "[2000]"
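The same command structure applies to the other supported providers; for instance, an Anthropic run could look like the example below, where the model name is purely illustrative:
needlehaystack.run_test --provider anthropic --model_name "claude-2.1" --document_depth_percents "[50]" --context_lengths "[2000]"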
For Collaborators
If you're looking to make a contribution, start by forking and cloning the repository. Next, set up the environment and install the package in editable mode so that your local changes take effect without reinstalling:
pip install -e .
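Putting the earlier steps together, a typical contributor setup might look like this, with the fork URL and directory name as placeholders:
git clone <your-fork-url>
cd <repository-directory>
python3 -m venv venv
source venv/bin/activate
pip install -e .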
Key Parameters for Testing
The project offers flexible parameters such as model_to_test, evaluator, needle, and haystack_dir that allow for customized testing scenarios. You can adjust these to change where the needle is placed or to run multiple tests in parallel, while staying within provider rate limits.
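As a sketch of a customized run, assuming the needle and haystack_dir parameters are also exposed as command-line flags of the same name, with an illustrative needle and directory:
needlehaystack.run_test --provider openai --model_name "gpt-3.5-turbo-0125" \
  --needle "The secret ingredient is fresh basil." \
  --haystack_dir PaulGrahamEssays \
  --document_depth_percents "[25, 50, 75]" \
  --context_lengths "[1000, 2000, 4000]"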
Advanced Features
Multi Needle Evaluator:
For a more complex scenario, multi-needle testing is possible. By enabling --multi_needle True, multiple needles can be inserted at calculated intervals to thoroughly test the model's retrieval capabilities.
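A hedged example of such a run, assuming the list of needles can be supplied via a --needles flag (that flag name is an assumption, not confirmed above):
needlehaystack.run_test --provider openai --model_name "gpt-3.5-turbo-0125" \
  --multi_needle True \
  --needles '["first fact", "second fact", "third fact"]' \
  --document_depth_percents "[50]" --context_lengths "[2000]"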
LangSmith Evaluator:
To coordinate evaluations and archive results, the LangSmith tool can be used. After signing up and configuring environment variables, users can create datasets for various scenarios and run tests using --evaluator langsmith.
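For example, assuming the LangSmith API key is the one supplied through NIAH_EVALUATOR_API_KEY mentioned earlier (an assumption; the exact variable may differ), a run could look like:
export NIAH_EVALUATOR_API_KEY="<your LangSmith API key>"
needlehaystack.run_test --provider openai --model_name "gpt-3.5-turbo-0125" --evaluator langsmith --document_depth_percents "[50]" --context_lengths "[2000]"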
Visualizing Results
The LLMNeedleInHaystackVisualization.ipynb notebook includes tools for creating visual representations of the test results. These visual aids can be transferred to platforms like Google Slides for further enhancement and sharing.
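Assuming Jupyter is installed in the same environment, the notebook can be opened locally with:
jupyter notebook LLMNeedleInHaystackVisualization.ipynb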
Conclusion
The Needle In A Haystack project not only tests the limits of LLMs' data retrieval capabilities but also provides a robust framework for evaluating and improving these AI systems. Licensed under the MIT License, it encourages contributions and use while ensuring proper credit to the original creators. This initiative opens new frontiers in understanding how AI processes and retrieves information in a sea of data.