Introduction to the LOFT Project
The LOFT (Long-Context Frontiers) project is an ambitious benchmark dedicated to exploring the capabilities of large language models when handling extensive contexts, scaling up to one million tokens and beyond. It accompanies the paper "Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?" and evaluates models on a diverse range of real-world tasks. These tasks fall into six categories, spanning retrieval across several modalities, multi-hop compositional reasoning via RAG, SQL, and many-shot in-context learning.
Objective of the LOFT Project
The core objective of LOFT is to assess how effectively long-context language models can process, interpret, and act on large amounts of in-context data without relying on separate components such as retrievers or SQL engines. In essence, LOFT invites researchers and developers to push the boundaries of what these models can achieve on their own, potentially reducing reliance on supplementary processing tools.
Installation and Setup
To start working with LOFT, users can easily clone the repository and install the necessary dependencies:
$ git clone git@github.com:google-deepmind/loft.git
$ cd loft/
$ pip install -r requirements.txt
This initial setup ensures that users have everything they need to explore the datasets and engage with the benchmark tasks effectively.
Downloading Datasets and Prompts
The LOFT project offers a script for downloading all relevant datasets. By executing the following commands, users can save the datasets in a directory of their choice:
$ BASE_DIR=your-choice-of-directory
$ sh download.sh $BASE_DIR
This directory will then contain an organized structure, with datasets and prompts grouped under task categories such as retrieval, RAG, and SQL.
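For orientation, the sketch below shows roughly how the directory is laid out, inferred from the paths used by the inference command later in this guide. The lower-case dataset name fever is an assumption based on the FEVER dataset mentioned below, the per-dataset file names are illustrative, and the outputs/ subtree only appears once inference has been run:

$BASE_DIR/
  data/
    retrieval/
      fever/
        128k/                        # dataset files at a given context length
  prompts/
    retrieval_128k/
      retrieval_fever_128k.txt       # prompt prefix for one dataset
  outputs/
    retrieval/
      fever/
        128k/
          predictions.jsonl          # written by run_inference.py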
Task Categories and Datasets
LOFT includes 35 datasets spanning four modalities (text, image, video, and audio), providing a rich and varied set of benchmark tasks. These are designed to test models in real-world scenarios, including the following (a short data-inspection sketch follows the list):
- Text Retrieval: argument retrieval, fact checking, question answering, and web search, using datasets such as ArguAna, FEVER, and MS MARCO.
- Visual Retrieval: image and video retrieval, using datasets such as Flickr30k and MS COCO.
- Audio Retrieval: retrieving audio using datasets such as FLEURS-en; some audio datasets are still marked as coming soon.
- RAG and SQL Tasks: answering questions, including multi-hop compositional ones, directly over large corpora or database tables supplied in context, rather than through external retrieval or SQL engines.
- Many-Shot In-Context Learning: learning tasks from large numbers of demonstration examples packed into the prompt.
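Once downloaded, the data can be inspected directly. The inference output later in this guide is JSONL, and a quick way to get oriented is to peek at any line-delimited JSON file in the tree. This is a minimal sketch; the file path is an illustrative assumption rather than a documented file name:

import json

# Hypothetical path: point this at any .jsonl file under $BASE_DIR/data.
path = "data/retrieval/fever/128k/queries.jsonl"

with open(path) as f:
    for i, line in enumerate(f):
        record = json.loads(line)      # each line is one JSON object
        print(sorted(record.keys()))   # inspect the available fields
        if i >= 2:                     # look at the first few records only
            break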
Inference and Evaluation
The LOFT project provides supporting tools for inference and evaluation. For instance, one can run inference with the gemini-1.5-flash-002 model on Vertex AI, using customizable scripts that specify the dataset, split, context length, output path, and Google Cloud project ID:
$ python run_inference.py \
    --prompt_prefix_path ${BASE_DIR}/prompts/retrieval_128k/retrieval_${DATASET}_128k.txt \
    --data_dir ${BASE_DIR}/data/retrieval/${DATASET}/128k \
    --split dev \
    --context_length 128k \
    --output_path ${BASE_DIR}/outputs/retrieval/${DATASET}/128k/predictions.jsonl \
    --project_id ${PROJECT_ID}
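Because the command is parameterized by ${DATASET}, running several retrieval datasets is a small shell loop. The lower-case dataset identifiers below are an assumption based on the datasets named earlier, not a documented list:

for DATASET in arguana fever msmarco; do
  python run_inference.py \
    --prompt_prefix_path ${BASE_DIR}/prompts/retrieval_128k/retrieval_${DATASET}_128k.txt \
    --data_dir ${BASE_DIR}/data/retrieval/${DATASET}/128k \
    --split dev \
    --context_length 128k \
    --output_path ${BASE_DIR}/outputs/retrieval/${DATASET}/128k/predictions.jsonl \
    --project_id ${PROJECT_ID}
done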
These scripts generate predictions and can be paired with the repository's evaluation scripts to assess performance on metrics such as recall and exact match.
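As a quick sanity check before running the full evaluation, exact match can be computed directly from the predictions file. This is a minimal sketch assuming, hypothetically, that each JSONL line stores the model output under "prediction" and the gold answers under "answers"; the repository's actual schema may use different field names:

import json

def exact_match(prediction: str, answers: list[str]) -> bool:
    """True if the prediction matches any gold answer, ignoring case and whitespace."""
    norm = prediction.strip().lower()
    return any(norm == a.strip().lower() for a in answers)

hits = total = 0
with open("predictions.jsonl") as f:   # the file written via --output_path above
    for line in f:
        record = json.loads(line)
        # "prediction" and "answers" are assumed field names, not the repo's schema.
        hits += exact_match(record["prediction"], record["answers"])
        total += 1

print(f"Exact match: {hits / total:.3f} over {total} examples")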
Conclusion
LOFT offers a comprehensive playground for testing how well language models handle long-context tasks autonomously. It challenges the traditional approach of wiring together separate tools for different tasks and promotes a streamlined method in which a single long-context model does the work. The project serves as both a resource and a benchmark for researchers aiming to extend the boundaries of current AI capabilities in language processing and beyond.