warc-gpt - Explore Web Archives with AI-Driven Retrieval

WARC-GPT: Exploring Web Archives with AI

WARC-GPT is an innovative tool combining the power of Web ARChive (WARC) files and Artificial Intelligence (AI) to explore and retrieve information from web archives. It serves as an experimental Retrieval Augmented Generation (RAG) pipeline developed by the Library Innovation Lab (LIL) at Harvard Law School.

Features

WARC-GPT offers a diverse set of functionalities, making it a versatile tool for web archive exploration. At the heart of its capabilities lies the RAG pipeline for WARC files, allowing users to seamlessly interact with various Language Learning Models (LLMs), providers, and embedding models. Among its primary features are:

A REST API and user-friendly Web UI.
Advanced embeddings visualization techniques.

Installation

To get started with WARC-GPT, certain machine-level dependencies are required:

Python 3.11 or higher
Python Poetry, a tool for dependency management

Install WARC-GPT by cloning its repository from GitHub and setting up the environment using Poetry or a virtual environment, as detailed in the installation instructions.

Configuring the Application

Configuration is handled through environment variables. Users can personalize settings by editing a copied .env file. WARC-GPT supports integration with both the OpenAI API and Ollama for local inference, making the system highly adaptable to various setups.

Ingesting WARCs

For users to begin exploring, WARC files are placed in a designated ./warc directory. WARC-GPT processes these files to extract and embed text from text/html and application/pdf formats. These embeddings are saved in a vector store, creating a comprehensive knowledge base.

Starting the Server

WARC-GPT's server can be started using a simple command, making the web UI accessible via http://localhost:5000. It facilitates interactions with the system's knowledge base to retrieve pertinent excerpts for any queries received.

Interacting with the Web UI

The web UI provides an interactive platform enabling users to perform RAG searches within the indexed knowledge base. The UI supports a chat history feature for enhanced prompting during interactions.

Interacting with the API

WARC-GPT's API comprises several endpoints for seamless integration:

[GET] /api/models: Fetch available model listings.
[POST] /api/search: Conduct searches within the vector store given a user prompt.
[POST] /api/complete: Generate text completions using an LLM.

Each endpoint is structured to provide detailed responses to input requests, enabling users to perform complex interactions programmatically.

Visualizing Embeddings

Users can generate interactive 2D scatter plots using the T-SNE algorithm to visualize embeddings, enhancing the understanding of data structures within the vector store.

Disclaimer

The Library Innovation Lab, creators of WARC-GPT, focuses on longevity, authenticity, reliability, and privacy in its projects. While WARC-GPT is a prototype, it reflects these principles and is available open-source for experimental use. As an innovation lab, LIL encourages user feedback and participation in evolving this experimental tool.

In conclusion, WARC-GPT offers an open-source, exploratory space for users interested in leveraging AI to dive into web archives, propelled by a robust set of tools and community engagement facilitated by the Library Innovation Lab.