WARC-GPT: Exploring Web Archives with AI
WARC-GPT is a tool that applies Artificial Intelligence (AI) to Web ARChive (WARC) files so that users can explore and retrieve information from web archives. It is an experimental Retrieval Augmented Generation (RAG) pipeline developed by the Library Innovation Lab (LIL) at Harvard Law School.
Features
WARC-GPT is built around a RAG pipeline for WARC files. The pipeline is customizable and can work with a range of Large Language Models (LLMs), providers, and embedding models. Its other primary features are:
- A REST API and a web UI.
- Embeddings visualization.
Installation
To get started with WARC-GPT, certain machine-level dependencies are required:
- Python 3.11 or higher
- Python Poetry, a tool for dependency management
Install WARC-GPT by cloning its repository from GitHub and setting up the environment using Poetry or a virtual environment, as detailed in the installation instructions.
Configuring the Application
Configuration is handled through environment variables. Users can customize settings by editing a copied .env file. WARC-GPT supports both the OpenAI API and Ollama for local inference, so it can run against hosted or locally served models.
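As a rough illustration of environment-based configuration, the sketch below parses .env-style text and lets real environment variables override the file's values. It is not WARC-GPT's own loader. The variable names and values in EXAMPLE_ENV are hypothetical placeholders, and the override rule is an assumption about how such a setup commonly behaves.

```python
import os


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, ignoring blank lines and # comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def load_settings(env_text: str, environ=None) -> dict[str, str]:
    """Treat file values as defaults; variables set in the environment win."""
    environ = os.environ if environ is None else environ
    settings = parse_env_text(env_text)
    for key in settings:
        if key in environ:
            settings[key] = environ[key]
    return settings


# Hypothetical variable names and values, for illustration only.
EXAMPLE_ENV = """
# Provider settings
OPENAI_API_KEY="sk-example"
OLLAMA_API_URL=http://localhost:11434
"""
```

Calling load_settings(EXAMPLE_ENV) returns a plain dictionary, which keeps the rest of an application independent of where each value came from.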
Ingesting WARCs
To begin, place WARC files in the ./warc directory. WARC-GPT processes these files, extracting and embedding text from text/html and application/pdf records. The embeddings are saved in a vector store, which serves as the knowledge base.
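The extraction step can be pictured with a minimal, standard-library-only sketch: decide whether a record's content type is one of the two supported formats, then pull visible text out of an HTML payload. This is not WARC-GPT's implementation. The function names are hypothetical, reading WARC records and extracting PDF text are left out, and skipping script and style content is an assumption about what "text" should include.

```python
from html.parser import HTMLParser

# The two content types named above.
SUPPORTED_TYPES = {"text/html", "application/pdf"}


def is_supported(content_type_header: str) -> bool:
    """Compare only the MIME type, ignoring parameters such as charset."""
    mime = content_type_header.split(";", 1)[0].strip().lower()
    return mime in SUPPORTED_TYPES


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the text fragments of an HTML document joined by single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

In a full pipeline, the resulting text would then be passed to an embedding model and the vectors written to the vector store.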
Starting the Server
WARC-GPT's server is started with a single command, after which the web UI is available at http://localhost:5000. The server queries the knowledge base to retrieve excerpts relevant to each query it receives.
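To make the retrieval step concrete, here is a minimal sketch of similarity search over an in-memory list of (excerpt, vector) pairs. It assumes cosine similarity as the ranking measure and uses made-up three-dimensional vectors. A real vector store and embedding model would replace both, and the source does not say which similarity measure WARC-GPT uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, store, k=2):
    """Return the k (score, excerpt) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy data: hypothetical excerpts with hand-written vectors, not real embeddings.
STORE = [
    ("Excerpt about court opinions", [0.9, 0.1, 0.0]),
    ("Excerpt about library hours", [0.1, 0.9, 0.0]),
    ("Excerpt about web archiving", [0.0, 0.2, 0.9]),
]
```

For example, top_k([0.0, 0.1, 1.0], STORE, k=1) ranks the web archiving excerpt first, because its vector points in nearly the same direction as the query.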
Interacting with the Web UI
The web UI provides an interactive platform enabling users to perform RAG searches within the indexed knowledge base. The UI supports a chat history feature for enhanced prompting during interactions.
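One way to picture how retrieved excerpts and chat history might be combined into a prompt is sketched below. The prompt wording, the build_prompt name, and the (role, content) history format are all assumptions made for illustration. They do not reflect the prompts WARC-GPT actually sends.

```python
def build_prompt(question, excerpts, history=()):
    """Assemble a plain-text prompt from excerpts, optional history, and a question."""
    lines = ["Answer using only the context below.", "", "Context:"]
    # Number the excerpts so an answer can refer back to them.
    lines += [f"[{i}] {text}" for i, text in enumerate(excerpts, start=1)]
    if history:
        lines += ["", "Conversation so far:"]
        lines += [f"{role}: {content}" for role, content in history]
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)
```

Including earlier turns lets a follow-up question such as "and when was that?" be interpreted against what was already discussed.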
Interacting with the API
WARC-GPT's API comprises several endpoints for seamless integration:
- [GET] /api/models: Fetch the list of available models.
- [POST] /api/search: Search the vector store for a given user prompt.
- [POST] /api/complete: Generate a text completion using an LLM.
Each endpoint is structured to provide detailed responses to input requests, enabling users to perform complex interactions programmatically.
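A client for the endpoints listed above could be sketched with the standard library as follows. The base URL is the local address given earlier. The JSON field name "message" is a hypothetical placeholder, since the request and response schemas are not described here, so check the project's API documentation before relying on it.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # local server address mentioned above


def build_request(method, path, payload=None):
    """Create a request object; JSON-encode the payload when one is given."""
    data = None
    headers = {}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )


def call(request):
    """Send the request and decode a JSON response (requires a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Building requests needs no network access; "message" is an assumed field name.
models_request = build_request("GET", "/api/models")
search_request = build_request(
    "POST", "/api/search", {"message": "What do these archives cover?"}
)
```

With the server running, call(models_request) would return whatever JSON the models endpoint produces.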
Visualizing Embeddings
Users can generate interactive 2D scatter plots of the stored embeddings using the t-SNE algorithm, which helps show how the contents of the vector store cluster.
Disclaimer
The Library Innovation Lab, creators of WARC-GPT, focuses on longevity, authenticity, reliability, and privacy in its projects. While WARC-GPT is a prototype, it reflects these principles and is available open-source for experimental use. As an innovation lab, LIL encourages user feedback and participation in evolving this experimental tool.
In conclusion, WARC-GPT is an open-source, experimental tool for users interested in applying AI to web archives, and the Library Innovation Lab welcomes community feedback on it.