RAGatouille - Streamlined Advanced Retrieval in RAG Pipelines

Introduction to RAGatouille

RAGatouille is a project designed to simplify the use of state-of-the-art retrieval methods in Retrieval-Augmented Generation (RAG) pipelines. It emphasizes modularity and ease of use while staying grounded in research, and it acts as a bridge between research findings and practical RAG pipeline applications.

Motivation Behind RAGatouille

The primary goal of RAGatouille is to close the gap between advanced retrieval research and its application in RAG pipelines. A RAG pipeline has several components, and the retrieval model is among the most important. Dense retrieval models that embed each text as a single vector, such as OpenAI's text-embedding-ada-002, have been the usual choice, but recent information retrieval research suggests that they are not always the best option. Alternatives such as ColBERT, a late-interaction model, can be more data-efficient, adapt better to new domains, and suit low-resource languages.

Core Features

RAGatouille simplifies the use of state-of-the-art retrieval models, with a particular focus on making ColBERT models more accessible, so that users can apply them without first working through the retrieval literature.

1. Training and Fine-Tuning

RAGatouille provides tools for training and fine-tuning retrieval models. Its RAGTrainer uses a built-in TrainingDataProcessor that converts several forms of input data into training-compatible triplets. The processing includes deduplication, mapping of positives and negatives, and mining of hard negatives: passages that resemble a relevant passage without answering the query, which makes them especially useful for training.

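from ragatouille import RAGTrainer
# Raw (query, relevant passage) pairs; the list is truncated here.
my_data = [("What is the meaning of life ?", "The meaning of life is 42"), ...]
# model_name labels the model being trained ("MyColBERT" is a placeholder);
# pretrained_model_name is the base checkpoint to fine-tune.
trainer = RAGTrainer(model_name="MyColBERT", pretrained_model_name="colbert-ir/colbertv2.0")
trainer.prepare_training_data(raw_data=my_data)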
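
To make the idea of training triplets concrete, here is a minimal pure-Python sketch of the steps described above: deduplicating pairs, mapping positives per query, and attaching a negative to each pair. It is an illustration only and not RAGatouille's TrainingDataProcessor. The function build_triplets and its parameters are hypothetical, and random sampling stands in for real hard-negative mining, which would pick negatives that are similar to the positives.

```python
import random


def build_triplets(pairs, num_negatives=1, seed=0):
    """Turn (query, positive) pairs into (query, positive, negative) triplets."""
    rng = random.Random(seed)
    # Deduplicate pairs while preserving their original order.
    unique_pairs = list(dict.fromkeys(pairs))
    # Map each query to the set of passages labelled as relevant to it.
    positives = {}
    for query, passage in unique_pairs:
        positives.setdefault(query, set()).add(passage)
    # Every distinct passage in the data is a candidate negative.
    all_passages = list(dict.fromkeys(p for _, p in unique_pairs))
    triplets = []
    for query, passage in unique_pairs:
        candidates = [p for p in all_passages if p not in positives[query]]
        # Assumption: random sampling replaces hard-negative mining here.
        k = min(num_negatives, len(candidates))
        for negative in rng.sample(candidates, k):
            triplets.append((query, passage, negative))
    return triplets


pairs = [
    ("What is the meaning of life ?", "The meaning of life is 42"),
    ("What is the meaning of life ?", "The meaning of life is 42"),  # duplicate
    ("What is ColBERT?", "ColBERT is a late-interaction retrieval model"),
]
triplets = build_triplets(pairs)
for triplet in triplets:
    print(triplet)
```

The duplicate pair is dropped before sampling, so each query contributes one triplet whose negative is a passage that was never labelled as relevant to that query.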

2. Embedding and Indexing Documents

Creating indices for documents is made straightforward. By loading a trained model (either your own or a pretrained one from a hub), you can index documents swiftly. The indexing process involves document splitting, tokenization, term identification, embedding, and vector storage.

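from ragatouille import RAGPretrainedModel
# Load a pretrained ColBERT checkpoint by its hub name.
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
# my_documents is assumed to be a list of document strings defined elsewhere.
index_path = RAG.index(index_name="my_index", collection=my_documents)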
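
The first indexing step, document splitting, can be sketched in a few lines. The example below is a simplified, word-based illustration under stated assumptions and does not reproduce how RAGatouille splits documents internally. The function split_document and its max_words and overlap parameters are hypothetical names chosen for this sketch.

```python
def split_document(text, max_words=8, overlap=2):
    """Split text into chunks of at most max_words words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    chunk boundary still appears with some context in the next chunk.
    """
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current chunk has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


text = "one two three four five six seven eight nine ten"
chunks = split_document(text, max_words=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk would then be tokenized, embedded, and stored in the index. Overlapping chunks trade a slightly larger index for a lower chance of splitting a relevant passage in two.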

3. Retrieving Documents

Retrieving documents from an index is as easy as building one. Users can conduct searches on the indexed documents seamlessly. Results are returned as structured data that includes content, score, rank, and optional metadata if provided during indexing.

from ragatouille import RAGPretrainedModel
query = "ColBERT my dear ColBERT, who is the fairest document of them all?"
# Load the model together with a previously built index.
RAG = RAGPretrainedModel.from_index("path_to_your_index")
# Returns the ranked results for the query.
results = RAG.search(query)
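
In a RAG pipeline, the retrieved passages are usually assembled into a context string for the generator. The sketch below shows one way to do that, assuming each result is a dictionary with content, score, and rank keys as described above. The sample_results values and the build_context helper are hypothetical and exist only for illustration; the scores are made up and are not real model output.

```python
# Hypothetical results, shaped like the structure described above.
sample_results = [
    {"content": "ColBERT scores documents token by token.", "score": 24.1, "rank": 1},
    {"content": "Dense models embed a passage as one vector.", "score": 19.7, "rank": 2},
    {"content": "Ratatouille is a vegetable dish.", "score": 8.3, "rank": 3},
]


def build_context(results, min_score=0.0, max_passages=3):
    """Join the best-ranked passages scoring at least min_score into one string."""
    kept = [r for r in results if r["score"] >= min_score]
    kept.sort(key=lambda r: r["rank"])
    return "\n\n".join(f"[{r['rank']}] {r['content']}" for r in kept[:max_passages])


context = build_context(sample_results, min_score=10.0, max_passages=2)
print(context)
```

Filtering on a score threshold keeps weakly related passages out of the prompt. A sensible threshold depends on the model and the data, so the value used here is a placeholder.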

Integration and Use Cases

RAGatouille is versatile and integrates well with existing systems. It can power projects as extensive as Spotify's vector search framework, evidencing its scalability and effectiveness. It also supports various integrations including:

ColBERT: Official implementations with easy API queries.
Vespa: A managed RAG engine offering extensive retrieval options.
FastRAG and LlamaIndex: Both offer support for ColBERT models, ensuring wide compatibility.

Getting Started

To use RAGatouille, simply install it via pip:

pip install ragatouille

Note that it requires a Linux environment or WSL2 on Windows to function correctly.

Conclusion

RAGatouille lets users apply advanced retrieval methods in RAG pipelines with little effort. By focusing on accessible and modular retrieval methods, it allows users to adapt and optimize their models for specific needs and domains, opening up new possibilities in information retrieval and knowledge generation.