ChatPDF - Improve Document-Based Knowledge Retrieval with Local LLMs and GraphRAG

Introduction to ChatPDF

ChatPDF lets users query PDFs and other document formats through a conversational interface built on local Large Language Models (LLMs). The project aims to provide efficient, accurate knowledge retrieval and question answering over document content.

Features

GraphRAG Implementation: ChatPDF includes a lightweight implementation of GraphRAG (graph-based Retrieval-Augmented Generation) that supports document QA via relational graph retrieval in local mode.
API & LLM Support: The system works with several LLM APIs, including OpenAI, Deepseek, and Ollama, and can be extended to support additional LLMs. It also supports several embedding backends: OpenAI embedding, local text2vec embedding, Hugging Face, and Sentence-Transformers.
Asynchronous Requests: ChatPDF can send asynchronous requests to multiple APIs at the same time, which reduces overall waiting time.
Wide Range of LLMs: The project supports several open-source LLMs, such as ChatGLM3-6b, Chinese-LLaMA-Alpaca-2, Baichuan, and YI.
File Format Compatibility: ChatPDF handles multiple document formats, including PDF, docx, markdown, and txt.
RAG Accuracy Optimization: The project has introduced several optimizations for improving RAG accuracy:
- Optimized Chinese chunk splitting for better handling of multilingual documents.
- Improved embeddings with text2vec sentence embedding and similarity matching algorithms.
- Retrieval matching uses jieba tokenization with rank_BM25 for lexical matching on query keywords, then takes a weighted combination of lexical similarity and sentence-embedding similarity to refine the candidate set.
- A reranker module re-scores the candidates returned by retrieval, so that the most relevant chunks are selected more reliably. Users can configure the reranker model with the rerank_model_name_or_path parameter.
- Candidate chunks can be extended with neighboring context; the num_expand_context_chunk parameter adjusts the context window size of the selected chunks.
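The context extension controlled by num_expand_context_chunk can be pictured with a minimal sketch. The function name expand_context and the choice to expand symmetrically on both sides of a hit are assumptions made for illustration; the project's actual expansion logic may differ.

```python
def expand_context(chunks, hit_indices, num_expand_context_chunk=1):
    """Join each retrieved chunk with its neighbouring chunks.

    Hypothetical helper: expanding symmetrically (before and after the hit)
    is an assumption, not necessarily what ChatPDF does internally.
    """
    expanded = []
    for idx in hit_indices:
        # Clamp the window so it never runs past either end of the document.
        start = max(0, idx - num_expand_context_chunk)
        end = min(len(chunks), idx + num_expand_context_chunk + 1)
        expanded.append(" ".join(chunks[start:end]))
    return expanded


if __name__ == "__main__":
    chunks = ["A", "B", "C", "D", "E"]
    # With a window of 1, the hit at index 2 is returned with one neighbour on each side.
    print(expand_context(chunks, [2], num_expand_context_chunk=1))
```

A larger value gives the LLM more surrounding text per hit at the cost of a longer prompt; a value of 0 returns the selected chunks unchanged.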
RAG Model Optimization: The project is optimized for base model performance and supports custom RAG models, with the option to set the base model using the generate_model_name_or_path parameter.
Gradio Interface: A Gradio-based RAG dialogue page provides a user-friendly web UI and supports streaming conversations.
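The asynchronous request feature listed above can be illustrated with a small asyncio sketch. Everything here is a stand-in: fake_llm_call, the provider names, and the delays are hypothetical and replace real network calls, so the example only shows the concurrency pattern, not ChatPDF's actual client code.

```python
import asyncio
from typing import List


async def fake_llm_call(provider: str, prompt: str, delay: float) -> str:
    # Hypothetical stand-in for a real API request; sleeping simulates network latency.
    await asyncio.sleep(delay)
    return f"{provider}: answer to {prompt!r}"


async def ask_all(prompt: str) -> List[str]:
    calls = [
        fake_llm_call("provider_a", prompt, 0.02),
        fake_llm_call("provider_b", prompt, 0.01),
    ]
    # gather runs the coroutines concurrently and returns results in input order,
    # regardless of which call finishes first.
    return await asyncio.gather(*calls)


if __name__ == "__main__":
    for line in asyncio.run(ask_all("What is GraphRAG?")):
        print(line)
```

Because the calls overlap, the total wait is close to the slowest single request rather than the sum of all requests.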

How It Works

ChatPDF embeds the content of documents into a vector space. When a user poses a question, the system retrieves relevant document sections using a combination of lexical similarity and contextual (embedding-based) similarity. The retrieved sections are then passed to the LLM, which generates an answer grounded in that content.
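The combination of lexical and contextual similarity can be sketched in plain Python. This is an illustration under stated assumptions, not the project's code: whitespace tokenization stands in for jieba, a hand-written BM25 stands in for rank_BM25, term-count cosine similarity stands in for sentence embeddings, and the names hybrid_rank and lexical_weight are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace tokenization as a stand-in for jieba.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a basic BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine_scores(query, docs):
    """Assumption: term-count vectors as a stand-in for sentence embeddings."""
    q = Counter(tokenize(query))
    q_norm = math.sqrt(sum(v * v for v in q.values()))
    scores = []
    for d in docs:
        v = Counter(tokenize(d))
        dot = sum(q[t] * v[t] for t in q)
        d_norm = math.sqrt(sum(x * x for x in v.values()))
        scores.append(dot / (q_norm * d_norm) if q_norm and d_norm else 0.0)
    return scores


def min_max(values):
    # Rescale to [0, 1] so the two score types are comparable before mixing.
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def hybrid_rank(query, docs, lexical_weight=0.5):
    """Return document indices ordered by the weighted sum of both scores."""
    lex = min_max(bm25_scores(query, docs))
    sem = min_max(cosine_scores(query, docs))
    combined = [lexical_weight * l + (1 - lexical_weight) * s for l, s in zip(lex, sem)]
    return sorted(range(len(docs)), key=lambda i: combined[i], reverse=True)


if __name__ == "__main__":
    docs = [
        "the cat sat on the mat",
        "dogs chase cats in the park",
        "stock markets fell sharply today",
    ]
    print(hybrid_rank("cat on the mat", docs))
```

Normalizing both score lists before mixing keeps either signal from dominating purely because of its scale; the weight then expresses how much to trust keyword overlap versus embedding similarity.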

Figure: ChatPDF operational illustration

Getting Started

Installation

To get started with ChatPDF, install the required dependencies:

pip install -r requirements.txt

For Windows users, running under WSL with a Linux environment is recommended, especially if CUDA support is needed for large models. Configuring pip to use the Douban package mirror can improve download speeds.

Utilizing the RAG Example

Run the example with the following command. Depending on your system configuration, the interpreter may be named python or python3:

CUDA_VISIBLE_DEVICES=0 python rag.py

Launching the Gradio Web Service

Initiate the Gradio web service with:

CUDA_VISIBLE_DEVICES=0 python webui.py --corpus_files data/sample.pdf --share

Access the ChatPDF interface at http://localhost:7860 through a web browser.

GraphRAG Demonstration

To run the GraphRAG demonstration, first set your OpenAI API key in the environment:

export OPENAI_API_KEY="sk-..."

Then execute:

python graphrag_demo.py
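If the demo fails because the key is missing, a small pre-flight check can make the cause obvious. The helper below is a hypothetical sketch and is not part of graphrag_demo.py; only the OPENAI_API_KEY variable name comes from the instructions above.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper for illustration; `env` can be any mapping,
    which makes the check easy to exercise without touching os.environ.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the demo.")
    return key


if __name__ == "__main__":
    try:
        require_api_key()
        print("API key found.")
    except RuntimeError as err:
        print(err)
```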

Contact and Contribution

For suggestions or improvements, reach out via GitHub issues or contact [email protected]. Contributions are welcome as the project seeks to refine its capabilities and expand its utility.

ChatPDF is released under the Apache License 2.0, which allows free commercial use, provided that credit to ChatPDF is maintained in the product documentation.

A related project is MedicalGPT, which focuses on developing GPT models with incremental pre-training and fine-tuning.