Introduction to ChatPDF
ChatPDF is an innovative project that allows users to interact with PDFs and other document formats using a conversational interface built on local Large Language Models (LLMs). The project's aim is to provide an efficient and accurate method for performing knowledge retrieval and answering questions based on document content, utilizing advanced language processing technologies.
Features
- GraphRAG Implementation: ChatPDF includes a lightweight implementation of GraphRAG (graph-based Retrieval-Augmented Generation) that supports document QA via relational graph retrieval in local mode.
- API & LLM Support: The system works with several LLM APIs, including OpenAI, DeepSeek, and Ollama, and can be extended to support additional LLMs. It also supports several embedding backends: OpenAI embeddings, local text2vec embeddings, and Hugging Face and Sentence-Transformers embeddings.
- Asynchronous Development: ChatPDF can send requests to multiple APIs concurrently, which improves throughput and response time.
- Wide Range of LLMs: The project supports several open-source LLMs, such as ChatGLM3-6B, Chinese-LLaMA-Alpaca-2, Baichuan, and Yi.
- File Format Compatibility: ChatPDF handles multiple document formats, including PDF, DOCX, Markdown, and TXT.
- RAG Accuracy Optimization: The project includes several optimizations that improve RAG accuracy:
  - Chinese chunk splitting is optimized, which improves handling of multilingual documents.
  - Embeddings use text2vec sentence embeddings together with similarity matching.
  - Retrieval combines lexical and semantic matching: jieba tokenization with rank_bm25 scores keyword overlap with the query, and a weighted combination of that lexical score with sentence-embedding similarity refines the candidate set.
  - A reranker module re-scores the retrieved candidates, reducing confusion between similar candidates and improving selection accuracy. The reranker model is configured with the rerank_model_name_or_path parameter.
  - Candidate chunk context extension: the num_expand_context_chunk parameter adjusts the context window size of the selected chunks.
- RAG Model Optimization: The project is tuned for base model performance and supports custom RAG models. The base model is set with the generate_model_name_or_path parameter.
- Gradio Interface: The project provides a user-friendly RAG dialogue page built with Gradio, with streaming output for conversations.
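The candidate chunk context extension from the feature list can be pictured with a small sketch. This is not the project's code: the function name expand_context is hypothetical, and the symmetric window (the same number of neighboring chunks before and after the hit) is an assumption about what num_expand_context_chunk controls.

```python
from typing import List


def expand_context(chunks: List[str], hit_index: int,
                   num_expand_context_chunk: int = 1) -> str:
    """Join a matched chunk with its neighbors on each side.

    Assumption: the window is symmetric around the hit. The real
    project may expand the context differently.
    """
    if not 0 <= hit_index < len(chunks):
        raise IndexError("hit_index is outside the corpus")
    if num_expand_context_chunk < 0:
        raise ValueError("num_expand_context_chunk must be non-negative")
    # Clamp the window so it never runs past either end of the corpus.
    start = max(0, hit_index - num_expand_context_chunk)
    end = min(len(chunks), hit_index + num_expand_context_chunk + 1)
    return " ".join(chunks[start:end])


chunks = [
    "Chapter 1 intro.",
    "BM25 is a lexical ranker.",
    "It scores term overlap.",
    "Embeddings capture meaning.",
]
# The hit is chunk 1; one neighbor on each side is added as context.
print(expand_context(chunks, hit_index=1, num_expand_context_chunk=1))
```

A larger window gives the LLM more surrounding text at the cost of a longer prompt, so a small value is a reasonable starting point.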
How It Works
ChatPDF embeds document content into a vector space. When a user asks a question, the system retrieves the relevant document sections using a combination of lexical and contextual (embedding-based) similarity measures. The retrieved sections are then passed to the LLM as context so that it can generate a coherent, relevant answer.
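The retrieval step can be sketched as a weighted blend of a lexical score and an embedding similarity. The sketch below is illustrative only and is not the project's implementation: the BM25 scoring is written by hand instead of using rank_bm25, a regex tokenizer stands in for jieba, character trigram counts stand in for text2vec sentence embeddings, and the names hybrid_search and lexical_weight are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    # Stand-in tokenizer (assumption); the project uses jieba.
    return re.findall(r"\w+", text.lower())


def bm25_scores(query: str, docs: List[str],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Okapi BM25 score of every document against the query."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    doc_freq = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (len(docs) - df + 0.5) / (df + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def toy_embed(text: str) -> Counter:
    # Toy stand-in for a sentence embedding: character trigram counts.
    s = " ".join(tokenize(text))
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def hybrid_search(query: str, docs: List[str], lexical_weight: float = 0.3,
                  top_k: int = 2) -> List[Tuple[str, float]]:
    """Rank docs by a weighted sum of BM25 and embedding similarity."""
    lexical = bm25_scores(query, docs)
    max_lex = max(lexical, default=0.0)
    if max_lex > 0:
        # Scale BM25 into [0, 1] so both scores are comparable.
        lexical = [s / max_lex for s in lexical]
    q_vec = toy_embed(query)
    semantic = [cosine(q_vec, toy_embed(d)) for d in docs]
    combined = [lexical_weight * lex + (1 - lexical_weight) * sem
                for lex, sem in zip(lexical, semantic)]
    ranked = sorted(zip(docs, combined), key=lambda p: p[1], reverse=True)
    return ranked[:top_k]


docs = [
    "The reranker reorders retrieved candidates.",
    "Gradio provides the web interface.",
    "BM25 scores lexical overlap with the query.",
]
query = "How does BM25 score the query?"
for doc, score in hybrid_search(query, docs):
    print(f"{score:.3f}  {doc}")
```

In a full pipeline, the top candidates from this step would go to the reranker and then into the LLM prompt. The weight between the two scores is a tuning choice; the value used here is arbitrary.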
Getting Started
Installation
To get started with ChatPDF, install the required dependencies:
pip install -r requirements.txt
Windows users are advised to run ChatPDF under WSL in a Linux environment, especially if CUDA support is needed to run large models. Configuring pip to use the Douban PyPI mirror can improve download speeds.
Utilizing the RAG Example
Run the example with the following command. Depending on your system configuration, the interpreter may be python or python3:
CUDA_VISIBLE_DEVICES=0 python rag.py
Launching the Gradio Web Service
Initiate the Gradio web service with:
CUDA_VISIBLE_DEVICES=0 python webui.py --corpus_files data/sample.pdf --share
Access the ChatPDF interface at http://localhost:7860 in a web browser.
GraphRAG Demonstration
To run a demonstration of GraphRAG, please set the OpenAI API key in your environment as follows:
export OPENAI_API_KEY="sk-..."
Then execute:
python graphrag_demo.py
Contact and Contribution
For suggestions or improvements, reach out via GitHub issues or contact [email protected]. Contributions are welcome as the project seeks to refine its capabilities and expand its utility.
ChatPDF is released under the Apache License 2.0, which allows free commercial use, provided that credit to ChatPDF is kept in the product documentation.
Related projects cater to similar interests, notably the MedicalGPT project, which focuses on developing advanced GPT models with incremental pre-training and fine-tuning capabilities.