LLM-Aided OCR Project
Introduction
The LLM-Aided OCR Project improves Optical Character Recognition (OCR) outcomes by post-processing raw OCR text with large language models (LLMs) and other natural language processing techniques, transforming it into accurate, well-structured, easy-to-read documents.
Example Outputs
Users can explore the capabilities of the LLM-Aided OCR Project by reviewing the example outputs provided in the repository.
Features
The project offers a comprehensive set of features aimed at enhancing the OCR process:
- Converts PDFs into images for processing.
- Utilizes Tesseract for initial OCR.
- Implements advanced error correction through LLMs, either locally or via APIs.
- Divides text efficiently into chunks for faster processing.
- Provides an optional Markdown formatting feature.
- Allows optional suppression of headers and page numbers for clarity.
- Evaluates the quality of the final output.
- Supports both local and cloud-based LLMs, including OpenAI and Anthropic.
- Features asynchronous processing to boost performance.
- Includes detailed logging for tracking and debugging.
- Utilizes GPU acceleration for local LLM tasks.
Detailed Technical Overview
PDF Processing and OCR
- PDF to Image Conversion: Converts PDF pages into images for OCR using the `pdf2image` library, with options to limit the number of pages processed.
- OCR Processing: Extracts text from images with `pytesseract`, enhanced by preprocessing steps like grayscale conversion and other text enhancement techniques. A minimal sketch of both steps follows this list.
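Here is a minimal sketch of these two steps, assuming `pdf2image` and `pytesseract` are installed and the Tesseract binary is on the system PATH; the function name and preprocessing choices are illustrative, not the project's actual API.

```python
from pdf2image import convert_from_path
from PIL import ImageOps
import pytesseract

def ocr_pdf(pdf_path: str, max_pages: int | None = None) -> list[str]:
    # Render PDF pages to PIL images; 300 DPI is a common choice for OCR.
    pages = convert_from_path(pdf_path, dpi=300, last_page=max_pages)
    texts = []
    for page in pages:
        # Basic preprocessing: grayscale conversion before running Tesseract.
        gray = ImageOps.grayscale(page)
        texts.append(pytesseract.image_to_string(gray))
    return texts
```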
Text Processing Pipeline
- Chunk Creation: Splits text into meaningful chunks, maintaining context by overlapping the chunks slightly (see the sketch after this list).
- Error Correction and Formatting: Uses LLMs to correct OCR errors and format the text, optionally converting it into Markdown.
- Duplicate Content Removal: Detects and removes repeated content to keep the output clear and concise.
- Header and Page Number Suppression: Configurable settings allow headers and page numbers to be removed or formatted distinctly.
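An illustrative chunker is shown below. The project's actual splitter works on meaningful boundaries rather than raw character counts, but the overlap idea is the same; the default sizes here are hypothetical.

```python
def split_into_chunks(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    # Split `text` into chunks of roughly `chunk_size` characters, with each
    # chunk sharing `overlap` trailing characters with its successor so the
    # LLM sees enough surrounding context to correct words near boundaries.
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return chunks
```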
LLM Integration
- Flexible LLM Support: Supports both local and cloud-based models, configurable via environment settings.
- Local and API-Based Handling: Offers functions for both local inference and API requests, with robust error handling and token management.
- Asynchronous Processing: Uses asynchronous programming to process chunks concurrently during API-based LLM tasks (see the sketch after this list).
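A sketch of concurrent chunk correction follows, assuming the `openai` Python package (v1+) with its `AsyncOpenAI` client; the model name, system prompt, and helper names are assumptions, not the project's actual code.

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def correct_chunk(chunk: str) -> str:
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Fix OCR errors in the text. Return only the corrected text."},
            {"role": "user", "content": chunk},
        ],
    )
    return response.choices[0].message.content

async def correct_all(chunks: list[str]) -> list[str]:
    # Run all chunk corrections concurrently; results keep the input order.
    return await asyncio.gather(*(correct_chunk(c) for c in chunks))

# corrected = asyncio.run(correct_all(chunks))
```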
Token Management
Ensures efficient token usage with functions for estimating token counts and dynamically adjusting chunk sizes based on content size and model constraints.
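A hedged sketch of token estimation, assuming the `tiktoken` library for an exact count with a rough character-based fallback; neither the encoding name nor the fallback ratio is taken from the project.

```python
def estimate_tokens(text: str, model_encoding: str = "cl100k_base") -> int:
    try:
        import tiktoken
        return len(tiktoken.get_encoding(model_encoding).encode(text))
    except Exception:
        # Crude fallback: roughly 4 characters per token for English text.
        return len(text) // 4
```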
Quality Assessment
Evaluates and scores the quality of the processed output compared to the original OCR text.
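The scoring method is not spelled out here; as a simple stand-in illustration, the snippet below uses Python's `difflib` to measure how closely the corrected text tracks the raw OCR text, which can flag corrections that drifted too far from the source.

```python
from difflib import SequenceMatcher

def similarity_score(raw_ocr: str, corrected: str) -> float:
    # Ratio in [0, 1]; values near 1 mean the corrected text stays close
    # to the raw OCR text, while low values may indicate over-rewriting.
    return SequenceMatcher(None, raw_ocr, corrected).ratio()
```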
Logging and Error Handling
Provides comprehensive logging and error messages to facilitate debugging, while filtering out unnecessary HTTP request logs.
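A sketch of such a logging setup; the specific HTTP client loggers silenced below (`httpx`, `urllib3`, `openai`) are assumptions about which libraries produce the noise.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
# Suppress noisy per-request logs from HTTP client libraries.
for noisy in ("httpx", "urllib3", "openai"):
    logging.getLogger(noisy).setLevel(logging.WARNING)
```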
Configuration and Customization
Settings are managed through a `.env` file, allowing easy configuration of LLM usage, API provider selection, model choices, and formatting preferences.
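A minimal sketch of loading such settings with `python-dotenv`; the variable names are hypothetical examples, not the project's documented keys.

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads KEY=value pairs from a .env file in the working directory

# Hypothetical setting names for illustration only.
USE_LOCAL_LLM = os.getenv("USE_LOCAL_LLM", "False").lower() == "true"
API_PROVIDER = os.getenv("API_PROVIDER", "OPENAI")  # e.g. "OPENAI" or "ANTHROPIC"
```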
Output and File Handling
The script produces several distinct output files, including both the raw and corrected texts, accompanied by logs detailing the processing steps and outcomes.
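As an illustration of the file handling (the actual output names are not specified here, so these suffixes are hypothetical):

```python
from pathlib import Path

def write_outputs(pdf_path: str, raw_text: str, corrected_text: str) -> None:
    # Write the raw OCR text and the LLM-corrected text next to the input PDF.
    base = Path(pdf_path).with_suffix("")
    Path(f"{base}__raw_ocr_output.txt").write_text(raw_text, encoding="utf-8")
    Path(f"{base}_llm_corrected.md").write_text(corrected_text, encoding="utf-8")
```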
Requirements
The project runs on Python 3.12+ and requires Tesseract for OCR along with the Python libraries listed in the project's requirements; cloud-based correction additionally requires an OpenAI or Anthropic API key.
Installation
The installation process involves setting up Python, creating a virtual environment, installing dependencies, and configuring environment variables.
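A typical sequence, assuming a standard Python project layout (the repository URL, directory name, and requirements file name are assumptions):

```bash
git clone <repository-url>
cd llm_aided_ocr            # hypothetical directory name
python -m venv venv
source venv/bin/activate    # on Windows: venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env        # if a template is provided; then edit your settings
```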
Usage
To use the project, place your PDF in the project directory, update the script with the PDF filename, and run the script to process it.
How It Works
The project follows a structured pipeline: converting the PDF to images, applying OCR, chunking and LLM-based error correction, and finally formatting and quality assessment.
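Tying the earlier sketches together, a high-level driver might look like the following; all helper names come from the illustrative snippets above, not from the project itself.

```python
import asyncio

def process_pdf(pdf_path: str) -> str:
    page_texts = ocr_pdf(pdf_path)                # PDF -> images -> raw OCR
    raw_text = "\n".join(page_texts)
    chunks = split_into_chunks(raw_text)          # overlapping chunks
    corrected = asyncio.run(correct_all(chunks))  # concurrent LLM correction
    final_text = "\n".join(corrected)
    # Crude quality check from the assessment sketch above.
    print(f"Similarity to raw OCR: {similarity_score(raw_text, final_text):.2f}")
    return final_text
```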
Code Optimization
Performance is enhanced through concurrent chunk processing, context preservation in chunking, and adaptive token management.
Limitations and Future Improvements
The effectiveness of the system is linked to the quality of the LLM used, and processing large documents can be resource-intensive.
Contributing
The project welcomes contributions, which can be made by forking the repository and submitting pull requests.
License
The project is open-source, licensed under the MIT License, encouraging collaboration and sharing.