LLM-Aided OCR Project
Introduction
The LLM-Aided OCR Project improves Optical Character Recognition (OCR) outcomes by post-processing raw OCR text with large language models (LLMs) and other natural language processing techniques, transforming it into accurate, well-structured, easy-to-read documents.
Example Outputs
Users can explore the capabilities of the LLM-Aided OCR Project by reviewing the example outputs provided in the repository.
Features
The project offers a comprehensive set of features aimed at enhancing the OCR process:
- Converts PDFs into images for processing.
- Utilizes Tesseract for initial OCR.
- Implements advanced error correction through LLMs, either locally or via APIs.
- Divides text efficiently into chunks for faster processing.
- Provides an optional Markdown formatting feature.
- Allows optional suppression of headers and page numbers for clarity.
- Evaluates the quality of the final output.
- Supports both local and cloud-based LLMs, including OpenAI and Anthropic.
- Features asynchronous processing to boost performance.
- Includes detailed logging for tracking and debugging.
- Utilizes GPU acceleration for local LLM tasks.
Detailed Technical Overview
PDF Processing and OCR
- PDF to Image Conversion: Converts PDF pages into images for OCR using the `pdf2image` library, with options to limit the number of pages processed.
- OCR Processing: Extracts text from images with `pytesseract`, enhanced by preprocessing steps like grayscale conversion and other text enhancement techniques. A minimal sketch of both steps follows this list.
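Here is a minimal sketch of these two steps, assuming `pdf2image` and `pytesseract` are installed and the Tesseract binary is on the system PATH; the function name and preprocessing choices are illustrative, not the project's actual API.

```python
from pdf2image import convert_from_path
from PIL import ImageOps
import pytesseract

def ocr_pdf(pdf_path: str, max_pages: int | None = None) -> list[str]:
    # Render PDF pages to PIL images; 300 DPI is a common choice for OCR.
    pages = convert_from_path(pdf_path, dpi=300, last_page=max_pages)
    texts = []
    for page in pages:
        # Basic preprocessing: grayscale conversion before running Tesseract.
        gray = ImageOps.grayscale(page)
        texts.append(pytesseract.image_to_string(gray))
    return texts
```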
Text Processing Pipeline
- Chunk Creation: Splits text into meaningful chunks, maintaining context by overlapping the chunks slightly (see the sketch after this list).
- Error Correction and Formatting: Uses LLMs to correct OCR errors and format the text, optionally converting it into Markdown.
- Duplicate Content Removal: Detects and removes repeated content to keep the output clear and concise.
- Header and Page Number Suppression: Configurable settings allow headers and page numbers to be removed or formatted distinctly.
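An illustrative chunker is shown below. The project's actual splitter works on meaningful boundaries rather than raw character counts, but the overlap idea is the same; the default sizes here are hypothetical.

```python
def split_into_chunks(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    # Split `text` into chunks of roughly `chunk_size` characters, with each
    # chunk sharing `overlap` trailing characters with its successor so the
    # LLM sees enough surrounding context to correct words near boundaries.
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return chunks
```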
LLM Integration
- Flexible LLM Support: Supports both local and cloud-based models, configurable via environment settings.
- Local and API-Based Handling: Offers functions for both local inference and API requests, with robust error handling and token management.
- Asynchronous Processing: Uses asynchronous programming to process chunks concurrently during API-based LLM tasks (see the sketch after this list).
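A sketch of concurrent chunk correction follows, assuming the `openai` Python package (v1+) with its `AsyncOpenAI` client; the model name, system prompt, and helper names are assumptions, not the project's actual code.

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def correct_chunk(chunk: str) -> str:
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Fix OCR errors in the text. Return only the corrected text."},
            {"role": "user", "content": chunk},
        ],
    )
    return response.choices[0].message.content

async def correct_all(chunks: list[str]) -> list[str]:
    # Run all chunk corrections concurrently; results keep the input order.
    return await asyncio.gather(*(correct_chunk(c) for c in chunks))

# corrected = asyncio.run(correct_all(chunks))
```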
Token Management
Ensures efficient token usage with functions for estimating token counts and dynamically adjusting chunk sizes based on content size and model constraints.
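A hedged sketch of token estimation, assuming the `tiktoken` library for an exact count with a rough character-based fallback; neither the encoding name nor the fallback ratio is taken from the project.

```python
def estimate_tokens(text: str, model_encoding: str = "cl100k_base") -> int:
    try:
        import tiktoken
        return len(tiktoken.get_encoding(model_encoding).encode(text))
    except Exception:
        # Crude fallback: roughly 4 characters per token for English text.
        return len(text) // 4
```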
Quality Assessment
Evaluates and scores the quality of the processed output compared to the original OCR text.
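The scoring method is not spelled out here; as a simple stand-in illustration, the snippet below uses Python's `difflib` to measure how closely the corrected text tracks the raw OCR text, which can flag corrections that drifted too far from the source.

```python
from difflib import SequenceMatcher

def similarity_score(raw_ocr: str, corrected: str) -> float:
    # Ratio in [0, 1]; values near 1 mean the corrected text stays close
    # to the raw OCR text, while low values may indicate over-rewriting.
    return SequenceMatcher(None, raw_ocr, corrected).ratio()
```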
Logging and Error Handling
Provides comprehensive logging and error messages to facilitate debugging, while filtering out unnecessary HTTP request logs.
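A sketch of such a logging setup; the specific HTTP client loggers silenced below (`httpx`, `urllib3`, `openai`) are assumptions about which libraries produce the noise.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
# Suppress noisy per-request logs from HTTP client libraries.
for noisy in ("httpx", "urllib3", "openai"):
    logging.getLogger(noisy).setLevel(logging.WARNING)
```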
Configuration and Customization
Settings are managed through a `.env` file, allowing easy configuration of LLM usage, API provider selection, model choices, and formatting preferences.
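A minimal sketch of loading such settings with `python-dotenv`; the variable names are hypothetical examples, not the project's documented keys.

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads KEY=value pairs from a .env file in the working directory

# Hypothetical setting names for illustration only.
USE_LOCAL_LLM = os.getenv("USE_LOCAL_LLM", "False").lower() == "true"
API_PROVIDER = os.getenv("API_PROVIDER", "OPENAI")  # e.g. "OPENAI" or "ANTHROPIC"
```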
Output and File Handling
The script produces several distinct output files, including both the raw and corrected texts, accompanied by logs detailing the processing steps and outcomes.
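As an illustration of the file handling (the actual output names are not specified here, so these suffixes are hypothetical):

```python
from pathlib import Path

def write_outputs(pdf_path: str, raw_text: str, corrected_text: str) -> None:
    # Write the raw OCR text and the LLM-corrected text next to the input PDF.
    base = Path(pdf_path).with_suffix("")
    Path(f"{base}__raw_ocr_output.txt").write_text(raw_text, encoding="utf-8")
    Path(f"{base}_llm_corrected.md").write_text(corrected_text, encoding="utf-8")
```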
Requirements
The project runs on Python 3.12+ and requires Tesseract for OCR along with the Python libraries listed in the project's requirements; cloud-based correction additionally requires an OpenAI or Anthropic API key.
Installation
The installation process involves setting up Python, creating a virtual environment, installing dependencies, and configuring environment variables.
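A typical sequence, assuming a standard Python project layout (the repository URL, directory name, and requirements file name are assumptions):

```bash
git clone <repository-url>
cd llm_aided_ocr            # hypothetical directory name
python -m venv venv
source venv/bin/activate    # on Windows: venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env        # if a template is provided; then edit your settings
```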
Usage
To use the project, place your PDF in the project directory, update the script with the PDF filename, and run the script to process it.
How It Works
The project follows a structured pipeline: converting the PDF to images, applying OCR, chunking and LLM-based error correction, and finally formatting and quality assessment.
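Tying the earlier sketches together, a high-level driver might look like the following; all helper names come from the illustrative snippets above, not from the project itself.

```python
import asyncio

def process_pdf(pdf_path: str) -> str:
    page_texts = ocr_pdf(pdf_path)                # PDF -> images -> raw OCR
    raw_text = "\n".join(page_texts)
    chunks = split_into_chunks(raw_text)          # overlapping chunks
    corrected = asyncio.run(correct_all(chunks))  # concurrent LLM correction
    final_text = "\n".join(corrected)
    # Crude quality check from the assessment sketch above.
    print(f"Similarity to raw OCR: {similarity_score(raw_text, final_text):.2f}")
    return final_text
```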
Code Optimization
Performance is enhanced through concurrent chunk processing, context preservation in chunking, and adaptive token management.
Limitations and Future Improvements
The effectiveness of the system is linked to the quality of the LLM used, and processing large documents can be resource-intensive.
Contributing
The project welcomes contributions, which can be made by forking the repository and submitting pull requests.
License
The project is open-source, licensed under the MIT License, encouraging collaboration and sharing.