OCRmyPDF - Make Scanned PDFs Searchable with OCR Text Layers

Introduction to OCRmyPDF

OCRmyPDF adds an OCR (Optical Character Recognition) text layer to scanned PDF files so that their content can be searched, copied, and pasted. It is particularly valuable for turning PDFs that consist only of images into searchable, accessible documents.

Features of OCRmyPDF

Searchable PDFs: Converts regular PDFs into searchable PDF/A files, an ISO-standardized version ideal for long-term preservation.
Accurate OCR Text Placement: Places the OCR text directly beneath the page image and aligned with it, which makes copying and pasting straightforward.
Resolution Maintenance: Keeps the original resolution of embedded images, ensuring no quality loss.
Lossless Operation: Where possible, inserts OCR information without altering any other content of the file.
File Optimization: Often compresses and optimizes PDFs, resulting in smaller file sizes than the original.
Image Improvement: Can deskew and clean page images before OCR to improve recognition.
Multi-Core Processing: Distributes work across all available CPU cores.
Multi-language Support: Uses Tesseract OCR, which supports over 100 languages, and can process documents that mix several languages.
Scalability: Efficiently handles large files with thousands of pages.
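
Several of the features above map onto command-line options. The following is a minimal sketch, not an official wrapper: it assembles an ocrmypdf command line from Python and runs it only when the executable and the input file are present. The helper name build_ocrmypdf_command and the file names scan.pdf and scan_ocr.pdf are hypothetical; the options used (--output-type, --deskew, --clean, --jobs) are standard ocrmypdf flags, and --clean additionally needs the external unpaper program.

```python
import os
import shutil
import subprocess
from typing import List, Optional


def build_ocrmypdf_command(
    input_pdf: str,
    output_pdf: str,
    deskew: bool = False,
    clean: bool = False,
    jobs: Optional[int] = None,
    output_type: str = "pdfa",
) -> List[str]:
    """Assemble an ocrmypdf command line as a list of arguments.

    Hypothetical helper for illustration; it only builds the argument
    list and does not run anything.
    """
    cmd = ["ocrmypdf", "--output-type", output_type]  # "pdfa" requests PDF/A output
    if deskew:
        cmd.append("--deskew")  # straighten crooked pages before OCR
    if clean:
        cmd.append("--clean")  # clean pages before OCR (requires unpaper)
    if jobs is not None:
        cmd += ["--jobs", str(jobs)]  # limit the number of parallel workers
    cmd += [input_pdf, output_pdf]
    return cmd


if __name__ == "__main__":
    # File names are placeholders for this sketch.
    command = build_ocrmypdf_command("scan.pdf", "scan_ocr.pdf", deskew=True, jobs=4)
    print(" ".join(command))
    # Run only if ocrmypdf is installed and the input file actually exists.
    if shutil.which("ocrmypdf") and os.path.exists("scan.pdf"):
        subprocess.run(command, check=True)
```

Building the argument list separately from running it keeps the sketch easy to inspect, and passing a list to subprocess.run avoids shell quoting problems with file names that contain spaces.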

Motivation Behind OCRmyPDF

OCRmyPDF grew out of the need for a reliable command-line tool for OCR processing of PDF files. Existing solutions had drawbacks such as incorrect text placement, poor handling of multilingual characters, altered image resolution, and large or invalid PDF output, and none of them could generate PDF/A files. OCRmyPDF was created to address these gaps.

Installation

OCRmyPDF runs on Linux, Windows, macOS, and FreeBSD. Docker images are available for both x64 and ARM architectures. It can be installed with each platform's usual package manager:

Debian/Ubuntu: apt install ocrmypdf
Fedora: dnf install ocrmypdf
macOS (Homebrew): brew install ocrmypdf
macOS (MacPorts): port install ocrmypdf
FreeBSD: pkg install py-ocrmypdf
Conda: conda install ocrmypdf

For more installation details, refer to the documentation.

Language Support

OCRmyPDF uses Tesseract for OCR, and Tesseract needs a language pack for each language it is asked to recognize. Linux users can typically install these packs through their system’s package manager. Languages are selected on the command line with the -l LANG argument; several codes can be joined with a plus sign (for example, -l eng+fra) to process a document that contains more than one language.
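
As a rough illustration of how language selection might be handled in a script, the sketch below builds the -l argument from a list of language codes and checks the requested codes against the output of tesseract --list-langs. All function names here are hypothetical, and the parser assumes that the first line of that output is a header followed by one language code per line; the sample text stands in for real output.

```python
from typing import Iterable, List, Set


def language_argument(languages: Iterable[str]) -> List[str]:
    """Return the -l option for one or more Tesseract language codes.

    Multiple codes are joined with "+", e.g. ["eng", "deu"] -> "eng+deu".
    """
    codes = [code.strip() for code in languages if code.strip()]
    if not codes:
        raise ValueError("at least one language code is required")
    return ["-l", "+".join(codes)]


def parse_installed_languages(list_langs_output: str) -> Set[str]:
    """Parse text in the shape printed by `tesseract --list-langs`.

    Assumption: the first line is a header and every following
    non-empty line is a single language code.
    """
    lines = list_langs_output.strip().splitlines()
    return {line.strip() for line in lines[1:] if line.strip()}


def missing_languages(wanted: Iterable[str], installed: Set[str]) -> List[str]:
    """List the requested codes that have no installed language pack."""
    return [code for code in wanted if code not in installed]


if __name__ == "__main__":
    # Sample text standing in for real `tesseract --list-langs` output.
    sample = "List of available languages (3):\neng\nfra\nosd\n"
    installed = parse_installed_languages(sample)
    print(language_argument(["eng", "deu"]))
    print(missing_languages(["eng", "deu"], installed))
```

Checking for missing packs before starting a long OCR run gives an early, readable error instead of a failure partway through processing.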

Documentation and Support

OCRmyPDF includes a comprehensive built-in help feature that can be accessed via ocrmypdf --help. Detailed documentation is available on Read the Docs. For any issues, users are encouraged to report them on the GitHub issues page.

Requirements

In addition to Python 3.8 or later, OCRmyPDF requires Ghostscript and Tesseract OCR to be installed as external programs. OCRmyPDF itself is written in pure Python and runs on many platforms.
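
A quick pre-flight check of these requirements could look like the sketch below. It is an illustration under assumptions rather than part of OCRmyPDF: the function name check_requirements is hypothetical, and the executable names searched for (gs on Unix-like systems, gswin64c or gswin32c on Windows, and tesseract) are the usual ones but may differ on a given installation.

```python
import shutil
import sys
from typing import Callable, List, Optional, Tuple

MIN_PYTHON = (3, 8)  # minimum version stated in the requirements above
# Assumed executable names for Ghostscript on Unix-like systems and Windows.
GHOSTSCRIPT_NAMES = ("gs", "gswin64c", "gswin32c")


def check_requirements(
    python_version: Tuple[int, int] = sys.version_info[:2],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return a list of human-readable problems; empty if all checks pass.

    `which` is injectable so the check can be exercised without the
    external programs actually being installed.
    """
    problems = []
    if tuple(python_version) < MIN_PYTHON:
        problems.append("Python 3.8 or later is required")
    if not any(which(name) for name in GHOSTSCRIPT_NAMES):
        problems.append("Ghostscript executable not found on PATH")
    if which("tesseract") is None:
        problems.append("Tesseract executable not found on PATH")
    return problems


if __name__ == "__main__":
    issues = check_requirements()
    if issues:
        for issue in issues:
            print("Missing requirement:", issue)
    else:
        print("All basic requirements appear to be satisfied.")
```

The check only confirms that the programs can be found; it does not verify their versions, which the official documentation covers in more detail.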

Licensing

The project is open source and distributed under the Mozilla Public License 2.0 (MPL-2.0). The license permits integration with other code and asks users to publish source-level modifications they make to OCRmyPDF itself.

Business Enquiries

The OCRmyPDF project welcomes enquiries about feature development and consulting, whether to extend the feature set or to help integrate OCRmyPDF into larger systems.

For more details and advanced configurations, please refer to the official OCRmyPDF documentation.