zerox - Effortlessly Enhance AI Systems with Comprehensive OCR and Document Conversion

Zerox OCR: A Seamless Way to Digitize Documents

Overview

Zerox OCR is designed to make optical character recognition (OCR) straightforward by converting documents in various formats into Markdown text suitable for AI consumption. Such documents often contain complex layouts with tables and charts, which are hard to process with text-only extraction. Zerox simplifies this by:

Receiving a document file (pdf, docx, image, etc.).
Converting the file into a series of images.
Using GPT to convert each image to Markdown text.
Compiling the responses into a complete Markdown document.
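
The four steps above can be sketched in a few lines. This is a minimal illustration rather than Zerox's actual implementation: `render_pages` and `image_to_markdown` are hypothetical callables standing in for the page rasterizer and the vision-model call.

```python
from typing import Callable, List


def document_to_markdown(
    file_path: str,
    render_pages: Callable[[str], List[bytes]],   # hypothetical rasterizer
    image_to_markdown: Callable[[bytes], str],    # hypothetical model call
) -> str:
    """Convert a document to Markdown, one page image at a time."""
    # Steps 1-2: take the file and turn it into one image per page.
    page_images = render_pages(file_path)
    # Step 3: ask the vision model for Markdown for each page image.
    page_markdown = [image_to_markdown(image) for image in page_images]
    # Step 4: join the per-page responses into a single document.
    return "\n\n".join(page_markdown)
```

Passing the two stages in as functions keeps the sketch independent of any particular rendering tool or model provider.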

Because a vision model reads each page as an image, tables, charts, and other layout elements are interpreted visually, which helps keep the extracted data and its formatting consistent. A hosted version, the Zerox OCR Demo, is available for testing.

Getting Started

Zerox is accessible as both a Node.js and Python package, making it versatile across different development environments.

Node package: installation and usage are described in the Node Zerox section below.
Python package: installation and usage are described in the Python Zerox section below.

Node Zerox

To install Zerox on Node.js:

npm install zerox

Zerox relies on graphicsmagick and ghostscript to convert PDFs to images. Both are usually installed automatically, but some systems may require installing them manually.

Example Usage

Using a file URL:

import { zerox } from "zerox";

const result = await zerox({
  filePath: "https://omni-demo-data.s3.amazonaws.com/test/cs101.pdf",
  openaiAPIKey: process.env.OPENAI_API_KEY
});

From a local path:

import path from "path";
import { zerox } from "zerox";

const result = await zerox({
  filePath: path.resolve(__dirname, "./cs101.pdf"),
  openaiAPIKey: process.env.OPENAI_API_KEY
});

Configuration Options

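const result = await zerox({
  // Required
  filePath: "path/to/file",
  openaiAPIKey: process.env.OPENAI_API_KEY,

  // Optional
  cleanup: true, // Clear images from the temp directory after the run
  concurrency: 10, // Number of pages to process at a time
  maintainFormat: false, // Slower, but helps keep formatting consistent
  model: 'gpt-4o-mini', // Model to use
  outputDir: undefined, // Directory in which to save the combined Markdown
  pagesToConvertAsImages: -1, // Page number(s) to convert; -1 converts all pages
  tempDir: "/os/tmp" // Directory for temporary files
});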

Key Features:

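The maintainFormat option helps keep formatting consistent across pages, which is especially useful for documents with tables. It works by passing the output of the previous page as additional context for the next one, so requests must run one after another instead of concurrently, and the conversion is noticeably slower.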
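
The trade-off behind maintainFormat can be illustrated with a short sketch. It assumes, as the project README describes, that format preservation works by feeding each page's Markdown to the next request; `convert` is a hypothetical coroutine standing in for the model call, and none of these names come from the Zerox API.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional

# Hypothetical signature: (page image, previous page's Markdown or None) -> Markdown
PageConverter = Callable[[bytes, Optional[str]], Awaitable[str]]


async def convert_pages(
    images: List[bytes],
    convert: PageConverter,
    maintain_format: bool = False,
    concurrency: int = 10,
) -> List[str]:
    if maintain_format:
        # Each request needs the previous page's output, so pages run one by one.
        results: List[str] = []
        previous: Optional[str] = None
        for image in images:
            previous = await convert(image, previous)
            results.append(previous)
        return results

    # Without shared context the pages are independent and can run in parallel,
    # with at most `concurrency` requests in flight.
    semaphore = asyncio.Semaphore(concurrency)

    async def limited(image: bytes) -> str:
        async with semaphore:
            return await convert(image, None)

    # gather() returns results in the order the pages were submitted.
    return list(await asyncio.gather(*(limited(image) for image in images)))
```

In the sequential branch the total time grows with the sum of all request latencies, which is why the option is slower than the concurrent default.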

Sample Output

{
  completionTime: 10038,
  fileName: 'invoice_36258',
  inputTokens: 25543,
  outputTokens: 210,
  pages: [
    {
      content: '# INVOICE # 36258\n' +
      '**Date:** Mar 06 2012  \n' +
      '**Ship Mode:** First Class  \n' +
      '**Balance Due:** $50.10  \n' + ...
      page: 1,
      contentLength: 747
    }
  ]
}
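
A result shaped like the sample above is easy to post-process. The sketch below assumes the result has been loaded as a plain Python dictionary with the same field names as the sample; the Python package may expose these fields differently, and both helper names are illustrative.

```python
from typing import Any, Dict


def combine_pages(result: Dict[str, Any]) -> str:
    """Join the page contents, in page order, into one Markdown string."""
    pages = sorted(result["pages"], key=lambda page: page["page"])
    return "\n\n".join(page["content"] for page in pages)


def total_tokens(result: Dict[str, Any]) -> int:
    """Sum input and output tokens, e.g. for rough cost tracking."""
    return result["inputTokens"] + result["outputTokens"]
```

Sorting by the page field guards against pages arriving out of order when requests run concurrently.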

Python Zerox

The Python package supports vision models from providers such as OpenAI, Azure, Anthropic, and AWS Bedrock.

Installation Instructions

Install poppler-utils on the system; it is needed to render PDF pages as images.
Then install the Zerox Python package with pip:

pip install py-zerox

Example Usage in Python

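import asyncio
import os

from pyzerox import zerox

# The API key is read from the environment (OPENAI_API_KEY for OpenAI models).
if "OPENAI_API_KEY" not in os.environ:
    raise RuntimeError("Set OPENAI_API_KEY before running this example")

async def main():
    # file_path can be a URL or a local path
    file_path = "https://omni-demo-data.s3.amazonaws.com/test/cs101.pdf"
    result = await zerox(file_path=file_path, model="gpt-4o-mini")
    return result

# zerox is a coroutine, so it must be run inside an event loop
result = asyncio.run(main())
print(result)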

Supported File Types

Zerox supports a range of file types. Files that are neither images nor PDFs are converted to images using libreoffice and graphicsmagick. Supported formats include:

Document formats: PDF, DOC, DOCX, ODT, RTF, TXT, etc.
Spreadsheet formats: XLS, XLSX, ODS, CSV, etc.
Presentation formats: PPT, PPTX, ODP, etc.
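
One way to picture the routing implied above is a lookup on the file extension. This is a hypothetical sketch, not Zerox's own logic: the extension sets are illustrative subsets taken from the list above (the image extensions are an assumption), and `conversion_route` is an invented helper name.

```python
from pathlib import Path

# Illustrative subsets only; the library's real lists are longer.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}
OFFICE_EXTENSIONS = {
    ".doc", ".docx", ".odt", ".rtf", ".txt",   # documents
    ".xls", ".xlsx", ".ods", ".csv",           # spreadsheets
    ".ppt", ".pptx", ".odp",                   # presentations
}


def conversion_route(file_path: str) -> str:
    """Return which preprocessing path a file would take."""
    suffix = Path(file_path).suffix.lower()
    if suffix in IMAGE_EXTENSIONS:
        return "image"    # already an image, no conversion needed
    if suffix == ".pdf":
        return "pdf"      # rasterize pages directly
    if suffix in OFFICE_EXTENSIONS:
        return "office"   # convert with an office suite first, then rasterize
    raise ValueError(f"Unsupported file type: {suffix or file_path}")
```

Lower-casing the suffix makes the check case-insensitive, so `REPORT.DOCX` and `report.docx` take the same route.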

License and Credits

Zerox is licensed under the MIT License. The Python package is powered by LiteLLM, which provides access to popular vision models from various providers.

In summary, Zerox OCR converts complex document formats into machine-readable Markdown for AI applications, using vision models to preserve as much of the original structure as possible.