Zerox OCR: A Seamless Way to Digitize Documents
Overview
Zerox OCR is designed to make optical character recognition (OCR) straightforward: it converts documents in a variety of formats into Markdown suitable for AI consumption. Such documents often contain complex layouts with tables and charts, which can be challenging to process. Zerox simplifies this by:
- Receiving a document file (pdf, docx, image, etc.).
- Converting the file into a series of images.
- Using GPT to convert each image to Markdown text.
- Compiling the responses into a complete Markdown document.
This process leverages vision models for precise data extraction and format consistency, offering a user-friendly document processing tool. A hosted version is available for testing at: Zerox OCR Demo.
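The four steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: `render_pages` and `page_to_markdown` are hypothetical stand-ins (not part of the Zerox API) for the image conversion and the vision-model call, stubbed so that the example runs without any dependencies.

```python
from typing import Callable, List


def render_pages(document: bytes) -> List[bytes]:
    """Hypothetical stand-in for step 2: split a document into page images.

    Here each form-feed-separated chunk is treated as one "page image".
    """
    return [chunk for chunk in document.split(b"\f") if chunk]


def page_to_markdown(image: bytes) -> str:
    """Hypothetical stand-in for step 3: a vision model would return Markdown."""
    return image.decode("utf-8").strip()


def document_to_markdown(
    document: bytes,
    render: Callable[[bytes], List[bytes]] = render_pages,
    transcribe: Callable[[bytes], str] = page_to_markdown,
) -> str:
    """Steps 1-4: receive a file, render pages, transcribe each, compile."""
    pages = render(document)
    return "\n\n".join(transcribe(page) for page in pages)


if __name__ == "__main__":
    fake_document = b"# Page one\f# Page two"
    print(document_to_markdown(fake_document))
```

Passing the rendering and transcription steps in as functions keeps the sketch independent of any particular converter or model provider.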
Getting Started
Zerox is accessible as both a Node.js and Python package, making it versatile across different development environments.
- Node Package: Detailed instructions and installation guides here.
- Python Package: Installation instructions are available here.
Node Zerox
To install Zerox on Node.js:
npm install zerox
Zerox relies on graphicsmagick and ghostscript to convert PDFs to images. Both are usually installed automatically, but they may require manual installation on some systems.
Example Usage
Using a file URL:
import { zerox } from "zerox";

const result = await zerox({
  filePath: "https://omni-demo-data.s3.amazonaws.com/test/cs101.pdf",
  openaiAPIKey: process.env.OPENAI_API_KEY
});
From a local path:
import path from "path";
import { zerox } from "zerox";

const result = await zerox({
  filePath: path.resolve(__dirname, "./cs101.pdf"),
  openaiAPIKey: process.env.OPENAI_API_KEY
});
Configuration Options
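const result = await zerox({
  filePath: "path/to/file",
  openaiAPIKey: process.env.OPENAI_API_KEY,
  cleanup: true, // remove temporary images after the run
  concurrency: 10, // number of pages processed at a time
  maintainFormat: false, // favor consistent formatting over speed when true
  model: 'gpt-4o-mini', // vision model used for each page
  outputDir: undefined, // if set, the combined Markdown is saved here
  pagesToConvertAsImages: -1, // -1 converts all pages
  tempDir: "/os/tmp" // directory for temporary files
});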
Key Features:
- The maintainFormat option keeps formatting consistent across pages, which is especially useful for documents with tables. It slows processing down, because requests then run one after another rather than concurrently.
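To illustrate the trade-off, here is a minimal sketch of the two scheduling modes. It assumes (this is not stated above) that consistency is achieved by passing each page's output to the next request as context, which would force pages to run one at a time. `transcribe_page` and `process_pages` are hypothetical names, and the model call is replaced by a stub.

```python
import asyncio
from typing import List, Optional


async def transcribe_page(page: str, prior: Optional[str] = None) -> str:
    """Stub for a vision-model request; a real call would send an image."""
    await asyncio.sleep(0)  # yield control, as a network call would
    context = f" (after: {prior})" if prior else ""
    return f"md:{page}{context}"


async def process_pages(
    pages: List[str], maintain_format: bool = False, concurrency: int = 10
) -> List[str]:
    if maintain_format:
        # Sequential: each request waits for the previous page's output.
        results: List[str] = []
        prior: Optional[str] = None
        for page in pages:
            prior = await transcribe_page(page, prior)
            results.append(prior)
        return results

    # Concurrent: at most `concurrency` requests are in flight at once.
    limit = asyncio.Semaphore(concurrency)

    async def bounded(page: str) -> str:
        async with limit:
            return await transcribe_page(page)

    # asyncio.gather returns results in the order of its arguments.
    return list(await asyncio.gather(*(bounded(p) for p in pages)))


if __name__ == "__main__":
    print(asyncio.run(process_pages(["p1", "p2"], maintain_format=True)))
    print(asyncio.run(process_pages(["p1", "p2"], maintain_format=False)))
```

In the sequential branch the total time grows with the number of pages, whereas the concurrent branch is bounded by the slowest batch, which is why the option is off by default in the configuration shown above.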
Sample Output
{
  completionTime: 10038,
  fileName: 'invoice_36258',
  inputTokens: 25543,
  outputTokens: 210,
  pages: [
    {
      content: '# INVOICE # 36258\n' +
        '**Date:** Mar 06 2012 \n' +
        '**Ship Mode:** First Class \n' +
        '**Balance Due:** $50.10 \n' + ...
      page: 1,
      contentLength: 747
    }
  ]
}
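As a hedged illustration of how this structure might be consumed, the sketch below assumes the result has been serialized to JSON and loaded into a Python dictionary with the same field names as the sample. `combine_pages` and `summarize` are hypothetical helpers, and the page content is shortened from the sample.

```python
from typing import Any, Dict

# Shape mirrors the sample output above; the page content is shortened,
# so contentLength no longer matches the actual string length.
sample_result: Dict[str, Any] = {
    "completionTime": 10038,
    "fileName": "invoice_36258",
    "inputTokens": 25543,
    "outputTokens": 210,
    "pages": [
        {
            "content": "# INVOICE # 36258\n**Date:** Mar 06 2012",
            "page": 1,
            "contentLength": 747,
        }
    ],
}


def combine_pages(result: Dict[str, Any]) -> str:
    """Join page contents in page order into one Markdown string."""
    ordered = sorted(result["pages"], key=lambda p: p["page"])
    return "\n\n".join(p["content"] for p in ordered)


def summarize(result: Dict[str, Any]) -> str:
    """One-line summary of page count and token usage for a run."""
    total = result["inputTokens"] + result["outputTokens"]
    return f"{result['fileName']}: {len(result['pages'])} page(s), {total} tokens"


if __name__ == "__main__":
    print(summarize(sample_result))
    print(combine_pages(sample_result))
```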
Python Zerox
The Python package supports vision models from providers such as OpenAI, Azure, Anthropic, and AWS Bedrock.
Installation Instructions
- Install poppler-utils, a required system dependency.
- Use pip to install Zerox for Python:
pip install py-zerox
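Before running the Python package, it can help to confirm that the system dependency is visible on PATH. The sketch below assumes that poppler-utils provides the `pdftoppm` executable (true for common Linux packages, but worth verifying on your platform). `missing_tools` is a hypothetical helper, not part of py-zerox.

```python
import shutil
from typing import Iterable, List


def missing_tools(tools: Iterable[str]) -> List[str]:
    """Return the executables from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # Assumption: pdftoppm is the executable shipped with poppler-utils.
    missing = missing_tools(["pdftoppm"])
    if missing:
        print("Missing executables:", ", ".join(missing))
    else:
        print("All required executables were found.")
```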
Example Usage in Python
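from pyzerox import zerox
import os
import asyncio

# Assumes the provider API key (e.g. OPENAI_API_KEY) is set in the environment.

async def main():
    file_path = "https://omni-demo-data.s3.amazonaws.com/test/cs101.pdf"
    result = await zerox(file_path=file_path, model="gpt-4o-mini")
    return result

# Run the coroutine and print the conversion result.
result = asyncio.run(main())
print(result)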
Supported File Types
Zerox supports a wide range of file types. Files that are neither images nor PDFs are converted to images using libreoffice and graphicsmagick. Supported formats include:
- Document formats: PDF, DOC, DOCX, ODT, RTF, TXT, etc.
- Spreadsheet formats: XLS, XLSX, ODS, CSV, etc.
- Presentation formats: PPT, PPTX, ODP, etc.
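A minimal sketch of how files might be routed by extension, using only the formats listed above. The function name `conversion_route` and its route labels are hypothetical, and the image extensions are an assumption, since the list above does not enumerate them.

```python
from pathlib import Path

# Extensions taken from the lists above (the "etc." entries are not expanded).
OFFICE_EXTENSIONS = {
    ".doc", ".docx", ".odt", ".rtf", ".txt",  # documents
    ".xls", ".xlsx", ".ods", ".csv",          # spreadsheets
    ".ppt", ".pptx", ".odp",                  # presentations
}
# Assumption: a few common image types; not enumerated in the text above.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def conversion_route(file_name: str) -> str:
    """Classify a file as 'image', 'pdf', 'office', or 'unknown'."""
    suffix = Path(file_name).suffix.lower()
    if suffix in IMAGE_EXTENSIONS:
        return "image"   # already an image, no conversion needed
    if suffix == ".pdf":
        return "pdf"     # rendered to page images directly
    if suffix in OFFICE_EXTENSIONS:
        return "office"  # converted to images first
    return "unknown"


if __name__ == "__main__":
    for name in ["cs101.pdf", "report.DOCX", "scan.png", "notes.xyz"]:
        print(name, "->", conversion_route(name))
```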
License and Credits
Zerox is licensed under the MIT License. It is powered by LiteLLM, which provides access to popular vision models from various providers.
In summary, Zerox OCR converts complex documents into machine-readable Markdown for AI applications, with the aim of extracting content accurately and preserving the structure of the original.