ExtractThinker - Streamline Document Extraction with LLM Technology for Efficient Workflows

ExtractThinker Project Overview

ExtractThinker is a library for extracting data from files and documents using Large Language Models (LLMs). Its ORM-style interaction lets users describe the data they want and build flexible document extraction workflows around that description.

Features of ExtractThinker

ExtractThinker offers the following features for document processing:

Multiple Document Loaders: It supports various document loaders, including Tesseract OCR, Azure Form Recognizer, AWS Textract, and Google Document AI, ensuring broad compatibility with different document types.
Customizable Extraction: Users can define custom extraction rules using contract definitions, tailoring the extraction process to suit their specific needs.
Asynchronous Processing: Documents can be processed asynchronously, so one task does not have to finish before the next one starts, which improves overall throughput.
Support for Various Document Formats: ExtractThinker is built to handle a variety of document formats, making it a convenient choice for diverse document processing tasks.
ORM-Style Interaction: The library facilitates ORM-style interaction between files and LLMs, simplifying complex document extraction processes.
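
The asynchronous processing idea in the list above can be illustrated with plain asyncio. This is a minimal sketch and not ExtractThinker's own API: extract_fields is a hypothetical stand-in for an I/O-bound extraction call (OCR plus an LLM request), and the file names are placeholders.

```python
import asyncio


async def extract_fields(path: str) -> dict:
    # Hypothetical stand-in for a real extraction call (not an ExtractThinker function).
    await asyncio.sleep(0.01)  # simulates waiting on OCR or an LLM response
    return {"path": path, "status": "done"}


async def extract_all(paths: list[str]) -> list[dict]:
    # gather() runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*(extract_fields(p) for p in paths))


results = asyncio.run(extract_all(["a.png", "b.png", "c.png"]))
```

Because the waits overlap, total time is close to the slowest single call rather than the sum of all calls.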

Installation

Getting started with ExtractThinker is straightforward. To install the package, users can simply run the following command:

pip install extract_thinker

Usage

ExtractThinker is designed to be easy to use. Below is an example to help users get started. This example demonstrates how to load a document using Tesseract OCR and extract specific fields.

import os
from dotenv import load_dotenv
from extract_thinker import DocumentLoaderTesseract, Extractor, Contract

load_dotenv()
cwd = os.getcwd()

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str

tesseract_path = os.getenv("TESSERACT_PATH")
test_file_path = os.path.join(cwd, "test_images", "invoice.png")

extractor = Extractor()
extractor.load_document_loader(
    DocumentLoaderTesseract(tesseract_path)
)
extractor.load_llm("claude-3-haiku-20240307")

result = extractor.extract(test_file_path, InvoiceContract)

print("Invoice Number: ", result.invoice_number)
print("Invoice Date: ", result.invoice_date)
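
The contract above declares invoice_date as a plain string, so the extracted value may need normalizing before further use. The following is a minimal sketch under the assumption that the date arrives in one of a few common formats; parse_invoice_date and DATE_FORMATS are hypothetical helpers and are not part of ExtractThinker.

```python
from datetime import date, datetime
from typing import Optional

# Assumed candidate formats; adjust to the documents you actually process.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def parse_invoice_date(raw: str) -> Optional[date]:
    """Return a date for the first matching format, or None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue  # try the next candidate format
    return None
```

Returning None instead of raising keeps the decision about unparseable dates with the caller, for example flagging the document for manual review.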

Document Splitting Example

ExtractThinker also supports document splitting: a file can be split into separate documents, each one classified, and data extracted according to the matching contract. Here's an example:

import os
from dotenv import load_dotenv
from extract_thinker import DocumentLoaderTesseract, Extractor, Contract, Process, Classification, ImageSplitter

load_dotenv()

class DriverLicense(Contract):
    # Define your DriverLicense contract fields here
    pass

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str

extractor = Extractor()
extractor.load_document_loader(DocumentLoaderTesseract(os.getenv("TESSERACT_PATH")))
extractor.load_llm("gpt-3.5-turbo")

classifications = [
    Classification(name="Driver License", description="This is a driver license", contract=DriverLicense, extractor=extractor),
    Classification(name="Invoice", description="This is an invoice", contract=InvoiceContract, extractor=extractor)
]

process = Process()
process.load_document_loader(DocumentLoaderTesseract(os.getenv("TESSERACT_PATH")))
process.load_splitter(ImageSplitter())

path = "..."

split_content = process.load_file(path)\
    .split(classifications)\
    .extract()

# Process the split_content as needed
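
One way to handle the results is to group them by type. The sketch below assumes that split_content is an iterable of extracted contract instances, which the example above does not confirm. InvoiceResult and LicenseResult are hypothetical stand-ins so the snippet runs without the library, and holder_name is an invented field.

```python
from dataclasses import dataclass


@dataclass
class InvoiceResult:  # hypothetical stand-in for an extracted InvoiceContract
    invoice_number: str
    invoice_date: str


@dataclass
class LicenseResult:  # hypothetical stand-in for an extracted DriverLicense
    holder_name: str


def summarize(items: list) -> dict[str, int]:
    # Count how many extracted documents of each type were found.
    counts: dict[str, int] = {}
    for item in items:
        key = type(item).__name__
        counts[key] = counts.get(key, 0) + 1
    return counts


sample = [
    InvoiceResult("INV-001", "2024-03-07"),
    LicenseResult("Jane Doe"),
    InvoiceResult("INV-002", "2024-03-08"),
]
summary = summarize(sample)
```

The same type-based grouping could be used to route invoices and licenses to different downstream handlers.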

Infrastructure

ExtractThinker is inspired by the LangChain ecosystem, which is known for its modular infrastructure of templates, components, and core functions. This modular design is intended to keep document processing robust and flexible.

Why Choose ExtractThinker over LangChain?

While LangChain is a general-purpose framework covering a wide range of use cases, ExtractThinker focuses solely on Intelligent Document Processing (IDP). This specialization lets it apply LLMs specifically toward higher accuracy in document extraction.

Contributing and Community

ExtractThinker is open to contributions from the community. Contributors can fork the repository, work on features or bug fixes, and submit a pull request.

License and Contact

ExtractThinker is licensed under the Apache License 2.0. For more information or to address questions, users are encouraged to open an issue on the GitHub repository.

Conclusion

ExtractThinker is a focused, adaptable tool for intelligent document processing, suited to professionals and businesses that need efficient data extraction from documents.