Octopii - Enhancing Cybersecurity with Automated PII Detection via OCR and NLP

Introduction to Octopii

Octopii is a tool developed by RedHunt Labs that scans public-facing locations for Personally Identifiable Information (PII). It combines Optical Character Recognition (OCR), regular expressions, and Natural Language Processing (NLP) to search images, PDFs, and documents for sensitive information such as government IDs, addresses, and emails.

Why Octopii?

In cybersecurity, data leaks, especially those involving PII, are often underestimated. RedHunt Labs found that many organizations run misconfigured servers that continuously leak employee and customer information, including IDs, contact details, and even locations. Such leaks hand valuable information to malicious actors. Octopii was created to demonstrate and automate the discovery and extraction of these leaks, helping organizations better secure their data.

How to Use Octopii

Installing Dependencies

To get started with Octopii, users must install essential software dependencies:

Use pip install -r requirements.txt to install required Python packages.
Install the Tesseract OCR tool with sudo apt install tesseract-ocr -y for Ubuntu or sudo pacman -Syu tesseract for Arch Linux.
For NLP capabilities, install the spaCy English language model with python -m spacy download en_core_web_sm.
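
Before the first scan, it can help to confirm that the prerequisites above are actually present. The following is a minimal sketch, not part of Octopii itself: the function name check_octopii_prerequisites is hypothetical, and the check assumes that the Tesseract package puts a tesseract executable on PATH and that the spaCy model is installed as an importable package named en_core_web_sm.

```python
import importlib.util
import shutil


def check_octopii_prerequisites():
    """Return a dict mapping each prerequisite to True if it appears installed."""
    return {
        # Assumption: the Tesseract package provides a `tesseract` binary on PATH.
        "tesseract": shutil.which("tesseract") is not None,
        # find_spec returns None when a top-level package cannot be found.
        "spacy": importlib.util.find_spec("spacy") is not None,
        "en_core_web_sm": importlib.util.find_spec("en_core_web_sm") is not None,
    }


if __name__ == "__main__":
    for name, present in check_octopii_prerequisites().items():
        print(f"{name}: {'found' if present else 'MISSING'}")
```

Anything reported as MISSING points back to the corresponding install command in the list above.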

Running the Tool

Once the setup is complete, Octopii can be run using a simple command:

python3 octopii.py <location to scan>

This <location to scan> can be a file, directory, S3 URL, or an Apache open directory listing. Users can also provide individual image URLs or files directly to the tool.
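
To make the different location types concrete, here is a rough sketch of how a scan target could be sorted into one of those categories. It is an illustration only and does not reproduce Octopii's own logic: the function classify_location, the category names, and the hostname and trailing-slash heuristics are all assumptions.

```python
import os
from urllib.parse import urlparse


def classify_location(location: str) -> str:
    """Guess which kind of scan target a location string refers to."""
    parsed = urlparse(location)
    if parsed.scheme in ("http", "https"):
        host = parsed.netloc.lower()
        # Assumption: S3 URLs are recognised by an amazonaws.com hostname
        # that also mentions "s3".
        if host.endswith("amazonaws.com") and "s3" in host:
            return "s3"
        # Assumption: a URL ending in "/" is a directory listing,
        # anything else is treated as a single remote file.
        if parsed.path == "" or parsed.path.endswith("/"):
            return "open_directory"
        return "remote_file"
    if os.path.isdir(location):
        return "local_directory"
    # Everything else is treated as a local file path (existence is not checked).
    return "local_file"
```

For example, classify_location("http://example.com/files/") returns "open_directory", while a path to an existing folder returns "local_directory".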

Example

By running Octopii on sample data, such as the provided dummy-pii/ folder, users receive detailed output about any detected PII:

owais@artemis ~ $ python3 octopii.py dummy-pii/

Searching for PII in dummy-pii/dummy-drivers-license-nebraska-us.jpg
{
    "file_path": "dummy-pii/dummy-drivers-license-nebraska-us.jpg",
    "pii_class": "Nebraska Driver's License",
    "country_of_origin": "United States",
    "faces": 1,
    "phone_numbers": ["4000002170"],
    "addresses": ["Nebraska"]
}

Octopii also writes all results to an output.txt file, which it updates in real time as each file is processed.
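
The idea of a results file that grows as the scan progresses can be sketched as follows. This is an assumption-laden illustration rather than Octopii's actual writer: the helper append_result is hypothetical, and the exact layout of output.txt is not specified here, so pretty-printed JSON is used to mirror the console output shown above.

```python
import json


def append_result(result: dict, output_path: str = "output.txt") -> None:
    """Append one scan result to the output file as pretty-printed JSON."""
    # Mode "a" adds to the end of the file, so earlier results are kept
    # and each result is written as soon as its file has been processed.
    with open(output_path, "a", encoding="utf-8") as handle:
        json.dump(result, handle, indent=4)
        handle.write("\n")


if __name__ == "__main__":
    append_result({
        "file_path": "dummy-pii/dummy-drivers-license-nebraska-us.jpg",
        "pii_class": "Nebraska Driver's License",
        "country_of_origin": "United States",
        "faces": 1,
        "phone_numbers": ["4000002170"],
        "addresses": ["Nebraska"],
    })
```

Because the file is opened in append mode for every result, a scan that is interrupted partway through still leaves the results gathered so far on disk.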

How Octopii Works

Octopii conducts its operations through several structured steps:

Input and Importing: The tool identifies and reads images or text-based files from local storage, S3 buckets, or open directories.
Face Detection: Uses a Haar cascade classifier to detect faces in images and reports how many faces each file contains.
Image Cleaning and Text Extraction: Conducts multiple processes, such as auto-rotation and grayscaling, to prepare images for text extraction via OCR.
Optical Character Recognition (OCR) and NLP: Extracts and interprets text from images using Tesseract. The text is then analyzed to identify PII by comparing detected words against a defined list of sensitive keywords.
Output: Provides comprehensive output detailing the file path, type and origin of PII, identifiable data, emails, phone numbers, and addresses detected.
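
The text-analysis part of the steps above can be sketched in plain Python once OCR has produced a string. This is a simplified illustration, not Octopii's implementation: the regular expressions, the PII_KEYWORDS table, and the function names classify_text and scan_text are assumptions made for the example, and the real tool's keyword lists and patterns may differ.

```python
import re

# Assumed patterns: a simple email matcher and a bare 10-digit phone number.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{10}\b")

# Hypothetical keyword lists, one per document class.
PII_KEYWORDS = {
    "Nebraska Driver's License": ["nebraska", "driver", "license"],
    "Passport": ["passport", "nationality"],
}


def classify_text(text):
    """Return the class whose keywords occur most often in the text, or None."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {
        label: sum(1 for keyword in keywords if keyword in words)
        for label, keywords in PII_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def scan_text(text):
    """Collect the document class plus any emails and phone numbers found."""
    return {
        "pii_class": classify_text(text),
        "emails": EMAIL_RE.findall(text),
        "phone_numbers": PHONE_RE.findall(text),
    }


if __name__ == "__main__":
    sample = "NEBRASKA DRIVER LICENSE\nPhone 4000002170 contact jane@example.com"
    print(scan_text(sample))
```

Counting keyword hits per class keeps the classification tolerant of OCR noise, since a document can still be recognised when only some of its expected words were read correctly.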

Contributions and Credits

Anyone who wants to contribute to Octopii can find guidance in the project's contributing guidelines. The project credits the tools it builds on, including BeautifulSoup, Tesseract, SciKit, OpenCV, and others.

Important Note

Octopii is intended solely for research and educational purposes. RedHunt Labs is not responsible for any malicious misuse of this tool. The project is released under the MIT License.

Octopii was authored by Owais Shaikh.