Introduction to the OCR_DataSet Project
The OCR_DataSet project is designed to be a comprehensive resource for those interested in optical character recognition (OCR). This project provides a wide range of datasets that are crucial for training and testing OCR technologies. Each dataset has been converted into a unified format to facilitate both detection and recognition tasks, making it easier for researchers and developers to work with these resources without needing to worry about compatibility issues.
Key Features of the OCR_DataSet Project:
- Unified Dataset Format: The project standardizes each dataset into a single, consistent format covering both text detection and text recognition. This includes popular datasets such as ICDAR2015, MLT2019, and COCO-Text_v2, among others, so users can switch between datasets without manually adjusting for differences in annotation layout or data structure.
- Extensive Dataset Collection: The collection mixes datasets covering different languages and recognition scenarios, including English, Chinese, and mixed-language data, ensuring broad applicability across OCR requirements. Notable datasets include:
  - ICDAR2015, with English scene text.
  - MLT2019, which contains a diverse set of multilingual scene images.
  - COCO-Text_v2, with mixed-language text for evaluation.
  - ReCTS and SROIE, which contribute to text detection and to specialized recognition tasks such as signboards and scanned receipts.
- Diverse Applications: The datasets cover scenarios ranging from natural images to synthetic data. Some, like Synth800k, consist of synthetic images generated specifically for training OCR systems, while others, like the Baidu Chinese Scene Text Recognition data, focus on real-world scenes.
- Comprehensive Annotations: Each dataset is annotated with detailed information about the text regions in its images, combining coordinates with transcription data. This includes precise demarcation of text regions, character-level annotations where available, and tagging of fuzzy or illegible text (a small parsing sketch follows this list). For example:
  - ICDAR2015 annotations list the four corner points of each text region (x1, y1, x2, y2, x3, y3, x4, y4) followed by its transcription, specifying the exact location of the text.
  - LSVT includes flags indicating whether a text region is legible or illegible.
- Easy Accessibility and Integration: Each dataset is available via Baidu Cloud links, so users can download and incorporate it into their projects quickly. A collective link covers the full set of datasets, alongside individual links for specialized packs such as the Synthetic Chinese String Dataset and the English recognition data pack.
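To make the unified-format idea concrete, here is a minimal sketch that parses an ICDAR2015-style ground-truth file (eight comma-separated corner coordinates followed by a transcription, with "###" marking illegible text) into a single dictionary record. The field names (`polygon`, `text`, `illegibility`) are illustrative assumptions, not necessarily the project's exact schema.

```python
# Minimal sketch: parse an ICDAR2015-style ground-truth file into a unified record.
# Field names (polygon, text, illegibility) are illustrative, not the project's exact schema.
import json

def parse_icdar2015_gt(gt_path, img_name):
    annotations = []
    with open(gt_path, encoding="utf-8-sig") as f:   # ICDAR ground-truth files often carry a BOM
        for line in f:
            parts = line.strip().split(",")
            if len(parts) < 9:
                continue
            coords = list(map(int, parts[:8]))
            text = ",".join(parts[8:])                # the transcription may itself contain commas
            annotations.append({
                "polygon": [coords[i:i + 2] for i in range(0, 8, 2)],
                "text": text,
                "illegibility": text == "###",        # "###" marks unreadable / "don't care" regions
            })
    return {"img_name": img_name, "annotations": annotations}

if __name__ == "__main__":
    record = parse_icdar2015_gt("gt_img_1.txt", "img_1.jpg")
    print(json.dumps(record, ensure_ascii=False, indent=2))
```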
Download and Usage Guidelines:
When downloading these datasets, users must update the paths in the annotation files to match their local directory structure. This keeps integration with custom setups seamless and prevents errors from missing files or incorrect directory references.
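As a hypothetical example, assuming a recognition label file with one `image_path<TAB>transcription` pair per line (the actual layout may differ), the paths could be rebased to a local directory like this:

```python
# Hypothetical example: rebase image paths in a tab-separated label file.
# The "path<TAB>transcription" layout is an assumption; adjust to the actual files.
import os

def rebase_label_file(label_path, new_root, out_path):
    with open(label_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            line = line.rstrip("\n")
            if not line:
                continue
            img_path, transcription = line.split("\t", 1)
            # Keep only the file name and prepend the local dataset root.
            local_path = os.path.join(new_root, os.path.basename(img_path))
            fout.write(f"{local_path}\t{transcription}\n")

rebase_label_file("train_label.txt", "/data/ocr/icdar2015/images", "train_label_local.txt")
```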
Dataset Access and Tools:
- Data Generation Tool: A tool is referenced for generating synthetic text data, which is particularly useful for diversifying training data; it is hosted in a documented GitHub repository (a toy rendering example follows this list).
- Reading Scripts: Scripts are provided for reading the detection and recognition datasets, supporting tasks such as model training and data preprocessing (see the reader sketch below).
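The following is only a toy illustration of the synthetic-text idea using Pillow, not the generation tool referenced above: it renders a transcription onto a plain background to produce an (image, label) pair.

```python
# Toy illustration of synthetic text-image generation with Pillow.
# This is NOT the referenced data-generation tool; it only shows the idea of
# rendering a known transcription onto a background to create training pairs.
from PIL import Image, ImageDraw, ImageFont
import random

def render_sample(text, size=(280, 32)):
    # Light background with a slight random tint to mimic varied paper/scene colors.
    bg = tuple(random.randint(220, 255) for _ in range(3))
    img = Image.new("RGB", size, bg)
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()          # swap in a .ttf font for more realistic data
    draw.text((4, 8), text, fill=(0, 0, 0), font=font)
    return img

if __name__ == "__main__":
    sample = render_sample("OCR_DataSet")
    sample.save("synthetic_sample.jpg")      # pair the image with its label "OCR_DataSet"
```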
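And here is a minimal sketch of what a recognition reading script might look like using PyTorch's `Dataset` interface, again assuming the tab-separated label layout described earlier rather than the project's actual scripts.

```python
# Minimal PyTorch-style reader for a recognition dataset; a sketch, not the project's script.
# Assumes a label file with one "image_path<TAB>transcription" pair per line.
from torch.utils.data import Dataset
from PIL import Image

class RecognitionDataset(Dataset):
    def __init__(self, label_path, transform=None):
        self.samples = []
        with open(label_path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    img_path, text = line.rstrip("\n").split("\t", 1)
                    self.samples.append((img_path, text))
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        img_path, text = self.samples[idx]
        image = Image.open(img_path).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, text
```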
This comprehensive approach equips researchers and developers with reliable resources for building and fine-tuning OCR models, thereby advancing the field of text recognition in natural and digital environments. The OCR_DataSet project stands as a pillar for both academic exploration and practical implementation in the OCR community.