tessdata - Varied Data Sets for Tesseract.js to Enhance OCR Efficiency

Overview of Tessdata Project

The Tessdata project is a repository that hosts a collection of .traineddata files designed for use with Tesseract.js, an OCR (Optical Character Recognition) library. These trained data files include both the default files used by Tesseract.js and alternative versions that cater to different OCR needs and preferences.

Language Data

Within the Tessdata repository, various sets of files are available, each offering distinct features, versions, and technical benefits. Here's a summary of these file sets:

4.0.0_best_int: The integerized version of "Tessdata Best", containing LSTM (Long Short-Term Memory) data only. It is the default when Tesseract.js runs with OEM (OCR Engine Mode) set to LSTM only. This version is published as NPM packages for convenient integration into Node.js projects.
4.0.0: Known simply as "Tessdata," this version includes integerized LSTM data along with data for the legacy engine (OEM: LSTM + Legacy). It is the default when OEM is set to legacy only or to LSTM with a legacy fallback. This version is likewise published to NPM.
4.0.0_fast: The "Tessdata Fast" version, optimized for speed and containing LSTM data only. It is not the default for Tesseract.js and is not published to NPM. The data comes from the tessdata_fast repository.
4.0.0_best: The "Tessdata Best" version, designed for situations demanding possibly higher accuracy at the cost of larger file sizes and longer runtime. It is not used by default with Tesseract.js and requires careful consideration before use due to its substantial size. Further information is available at the tessdata_best repository.
3.0.2: Legacy data from Tesseract version 3, for use with OEM set to legacy only. These older files are not the default, are not published to NPM, and might be removed from the repo in the future.
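
The defaults described in the list above can be summarized in a small lookup. The following is a minimal sketch, not part of Tesseract.js: the helper name defaultDataSetFor is hypothetical, and the numeric OEM values are assumed to mirror the constants Tesseract.js exposes, so check them against the library before relying on them.

```javascript
// OEM values, mirrored here so the sketch is self-contained.
// Assumption: these match the OEM constants used by Tesseract.js.
const OEM = {
  TESSERACT_ONLY: 0,          // legacy engine only
  LSTM_ONLY: 1,               // LSTM engine only
  TESSERACT_LSTM_COMBINED: 2, // LSTM with legacy fallback
};

// Hypothetical helper: returns the data set that the list above
// describes as the default for a given OEM.
function defaultDataSetFor(oem) {
  switch (oem) {
    case OEM.LSTM_ONLY:
      return '4.0.0_best_int';
    case OEM.TESSERACT_ONLY:
    case OEM.TESSERACT_LSTM_COMBINED:
      return '4.0.0';
    default:
      // The list does not describe a default for any other mode.
      throw new RangeError(`No default data set described for OEM ${oem}`);
  }
}
```

The other sets (4.0.0_fast, 4.0.0_best, 3.0.2) are never selected by default, so using them means pointing langPath at a copy of those files explicitly.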

NPM Packages

To facilitate easier access and use, the 4.0.0 and 4.0.0_best_int files are organized into language-specific NPM packages. This means each language has its own package, ensuring the package size remains manageable. For instance, the English language package can be found under @tesseract.js-data/eng.
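
Because every language lives in its own package, a project that recognizes several languages needs one dependency per language. As an illustration only, the sketch below derives package names from a Tesseract-style language string such as "eng+chi_sim". Both helper names are hypothetical, and the character check is a simplifying assumption rather than a rule taken from the repository.

```javascript
// Hypothetical helper: maps one language code to its per-language package name.
function dataPackageName(lang) {
  // Assumption: language codes consist of letters and underscores, e.g. "eng", "chi_sim".
  if (typeof lang !== 'string' || !/^[A-Za-z_]+$/.test(lang)) {
    throw new TypeError(`Unexpected language code: ${String(lang)}`);
  }
  return `@tesseract.js-data/${lang}`;
}

// Hypothetical helper: splits a "+"-separated language string and
// returns the package name for each language in order.
function dataPackagesFor(langs) {
  return langs.split('+').map(dataPackageName);
}
```

For example, dataPackagesFor('eng+chi_sim') yields the two package names to add as dependencies, one for English and one for Simplified Chinese.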

Using Language Data with Tesseract.js

To integrate these language data files with Tesseract.js, users should follow the Tesseract.js documentation to manually configure the langPath. Here are several options for accessing these files:

CDNs (Content Delivery Networks):
- JSDelivr: This is the default CDN used by Tesseract.js. It serves the language data directly from the NPM packages, for example the English data in @tesseract.js-data/eng.
- Unpkg: An alternative to JSDelivr that is reported to be more accessible in China. Users can switch from JSDelivr to Unpkg if needed; discussion and examples are available in the Tesseract.js issue tracker. Unpkg serves the same NPM packages, including the English data.
Local Copy: Users can opt to maintain a local copy of these files to avoid reliance on remote servers. In Node.js, relevant NPM packages can be added as dependencies, or files can be manually downloaded and hosted locally for web applications.
GitHub Pages (Deprecated): Previously, the default langPath was hosted on a GitHub Pages site, which is now outdated and unreliable due to size limits. New implementations should not reference this deprecated site and should transition to newer methods of file access.
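
As a rough sketch of the CDN options above, the helper below assembles a langPath for a single language. The URL layout (CDN root, then package name, then data set directory) and the two CDN roots are assumptions based on how these CDNs usually expose NPM packages; the helper name cdnLangPath is hypothetical. Verify the resulting URL against the Tesseract.js documentation before using it.

```javascript
// Assumed CDN roots for NPM packages; confirm before use.
const CDN_ROOTS = {
  jsdelivr: 'https://cdn.jsdelivr.net/npm',
  unpkg: 'https://unpkg.com',
};

// Hypothetical helper: builds a langPath for one language and data set.
// Assumption: each package contains one directory per data set.
function cdnLangPath(cdn, lang, dataSet = '4.0.0_best_int') {
  if (!Object.prototype.hasOwnProperty.call(CDN_ROOTS, cdn)) {
    throw new RangeError(`Unknown CDN: ${cdn}`);
  }
  return `${CDN_ROOTS[cdn]}/@tesseract.js-data/${lang}/${dataSet}`;
}

// Usage sketch, left commented out because it needs the tesseract.js package.
// Assumption: the createWorker(langs, oem, options) signature of recent releases.
// const { createWorker } = require('tesseract.js');
// const worker = await createWorker('eng', 1, { langPath: cdnLangPath('unpkg', 'eng') });
```

Since the path embeds the language code, this sketch suits a worker that loads a single language. For several languages, a local copy that keeps all .traineddata files in one directory may be simpler to point langPath at.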