Introduction to the SAGE Project
SAGE, an acronym for Spell checking via Augmentation and Generative distribution Emulation, is a comprehensive toolkit for correcting spelling errors across multiple languages. The project combines advanced machine learning models with powerful data augmentation techniques to offer accurate spell checking and realistic error emulation.
Features of SAGE
1. Spelling Correction Models
SAGE provides state-of-the-art pre-trained transformer models specifically designed for spelling correction. These models include:
- sage-fredt5-large
- sage-fredt5-distilled-95m
- sage-mt5-large
- sage-m2m100-1.2B
- Additional earlier releases such as T5-large and FRED-T5-large
These models are readily available for testing and can be explored through interactive demos provided by SAGE.
2. Spelling Corruption and Augmentation
To enhance the training and performance of spelling correction tools, SAGE uses spelling corruption techniques. The two main approaches are:
- Statistic-based Spelling Corruption (SBSC): This method mimics human error patterns by analyzing a corpus of naturally occurring errors and reproducing their statistics (how many errors occur per sentence, where they fall, and of what type) on clean text.
- Augmentex: This approach uses rule-based methods and common error patterns to introduce realistic mistakes into text. It operates at both word-level and character-level granularity to simulate typing errors.
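To make the statistic-based idea concrete, here is a minimal self-contained sketch of SBSC-style corruption. It is a toy stand-in, not the SAGE API: the function names, the restriction to character substitutions, and the example pairs are all illustrative. Real SBSC models richer statistics (error counts, positions, and types), but the core loop is the same: estimate an error distribution from aligned (clean, noisy) pairs, then replay it on clean text.

```python
import random

def learn_error_stats(pairs):
    """Estimate a per-character substitution rate from aligned
    (clean, noisy) sentence pairs of equal length."""
    substitutions = 0
    chars = 0
    for clean, noisy in pairs:
        chars += len(clean)
        substitutions += sum(a != b for a, b in zip(clean, noisy))
    return {"sub_rate": substitutions / chars}

def corrupt(text, stats, rng):
    """Introduce character substitutions at the learned rate,
    mimicking the frequency of naturally occurring errors."""
    out = []
    for ch in text:
        if ch.isalpha() and rng.random() < stats["sub_rate"]:
            out.append(rng.choice("abcdefghijklmnopqrstuvwxyz"))
        else:
            out.append(ch)
    return "".join(out)

# Two aligned example pairs, each containing one substitution error.
pairs = [("the cat sat", "the cat sau"), ("hello world", "hellp world")]
stats = learn_error_stats(pairs)
rng = random.Random(0)
noisy = corrupt("a clean reference sentence", stats, rng)
```

Because only substitutions are modeled, the corrupted output keeps the input's length; a fuller sketch would also sample insertions, deletions, and transpositions from the reference corpus.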
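A rule-based augmenter in the spirit of Augmentex can likewise be sketched with a couple of character-level rules. Everything below is illustrative (the rule names, the tiny keyboard map, and `augment_word` are not the library's API); the point is the mechanism: pick a rule that encodes a common typing mistake and apply it at a random position.

```python
import random

# A tiny QWERTY-neighbour map; a real augmenter would cover the full keyboard.
KEY_NEIGHBORS = {"a": "qs", "e": "wr", "o": "ip", "t": "ry"}

def swap_adjacent(word, i):
    """Transpose two neighbouring characters (a classic typo)."""
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def fat_finger(word, i, rng):
    """Replace a character with one of its keyboard neighbours."""
    ch = word[i]
    if ch in KEY_NEIGHBORS:
        return word[:i] + rng.choice(KEY_NEIGHBORS[ch]) + word[i + 1:]
    return word

def augment_word(word, rng):
    """Apply one randomly chosen rule at a random position."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    rule = rng.choice([swap_adjacent, lambda w, j: fat_finger(w, j, rng)])
    return rule(word, i)

rng = random.Random(42)
noisy = " ".join(augment_word(w, rng) for w in "errors appear in text".split())
```

Both rules preserve word length, so the output stays aligned with the input; word-level rules (splitting, gluing, case changes) follow the same pattern one level up.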
3. Evaluation Tools
SAGE offers robust evaluation mechanisms to assess the performance and accuracy of spelling correction tools. This involves testing models against diverse datasets containing both real-world and synthetic errors.
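The essence of such an evaluation can be illustrated with a toy word-level scorer (a simplified stand-in for SAGE's actual metrics, with illustrative names): compare each model output against the source and the gold correction, then report precision, recall, and F1 over the changes the model made.

```python
def word_level_scores(sources, predictions, golds):
    """Toy word-level scorer: a change the model made correctly is a true
    positive, a wrong change is a false positive, and a needed change the
    model skipped is a false negative."""
    tp = fp = fn = 0
    for src, pred, gold in zip(sources, predictions, golds):
        for s, p, g in zip(src.split(), pred.split(), gold.split()):
            if p != s and p == g:      # corrected an error
                tp += 1
            elif p != s and p != g:    # changed a word wrongly
                fp += 1
            elif p == s and s != g:    # missed a needed correction
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# The model fixes one "teh", mangles "sat", and misses the second "teh".
precision, recall, f1 = word_level_scores(
    ["teh cat sat on teh mat"],
    ["the cat sit on teh mat"],
    ["the cat sat on the mat"],
)
```

Real evaluation protocols are more careful (they align words rather than assume equal token counts, and handle splits and merges), but the accounting follows this shape.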
Installation and Usage
To begin using SAGE, clone the project's repository and install it with the Python package manager pip. Detailed instructions for both regular and editable installations are provided, making it easy to set up the project according to your preferences.
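The commands follow the usual clone-and-pip pattern; the repository URL below is the project's public GitHub location and should be checked against the official README:

```shell
# Clone the repository (URL assumed from the project's public GitHub home).
git clone https://github.com/ai-forever/sage.git
cd sage

# Regular install:
pip install .

# Or an editable install, so local changes take effect without reinstalling:
pip install -e .
```

The editable variant is the usual choice when experimenting with or contributing to the corruptors and evaluation code.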
Demonstrations and Use Cases
SAGE provides quick demos that illustrate how text can be corrupted with SBSC and Augmentex, and how the resulting errors are corrected by the provided models. Users can simulate corruptions on texts in various languages, giving a practical demonstration of SAGE's capabilities.
Evaluation and Benchmarking
The SAGE project includes comprehensive evaluation protocols to validate the performance of its models on open benchmarks. This allows developers to measure the quality of spelling correction under realistic conditions, ensuring that the tools are both effective and reliable.
Recent Updates
SAGE continues to evolve with the latest updates, including:
- Acceptance of the SAGE paper at the EACL 2024 conference.
- Release of SAGE v1.1.0, with detailed release notes available for users to explore new features and improvements.
Spelling Correction Methodology
SAGE's methodology emphasizes the use of large parallel corpora with synthetically generated errors for pre-training, supplemented by fine-tuning on datasets containing natural human-made errors. This two-step approach maximizes the accuracy and robustness of the spelling correction models across different languages and domains.
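The two-step data recipe can be sketched as follows. This is a schematic illustration, not SAGE's implementation: the corruptor is a trivial stand-in for SBSC or Augmentex, and all function names are hypothetical. Stage one manufactures (noisy, clean) pairs from clean text at scale; stage two draws pairs from genuinely observed human errors.

```python
import random

def toy_corrupt(text, rng):
    """Drop one random character: a trivial stand-in for a real corruptor."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i + 1:]

def build_pretraining_pairs(clean_corpus, rng):
    """Stage 1: synthesize (noisy, clean) pairs from clean text only."""
    return [(toy_corrupt(s, rng), s) for s in clean_corpus]

def build_finetuning_pairs(natural_errors):
    """Stage 2: pairs come from genuinely observed human errors."""
    return [(noisy, clean) for noisy, clean in natural_errors]

rng = random.Random(7)
pretrain = build_pretraining_pairs(
    ["clean sentence one", "clean sentence two"], rng
)
finetune = build_finetuning_pairs([("teh cat", "the cat")])
```

The division of labour is the point: synthetic corruption supplies unlimited cheap training signal, while the smaller natural-error set anchors the model to the mistakes people actually make.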
Supported Languages
Currently, SAGE supports models and datasets primarily for Russian and English. The models have been tested and validated on a mixture of datasets, including:
- For Russian: RUSpellRU, MultidomainGold, MedSpellChecker, GitHubTypoCorpusRu
- For English: BEA60K, JFLEG
These datasets cover diverse domains ranging from social media and literary works to medical texts and GitHub commit messages, offering a comprehensive testing ground for the models.
In conclusion, SAGE stands as a versatile tool that integrates cutting-edge machine learning with an innovative approach to spelling error correction and emulation, catering to a wide range of linguistic applications.