lingua-go - Comprehensive Language Detection Solution for NLP Tasks

Introduction to Lingua-go

Lingua-go is a language detection library that identifies the language of a given text simply and efficiently. Language detection is often a pre-processing step in natural language processing (NLP) applications such as text classification and spell-checking. It can also be used to route emails to the geographically appropriate customer service department, based on the language the emails are written in.
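The email routing use case can be sketched in a few lines of Go. This is a minimal illustration, not part of Lingua-go: the detectFunc type, the routeEmail function, the queue names, and the use of ISO 639-1 codes are all assumptions made for the example, and the detector itself is passed in as a plain function so that any library could be plugged in.

```go
package main

import "fmt"

// detectFunc is a hypothetical stand-in for any language detector.
// It returns an ISO 639-1 code and reports whether detection succeeded.
type detectFunc func(text string) (code string, ok bool)

// departments maps language codes to illustrative support queues.
// The queue names are invented for this sketch.
var departments = map[string]string{
	"de": "support-dach",
	"fr": "support-france",
	"es": "support-iberia",
}

// routeEmail picks a queue for the email body. It falls back to a
// default queue when detection fails or the language is not mapped.
func routeEmail(body string, detect detectFunc) string {
	if code, ok := detect(body); ok {
		if queue, found := departments[code]; found {
			return queue
		}
	}
	return "support-global"
}

func main() {
	// A fake detector, used only to make the sketch runnable.
	fake := func(text string) (string, bool) {
		if text == "Guten Tag" {
			return "de", true
		}
		return "", false
	}
	fmt.Println(routeEmail("Guten Tag", fake)) // prints "support-dach"
	fmt.Println(routeEmail("???", fake))       // prints "support-global"
}
```

Keeping the detector behind a function type means the routing logic can be tested without loading any language models.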

Purpose of Lingua-go

Lingua-go was created to address limitations of other language detection solutions. Language identification often ships as part of a large machine learning framework or NLP application, but not every use case needs such a system. For cases where a simpler, more flexible solution is enough, Lingua-go is designed to be easy to use, with no extensive configuration and no connection to external services.

One of its main competitors in the Go ecosystem is the open-source library Whatlanggo, which has two notable shortcomings:

It requires lengthy text for accurate language detection, making it less suitable for short text segments like tweets.
Its accuracy drops as more languages take part in the detection process.

Lingua-go addresses these issues by providing accurate results for both short and long texts. It employs a combination of rule-based and statistical methods without relying on word dictionaries. Moreover, once downloaded, Lingua-go can operate entirely offline.
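To make the idea of a rule-based step concrete, the sketch below narrows down candidates by looking at the Unicode script of the input before any statistics are involved. It is a toy illustration under stated assumptions, not Lingua-go's actual implementation: the dominantScript function and the short script list are hypothetical, and a statistical model would still be needed afterwards to choose between languages that share a script.

```go
package main

import (
	"fmt"
	"unicode"
)

// scripts lists the Unicode scripts this toy rule checks.
// The selection is illustrative, not exhaustive.
var scripts = []struct {
	name  string
	table *unicode.RangeTable
}{
	{"Latin", unicode.Latin},
	{"Cyrillic", unicode.Cyrillic},
	{"Arabic", unicode.Arabic},
	{"Devanagari", unicode.Devanagari},
	{"Han", unicode.Han},
}

// dominantScript returns the name of the script that covers the most
// characters in text, or "" when no character belongs to a listed script.
func dominantScript(text string) string {
	counts := make(map[string]int)
	for _, r := range text {
		for _, s := range scripts {
			if unicode.Is(s.table, r) {
				counts[s.name]++
				break
			}
		}
	}
	best, bestCount := "", 0
	// Iterate the slice rather than the map so ties resolve deterministically.
	for _, s := range scripts {
		if counts[s.name] > bestCount {
			best, bestCount = s.name, counts[s.name]
		}
	}
	return best
}

func main() {
	fmt.Println(dominantScript("Привет, мир")) // prints "Cyrillic"
	fmt.Println(dominantScript("hello world")) // prints "Latin"
}
```

A rule like this needs no word dictionary, which is one reason such checks stay cheap even for very short inputs.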

Supported Languages

Lingua-go supports 75 languages, a set chosen to favor accuracy over sheer quantity: each supported language should reach a high accuracy rate before the list is expanded further. Some of the supported languages include:

Afrikaans
Arabic
Chinese
English
French
German
Hindi
Japanese
Russian
Spanish

The complete list includes a diverse array of languages, each subjected to thorough testing to ensure reliable detection.

Accuracy and Performance

Lingua-go's effectiveness is demonstrated through accuracy statistics derived from test data for each supported language. This data is categorized into three sections:

Single words with at least five characters.
Word pairs with a minimum of ten characters.
Complete grammatical sentences of varying lengths.

The language models and test data are created using the Wortschatz corpora from Leipzig University, Germany. Training involves data collected from news websites, with each corpus containing one million sentences. Testing uses data from various websites, with each test corpus comprising ten thousand sentences. From these, samples of 1,000 single words, 1,000 word pairs, and 1,000 sentences are randomly extracted.
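The extraction of test samples described above could look roughly like the following sketch. It is not the project's actual tooling: the function names are hypothetical, words are split on whitespace only (so punctuation stays attached), and the length of a word pair is assumed to include the separating space, which the text above does not specify.

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
	"unicode/utf8"
)

// longWords returns the whitespace-separated words of sentence that
// have at least minChars characters (counted as runes, not bytes).
func longWords(sentence string, minChars int) []string {
	var out []string
	for _, w := range strings.Fields(sentence) {
		if utf8.RuneCountInString(w) >= minChars {
			out = append(out, w)
		}
	}
	return out
}

// wordPairs returns adjacent word pairs of at least minChars characters.
// Assumption: the separating space counts toward the length.
func wordPairs(sentence string, minChars int) []string {
	words := strings.Fields(sentence)
	var out []string
	for i := 0; i+1 < len(words); i++ {
		pair := words[i] + " " + words[i+1]
		if utf8.RuneCountInString(pair) >= minChars {
			out = append(out, pair)
		}
	}
	return out
}

// sample draws up to n items at random without replacement.
// The input slice is left unmodified.
func sample(items []string, n int, seed int64) []string {
	rng := rand.New(rand.NewSource(seed))
	shuffled := append([]string(nil), items...)
	rng.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})
	if n > len(shuffled) {
		n = len(shuffled)
	}
	return shuffled[:n]
}

func main() {
	sentence := "the quick brown fox jumps over the lazy dog"
	words := longWords(sentence, 5)
	pairs := wordPairs(sentence, 10)
	fmt.Println(words)
	fmt.Println(sample(pairs, 2, 42))
}
```

For the real test sets, n would be 1,000 per category and the input would be the ten thousand sentences of a test corpus.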

In comparative tests against Whatlanggo and Google's CLD3 (accessed via the gocld3 bindings), Lingua-go exhibits top-tier accuracy. The results of these tests are presented as bar plots and box plots: the bar plots give the detailed accuracy results per language, and the box plots show the distribution of the accuracy values, with the median marked in each box.
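An accuracy figure of this kind is simply the share of test cases for which a detector returns the expected language. The sketch below shows that calculation under stated assumptions: the labeled type and the accuracy function are hypothetical names, the detector is a plain function returning a language name, and the test cases are made up for illustration rather than taken from the real benchmark.

```go
package main

import "fmt"

// labeled pairs a test text with the language it is known to be written in.
type labeled struct {
	Text string
	Lang string
}

// accuracy returns the percentage of cases for which detect returns
// the expected language. It returns 0 for an empty test set.
func accuracy(cases []labeled, detect func(string) string) float64 {
	if len(cases) == 0 {
		return 0
	}
	correct := 0
	for _, c := range cases {
		if detect(c.Text) == c.Lang {
			correct++
		}
	}
	return 100 * float64(correct) / float64(len(cases))
}

func main() {
	// Illustrative data only; not taken from the real test corpora.
	cases := []labeled{
		{"good morning", "English"},
		{"thank you", "English"},
		{"see you soon", "English"},
		{"guten Morgen", "German"},
	}
	// A deliberately naive detector that always answers "English".
	alwaysEnglish := func(string) string { return "English" }
	fmt.Printf("%.1f%%\n", accuracy(cases, alwaysEnglish)) // prints "75.0%"
}
```

Running the same function once per library and per category (single words, word pairs, sentences) yields the values that a bar or box plot would then display.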

Summary of Performance

Lingua-go's performance is consistently high, with robust accuracy across single words, word pairs, and sentences. Each language is carefully benchmarked, and the results are presented with detailed statistical metrics, including mean, median, and standard deviation, to provide a comprehensive view of its detection capabilities.
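For readers who want to reproduce such summary metrics from a list of per-language accuracy values, a minimal sketch follows. The summarize function and the input numbers are illustrative assumptions, not published results, and the population form of the standard deviation is assumed because the text does not say which form is used.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// summary holds the three metrics named in the text.
type summary struct {
	Mean, Median, StdDev float64
}

// summarize computes mean, median, and population standard deviation.
// It returns a zero summary for empty input and does not modify values.
func summarize(values []float64) summary {
	n := len(values)
	if n == 0 {
		return summary{}
	}
	sorted := append([]float64(nil), values...)
	sort.Float64s(sorted)

	var sum float64
	for _, v := range sorted {
		sum += v
	}
	mean := sum / float64(n)

	// For an even count, the median is the average of the two middle values.
	median := sorted[n/2]
	if n%2 == 0 {
		median = (sorted[n/2-1] + sorted[n/2]) / 2
	}

	var squares float64
	for _, v := range sorted {
		d := v - mean
		squares += d * d
	}
	// Assumption: population standard deviation (divide by n, not n-1).
	return summary{Mean: mean, Median: median, StdDev: math.Sqrt(squares / float64(n))}
}

func main() {
	// Made-up values chosen so the arithmetic is easy to check by hand.
	s := summarize([]float64{2, 4, 4, 4, 5, 5, 7, 9})
	fmt.Printf("mean=%.2f median=%.2f stddev=%.2f\n", s.Mean, s.Median, s.StdDev)
	// prints "mean=5.00 median=4.50 stddev=2.00"
}
```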

In summary, Lingua-go is designed to overcome typical challenges in language detection, offering simplicity, flexibility, and high accuracy in identifying the language of a text, whether short or long. Its offline capability and its support for 75 languages make it an attractive choice for developers seeking efficient language processing solutions in the Go ecosystem.