Project Introduction to Lingua-rs
What is Lingua-rs?
Lingua-rs is a language detection library with a single job: identifying the language of a given piece of text. This is a common preprocessing step in natural language processing applications, where tasks such as text classification and spell checking depend on knowing the language of the input. Language detection is also useful operationally, for example to route incoming emails to the customer service department for the right region based on the language each email is written in.
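As a rough illustration of the email-routing use case, here is a minimal sketch. The `Language` enum, the queue names, and the `detect` callback are all hypothetical stand-ins invented for this example; they are not Lingua-rs's API. In a real application the callback would wrap whatever detector you use.

```rust
/// Hypothetical subset of languages, defined here for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language {
    English,
    French,
    German,
    Spanish,
}

/// Maps a detected language to a made-up support queue name.
pub fn support_queue(language: Option<Language>) -> &'static str {
    match language {
        Some(Language::French) => "support-fr",
        Some(Language::German) => "support-de",
        Some(Language::Spanish) => "support-es",
        // English and undetected text both fall back to the default queue.
        Some(Language::English) | None => "support-default",
    }
}

/// Routes an email body using any detector passed in as a function.
/// `detect` is an assumption: any `Fn(&str) -> Option<Language>` will do.
pub fn route_email<F>(body: &str, detect: F) -> &'static str
where
    F: Fn(&str) -> Option<Language>,
{
    support_queue(detect(body))
}
```

Taking the detector as a parameter keeps the routing logic independent of any particular detection library, and makes the fallback for undetected text explicit.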
Why Does Lingua-rs Exist?
Language detection is usually offered as one feature inside a larger machine learning or natural language processing framework. Not everyone needs everything those frameworks provide, and not everyone wants the steep learning curve that comes with them. Lingua-rs fills this gap by offering language detection on its own, in a compact and flexible package.
The Rust ecosystem already has other language detection libraries, such as CLD2, Whatlang, and Whichlang. Many of them struggle with short snippets of text, such as tweets, and their accuracy declines as more languages are added to the mix. Lingua-rs is designed to address both problems: it aims for accurate results on long and short text alike, down to a single word or phrase, and it needs neither complex configuration nor an internet connection. Once downloaded, it works entirely offline.
Supported Languages
Lingua-rs adopts a philosophy of "quality over quantity," emphasizing high accuracy in a smaller set of languages before expanding coverage. Currently, it can detect 75 languages, ranging alphabetically from Afrikaans and Arabic to Yoruba and Zulu, and including widely used languages such as English, French, Chinese, and Spanish.
How Accurate is Lingua-rs?
Accuracy is the most important property of a language detection library, and it is where Lingua-rs is strongest. The project ships test data for each supported language, split into three categories: single words, word pairs, and complete sentences. Both the language models and the test data are derived from the Wortschatz corpora provided by Leipzig University. The training data consists of large collections of news articles, while the test data is drawn from the content of various websites, so the two sets come from different sources.
In the project's comparisons, Lingua-rs is more accurate than CLD2, Whatlang, and Whichlang across all three categories: single words, word pairs, and full sentences. The accompanying charts and plots show it as the most accurate of the four, both when all 75 supported languages are considered and when the comparison is restricted to the 16 languages that all of these libraries support.
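To make the per-category accuracy figures concrete, the following is a minimal sketch of how such a score could be computed. It is not the project's actual evaluation code: `Category`, `Sample`, and `accuracy_by_category` are names invented for this example, languages are represented as plain string codes, and `detect` is a placeholder for any detector function.

```rust
use std::collections::HashMap;

/// The three test-data categories described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Category {
    SingleWord,
    WordPair,
    Sentence,
}

/// One labelled test item (hypothetical structure for this sketch).
pub struct Sample<'a> {
    pub text: &'a str,
    /// The true language, as a plain code such as "en" (an assumption).
    pub expected: &'a str,
    pub category: Category,
}

/// Returns, per category, the fraction of samples for which `detect`
/// yields the expected language. Categories with no samples are absent
/// from the result.
pub fn accuracy_by_category<'a, F>(samples: &[Sample<'a>], detect: F) -> HashMap<Category, f64>
where
    F: Fn(&str) -> Option<&'static str>,
{
    // Per category: (number correct, number seen).
    let mut counts: HashMap<Category, (usize, usize)> = HashMap::new();
    for sample in samples {
        let entry = counts.entry(sample.category).or_insert((0, 0));
        entry.1 += 1;
        // A detector that returns None counts as a miss.
        if detect(sample.text).map_or(false, |lang| lang == sample.expected) {
            entry.0 += 1;
        }
    }
    counts
        .into_iter()
        .map(|(category, (correct, total))| (category, correct as f64 / total as f64))
        .collect()
}
```

Scoring each category separately matters because a detector can do well on full sentences while still failing on single words, and an overall average would hide that difference.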
Detection Performance
- Single Word Detection: Lingua-rs outperforms the other libraries on the shortest text segments, an area where many libraries struggle.
- Word Pair Detection: It maintains high accuracy even when identifying languages through two-word combinations.
- Sentence Detection: Lingua-rs delivers robust performance in determining the language of complete sentences.
Taken together, these categories show that Lingua-rs handles a broad range of text lengths and is more accurate overall than the libraries it was compared with, which makes it a strong choice for developers who need a dependable language detection library.