Datasets for Entity Recognition
Overview
The "entity-recognition-datasets" project is a resource for researchers and developers working on entity recognition and Named Entity Recognition (NER). The repository collects datasets from a range of domains, each annotated with its own set of entity types. Active addition of new datasets stopped in 2020, but contributions are still welcome via issues or pull requests.
Datasets for NER in English
The repository catalogues a long list of English-language NER datasets, giving the domain, license, reference, and availability of each. Examples include:
- CONLL 2003: A news domain dataset, widely regarded as a classic in NER research.
- MUC-6 and OntoNotes 5: Established benchmarks with detailed annotations; MUC-6 is drawn from news text, while OntoNotes 5 spans several domains.
- Ritter and BTC: Built from Twitter data, where short, informal text makes entity recognition harder.
- i2b2 series and CADEC: Medical-domain datasets for work on medical text processing.
- MITRestaurant and MITMovie: Query-style datasets built around restaurant and movie-related questions.
Each entry links to the dataset and notes its format (for example, the CoNLL 2003 format), so researchers can locate and process the data easily.
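To make the format concrete, here is a minimal sketch of reading CoNLL 2003-style data in Python. It assumes the common layout of one token per line, whitespace-separated columns with the NER tag last, blank lines between sentences, and optional -DOCSTART- lines between documents. Individual datasets in the repository may deviate from this layout. The function names read_conll and extract_entities and the sample sentence are hypothetical, written for illustration only.

```python
from typing import Iterable, List, Optional, Tuple

Token = Tuple[str, str]  # (word, NER tag)


def read_conll(lines: Iterable[str]) -> List[List[Token]]:
    """Group CoNLL-style lines into sentences of (word, tag) pairs."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for raw in lines:
        line = raw.strip()
        # Blank lines end a sentence; -DOCSTART- lines mark document boundaries.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # Assumption: first column is the token, last column is the NER tag.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence: List[Token]) -> List[Tuple[str, str]]:
    """Collect (entity text, entity type) spans from B-/I-/O tags."""
    entities: List[Tuple[str, str]] = []
    words: List[str] = []
    label: Optional[str] = None
    for word, tag in sentence:
        prefix, _, kind = tag.partition("-")
        # A span starts at any B- tag, or at an I- tag whose type differs
        # from the span currently being built (covers IOB1-style tagging).
        starts_new = prefix == "B" or (prefix == "I" and kind != label)
        if tag == "O" or starts_new:
            if words and label is not None:
                entities.append((" ".join(words), label))
            words, label = [], None
        if tag != "O":
            words.append(word)
            label = kind
    if words and label is not None:
        entities.append((" ".join(words), label))
    return entities


# Made-up sample in the four-column layout (token, POS, chunk, NER tag).
sample = """-DOCSTART- -X- -X- O

Alice NNP B-NP B-PER
Smith NNP I-NP I-PER
visited VBD B-VP O
Paris NNP B-NP B-LOC
. . O O
"""

for sent in read_conll(sample.splitlines()):
    print(extract_entities(sent))
```

Because the reader keeps only the first and last columns, the same sketch should also handle two-column files (token and tag) that some datasets use, though each dataset's documentation is the authority on its actual layout.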
Datasets for NER in Other Languages
The repository goes beyond English, offering resources for multiple languages and dialects. This is essential for multilingual NLP tasks and research. Highlights include:
- German: Incorporates datasets like CoNLL 2003 and GermEval 2014, key for understanding NER in German texts.
- Dutch and Spanish: Includes the CoNLL 2002 dataset, covering news articles and other structured texts.
- Portuguese and French: Offers datasets like HAREM and ESTER, aimed at diverse text types from literature to legal documents.
Special Categories and Multilingual Datasets
Some collections target more specialized scenarios:
- Code-Switching Datasets: For NER on mixed-language text, such as English-Spanish tweets.
- Historical and Fictional Texts: Projects like LitBank, which annotates entities in literary works, along with datasets built from historical documents.
- Multilingual Corpora: WikiNER and DAWT, which cover multiple languages using Wikipedia as a source.
Licensing Information
Licensing details determine how each dataset may legally be used and redistributed. The repository notes the license for every dataset, ranging from public domain to more restricted terms, such as those of datasets distributed through the LDC (Linguistic Data Consortium). This lets users check the terms of use before working with a dataset.
Contribution and Expansion
Although the maintainers stopped adding new datasets themselves in 2020, community contributions remain welcome. Researchers and practitioners can suggest new datasets or improve existing entries, keeping this NER resource a collaborative effort.
Conclusion
The "entity-recognition-datasets" project is a solid starting point for anyone working on NER. It spans many domains, languages, and specific use cases, supporting both understanding of and progress on entity recognition tasks. Because it remains open to community contributions, it continues to serve as a valuable asset in the field of Natural Language Processing.