Few-NERD: Introducing a Comprehensive Few-shot Named Entity Recognition Dataset
Overview
Few-NERD is a large-scale, fine-grained dataset for named entity recognition (NER). It is manually annotated and contains 188,200 sentences, 491,711 entities, and 4,601,223 tokens, covering 8 coarse-grained entity types that break down into 66 fine-grained types. Three benchmark tasks are built on the dataset: a supervised task, Few-NERD (SUP), and two few-shot tasks, Few-NERD (INTRA) and Few-NERD (INTER).
One highlight of Few-NERD is its context-aware manual annotation. For instance, in the sentence "London is the fifth album by the British rock band…", the entity London is labeled Art-Music rather than a location type, showing how context dictates the tag.
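Working with such labels starts with parsing the sentences; here is a minimal sketch, assuming a two-column token/label layout with "O" marking non-entity tokens (the exact release format may differ):

```python
# Sketch of parsing one sentence in a two-column "token<TAB>label" layout,
# with "O" marking non-entity tokens (an assumption about the release format).
def parse_sentence(lines):
    """Group contiguous tokens sharing a non-O label into (text, label) entities."""
    entities, tokens, label = [], [], None
    for line in lines:
        tok, lab = line.split("\t")
        if lab != "O" and lab == label:
            tokens.append(tok)          # extend the current entity span
        else:
            if tokens:
                entities.append((" ".join(tokens), label))
            tokens = [tok] if lab != "O" else []
            label = lab if lab != "O" else None
    if tokens:                          # flush a trailing entity
        entities.append((" ".join(tokens), label))
    return entities

sample = ["London\tart-music", "is\tO", "the\tO", "fifth\tO", "album\tO"]
print(parse_sentence(sample))  # [('London', 'art-music')]
```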
Getting Started
Requirements
To set up Few-NERD, you need a working Python installation. All dependencies can then be installed with a single command:
pip install -r requirements.txt
Few-NERD Dataset
The dataset is divided into different segments to cater to both supervised and few-shot learning settings:
- Supervised: the sentences are split randomly into train/dev/test.
- Few-Shot Inter (INTER): all 8 coarse-grained types are shared across train/dev/test, but the fine-grained types are disjoint between splits.
- Few-Shot Intra (INTRA): the split is made at the coarse-grained level, so train/dev/test contain disjoint coarse-grained types.
For convenience, datasets are downloaded automatically when a model is run, but they can also be fetched manually with the data/download.sh script.
Structure
The project is organized in a simple and understandable hierarchy:
- Various utility scripts for framework setup, data loading, and sampling.
- Model scripts, such as proto.py for the prototypical model and nnshot.py for the NNShot model.
- The primary training script, train_demo.py.
Key Implementations
- N-way K~2K Shot Sampler: a greedy sampling strategy that ensures each sampled class appears between K and 2K times in an episode; implemented in util/fewshotsampler.py.
- ProtoBERT, NNShot, and StructShot Models: BERT-based models implemented in separate scripts, with tweaks such as a Viterbi decoder (used by StructShot) for better performance on Few-NERD.
Running the Model
Running the models is straightforward with the train_demo.py script, which is configurable via numerous parameters such as:
- Training mode (inter, intra, or supervised)
- Shot settings (N, K)
- Model choices (proto, nnshot, or structshot)
Here is a sample command for running a 5-way 1-shot setting:
python3 train_demo.py --mode inter --lr 1e-4 --batch_size 8 --trainN 5 --N 5 --K 1 --Q 1 --train_iter 10000 --val_iter 500 --test_iter 5000 --val_step 1000 --max_length 64 --model structshot --tau 0.32
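Among the model choices, proto corresponds to prototypical networks. Their core classification rule can be sketched with NumPy, using random vectors as stand-ins for BERT embeddings (a toy illustration, not the repo's ProtoBERT):

```python
import numpy as np

# Toy illustration of the idea behind the "proto" choice: nearest-prototype
# classification, with random vectors standing in for BERT token embeddings.
rng = np.random.default_rng(0)
dim, K = 16, 5
support = {  # K support embeddings per class (well separated for the demo)
    "person": rng.normal(0.0, 1.0, (K, dim)),
    "location": rng.normal(3.0, 1.0, (K, dim)),
}
# each prototype is the mean of its class's support embeddings
prototypes = {c: e.mean(axis=0) for c, e in support.items()}

def classify(query):
    """Assign a query embedding to the class with the nearest prototype."""
    return min(prototypes, key=lambda c: float(np.linalg.norm(query - prototypes[c])))

print(classify(np.full(dim, 3.0)))  # "location"
```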
Citation
Researchers using Few-NERD in their work should cite it as follows:
@inproceedings{ding-etal-2021-nerd,
title = "Few-{NERD}: A Few-shot Named Entity Recognition Dataset",
author = "Ding, Ning and
Xu, Guangwei and
Chen, Yulin and
Wang, Xiaobin and
Han, Xu and
Xie, Pengjun and
Zheng, Haitao and
Liu, Zhiyuan",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.acl-long.248",
doi = "10.18653/v1/2021.acl-long.248",
pages = "3198--3213",
}
Conclusion
Few-NERD sets a new standard in named entity recognition by providing a comprehensive dataset for various learning settings. Its detailed annotations and large volume make it a valuable asset for researchers and developers looking to enhance their NER models. For further inquiries, the creators can be contacted at [email protected] or [email protected].