Few-NERD: Introducing a Comprehensive Few-shot Named Entity Recognition Dataset
Overview
Few-NERD is a large-scale, fine-grained dataset for named entity recognition (NER). It is manually annotated and contains 188,200 sentences, 491,711 entities, and 4,601,223 tokens, covering 8 coarse-grained entity types that break down into 66 fine-grained types. Three benchmark tasks are built on the dataset: a supervised task, Few-NERD (SUP), and two few-shot tasks, Few-NERD (INTRA) and Few-NERD (INTER).
One highlight of Few-NERD is its context-aware manual annotation. For instance, in the sentence "London is the fifth album by the British rock band…", the entity London is labeled Art-Music rather than a location type, showing how context dictates the tag.
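Working with such labels starts with parsing the sentences; here is a minimal sketch, assuming a two-column token/label layout with "O" marking non-entity tokens (the exact release format may differ):

```python
# Sketch of parsing one sentence in a two-column "token<TAB>label" layout,
# with "O" marking non-entity tokens (an assumption about the release format).
def parse_sentence(lines):
    """Group contiguous tokens sharing a non-O label into (text, label) entities."""
    entities, tokens, label = [], [], None
    for line in lines:
        tok, lab = line.split("\t")
        if lab != "O" and lab == label:
            tokens.append(tok)          # extend the current entity span
        else:
            if tokens:
                entities.append((" ".join(tokens), label))
            tokens = [tok] if lab != "O" else []
            label = lab if lab != "O" else None
    if tokens:                          # flush a trailing entity
        entities.append((" ".join(tokens), label))
    return entities

sample = ["London\tart-music", "is\tO", "the\tO", "fifth\tO", "album\tO"]
print(parse_sentence(sample))  # [('London', 'art-music')]
```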
Getting Started
Requirements
To set up Few-NERD, you need a working Python installation. All dependencies can then be installed with a single command:
pip install -r requirements.txt
Few-NERD Dataset
The dataset is divided into different segments to cater to both supervised and few-shot learning settings:
- Supervised: the sentences are split randomly into train/dev/test.
- Few-Shot Inter (INTER): all 8 coarse-grained types are shared across train/dev/test, but the fine-grained types are disjoint between splits.
- Few-Shot Intra (INTRA): the split is made at the coarse-grained level, so train/dev/test contain disjoint coarse-grained types.
For convenience, datasets are downloaded automatically when a model is run, but they can also be fetched manually with the data/download.sh script.
Structure
The project is organized in a simple and understandable hierarchy:
- Various utility scripts for framework setup, data loading, and sampling.
- Model scripts, such as proto.py for the prototypical model and nnshot.py for the NNShot model.
- The primary training script, train_demo.py.
Key Implementations
- N-way K~2K Shot Sampler: a greedy sampling strategy that ensures each sampled class appears between K and 2K times in an episode; implemented in util/fewshotsampler.py.
- ProtoBERT, NNShot, and StructShot Models: BERT-based models implemented in separate scripts, with tweaks such as a Viterbi decoder (used by StructShot) for better performance on Few-NERD.
Running the Model
Running the models is straightforward with the train_demo.py script, which is configurable via numerous parameters such as:
- Training mode (inter, intra, or supervised)
- Shot settings (N, K)
- Model choices (proto, nnshot, or structshot)
Here is a sample command for running a 5-way 1-shot setting:
python3 train_demo.py --mode inter --lr 1e-4 --batch_size 8 --trainN 5 --N 5 --K 1 --Q 1 --train_iter 10000 --val_iter 500 --test_iter 5000 --val_step 1000 --max_length 64 --model structshot --tau 0.32
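Among the model choices, proto corresponds to prototypical networks. Their core classification rule can be sketched with NumPy, using random vectors as stand-ins for BERT embeddings (a toy illustration, not the repo's ProtoBERT):

```python
import numpy as np

# Toy illustration of the idea behind the "proto" choice: nearest-prototype
# classification, with random vectors standing in for BERT token embeddings.
rng = np.random.default_rng(0)
dim, K = 16, 5
support = {  # K support embeddings per class (well separated for the demo)
    "person": rng.normal(0.0, 1.0, (K, dim)),
    "location": rng.normal(3.0, 1.0, (K, dim)),
}
# each prototype is the mean of its class's support embeddings
prototypes = {c: e.mean(axis=0) for c, e in support.items()}

def classify(query):
    """Assign a query embedding to the class with the nearest prototype."""
    return min(prototypes, key=lambda c: float(np.linalg.norm(query - prototypes[c])))

print(classify(np.full(dim, 3.0)))  # "location"
```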
Citation
Researchers using Few-NERD in their work should cite it as follows:
@inproceedings{ding-etal-2021-nerd,
title = "Few-{NERD}: A Few-shot Named Entity Recognition Dataset",
author = "Ding, Ning and
Xu, Guangwei and
Chen, Yulin and
Wang, Xiaobin and
Han, Xu and
Xie, Pengjun and
Zheng, Haitao and
Liu, Zhiyuan",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.acl-long.248",
doi = "10.18653/v1/2021.acl-long.248",
pages = "3198--3213",
}
Conclusion
Few-NERD sets a new standard in named entity recognition by providing a comprehensive dataset for various learning settings. Its detailed annotations and large volume make it a valuable asset for researchers and developers looking to enhance their NER models. For further inquiries, the creators can be contacted at [email protected] or [email protected].