Introduction to nlg-eval
The nlg-eval project offers a comprehensive solution for evaluating natural language generated by machines. It simplifies the process by providing a range of automated metrics that help determine the quality of machine-generated text. This is essential in fields like dialogue systems, chatbots, and machine translation, where assessing the output of natural language generation (NLG) systems is crucial. nlg-eval accepts a hypothesis file containing generated sentences and one or more reference files with the correct sentences, comparing them to compute various evaluation metrics.
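The input convention is simple: one sentence per line in the hypothesis file, with each reference file aligned line by line to it. The sketch below writes a toy hypothesis file and one reference file to illustrate that layout; the file names and sentences are invented for this example.
from pathlib import Path
# Hypothesis file: one generated sentence per line.
Path("hyp.txt").write_text("the cat sat on the mat\na man rides a bicycle\n")
# Reference file: line i corresponds to line i of the hypothesis file.
# Additional reference files follow the same alignment.
Path("ref1.txt").write_text("a cat is sitting on the mat\na man is riding a bike\n")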
Supported Metrics
nlg-eval supports a variety of metrics:
- BLEU: Measures n-gram matching between the hypothesis and references.
- METEOR: Considers synonyms and stemming in its evaluation, making it more flexible.
- ROUGE: Commonly used for summarization, checking overlap of subsequences.
- CIDEr: Considers consensus within the reference set and importance of each term.
- SPICE: Focuses on semantic comparisons rather than surface forms.
- Cosine Similarity Approaches: Includes SkipThought, Embedding Average, and Vector Extrema, which compare semantic similarity through vector-space representations.
- Greedy Matching Score: Measures semantic overlaps by greedily matching words between hypothesis and references.
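Once nlg-eval is installed and set up (see Setup Instructions below), all of these metrics are computed in one pass. A hedged sketch of what that looks like from Python, using the example files shipped with the repository:
from nlgeval import compute_metrics
# Compares the hypothesis file against both reference files and returns a
# dictionary of scores. Typical keys include Bleu_1..Bleu_4, METEOR, ROUGE_L,
# CIDEr, SkipThoughtCS, EmbeddingAverageCosineSimilarity,
# VectorExtremaCosineSimilarity, and GreedyMatchingScore (exact names may
# vary by version).
metrics = compute_metrics(hypothesis='examples/hyp.txt',
                          references=['examples/ref1.txt', 'examples/ref2.txt'])
print(metrics)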
Setup Instructions
Setting up nlg-eval is straightforward:
- Ensure Java 1.8.0+ is installed.
- Install the required Python packages using pip:
pip install git+https://github.com/Maluuba/nlg-eval.git@master
- For macOS High Sierra or later, enable multithreading:
export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
- Run the setup command to download essential data and models:
nlg-eval --setup
For a custom setup, especially if you want to specify where the data is stored, use the nlg-eval --setup ${data_path} command and adjust the NLGEVAL_DATA environment variable if needed.
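If you do use a custom data location, one way to make it visible to the library is to set NLGEVAL_DATA before loading any models. The sketch below does this from Python; the path is illustrative and assumes it matches the ${data_path} passed to nlg-eval --setup.
import os
# Illustrative custom location; must match the directory used with nlg-eval --setup.
os.environ["NLGEVAL_DATA"] = "/data/nlgeval"
# Set the variable before loading models so look-ups can resolve to the custom path.
from nlgeval import NLGEval
nlgeval = NLGEval()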
Validating and Testing
After setup, ensure all data files are downloaded correctly by checking their hashes or running pytest to confirm that everything is functioning properly.
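A lighter-weight smoke test is to score a single toy sentence pair from Python; if the setup data is in place, this should return a dictionary of scores rather than raise an error. A minimal sketch, with invented sentences:
from nlgeval import compute_individual_metrics
# Arguments are (list of reference strings, hypothesis string).
scores = compute_individual_metrics(["a man is riding a bike"], "a man rides a bicycle")
print(sorted(scores))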
Usage
nlg-eval offers both command-line and Python API interfaces:
- Standalone Command Line: Run evaluations by providing the hypothesis and reference files:
nlg-eval --hypothesis=examples/hyp.txt --references=examples/ref1.txt --references=examples/ref2.txt
- Python API: Use the functional API for a single sentence or corpus, or employ the object-oriented API for repeated calls within a script, as sketched below. This flexibility accommodates both simple and advanced use cases.
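The functional calls (compute_metrics for files and compute_individual_metrics for single sentences) are sketched in the sections above. For repeated calls, the object-oriented API loads the models once and reuses them; a minimal sketch, with invented example sentences:
from nlgeval import NLGEval
# Loading the models is the expensive step, so do it once and reuse the object.
nlgeval = NLGEval()
# Score one hypothesis against its list of reference strings.
scores = nlgeval.compute_individual_metrics(["a man is riding a bike"], "a man rides a bicycle")
print(scores)
# The constructor also accepts options for skipping slower metrics; see the
# project README for the exact flags.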
Practical Considerations
The CIDEr metric's behavior changes with the idf parameter: by default, IDF statistics are computed from the supplied references, so scores on very small reference sets can be unreliable. For a more stable evaluation with limited examples, adjust this setting per the guidelines in the project documentation.
Licensing and Conduct
The project complies with the Microsoft Open Source Code of Conduct, ensuring a welcoming and inclusive environment for all contributors.
In summary, nlg-eval is a robust tool for evaluating the quality of natural language generation, making it indispensable for researchers and developers working with machine-generated text. Its diverse metrics and user-friendly setup make it accessible for tackling various evaluation challenges in NLG projects.