Introduction to nlg-eval
The nlg-eval project offers a comprehensive solution for evaluating natural language generated by machines. It simplifies the process by providing a range of automated metrics that help determine the quality of machine-generated text. This is essential in fields like dialogue systems, chatbots, and machine translation, where assessing the output of natural language generation (NLG) systems is crucial. nlg-eval accepts a hypothesis file containing generated sentences and one or more reference files with the correct sentences, comparing them to compute various evaluation metrics.
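The input convention is simple: one sentence per line in the hypothesis file, with each reference file aligned line by line to it. The sketch below writes a toy hypothesis file and one reference file to illustrate that layout; the file names and sentences are invented for this example.
from pathlib import Path
# Hypothesis file: one generated sentence per line.
Path("hyp.txt").write_text("the cat sat on the mat\na man rides a bicycle\n")
# Reference file: line i corresponds to line i of the hypothesis file.
# Additional reference files follow the same alignment.
Path("ref1.txt").write_text("a cat is sitting on the mat\na man is riding a bike\n")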
Supported Metrics
nlg-eval supports a variety of metrics:
- BLEU: Measures n-gram matching between the hypothesis and references.
- METEOR: Considers synonyms and stemming in its evaluation, making it more flexible.
- ROUGE: Commonly used for summarization, checking overlap of subsequences.
- CIDEr: Considers consensus within the reference set and importance of each term.
- SPICE: Focuses on semantic comparisons rather than surface forms.
- Cosine Similarity Approaches: Includes SkipThought, Embedding Average, and Vector Extrema, which compare semantic similarity through vector-space representations.
- Greedy Matching Score: Measures semantic overlaps by greedily matching words between hypothesis and references.
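Once nlg-eval is installed and set up (see Setup Instructions below), all of these metrics are computed in one pass. A hedged sketch of what that looks like from Python, using the example files shipped with the repository:
from nlgeval import compute_metrics
# Compares the hypothesis file against both reference files and returns a
# dictionary of scores. Typical keys include Bleu_1..Bleu_4, METEOR, ROUGE_L,
# CIDEr, SkipThoughtCS, EmbeddingAverageCosineSimilarity,
# VectorExtremaCosineSimilarity, and GreedyMatchingScore (exact names may
# vary by version).
metrics = compute_metrics(hypothesis='examples/hyp.txt',
                          references=['examples/ref1.txt', 'examples/ref2.txt'])
print(metrics)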
Setup Instructions
Setting up nlg-eval is straightforward:
- Ensure Java 1.8.0+ is installed.
- Install the required Python packages using pip:
pip install git+https://github.com/Maluuba/nlg-eval.git@master
- For macOS High Sierra or later, enable multithreading:
export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
- Run the setup command to download essential data and models:
nlg-eval --setup
For a custom setup, especially if you want to specify where the data is stored, use the nlg-eval --setup ${data_path} command and adjust the NLGEVAL_DATA environment variable if needed.
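If you do use a custom data location, one way to make it visible to the library is to set NLGEVAL_DATA before loading any models. The sketch below does this from Python; the path is illustrative and assumes it matches the ${data_path} passed to nlg-eval --setup.
import os
# Illustrative custom location; must match the directory used with nlg-eval --setup.
os.environ["NLGEVAL_DATA"] = "/data/nlgeval"
# Set the variable before loading models so look-ups can resolve to the custom path.
from nlgeval import NLGEval
nlgeval = NLGEval()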
Validating and Testing
After setup, ensure all data files are downloaded correctly by checking their hashes or running pytest to confirm that everything is functioning properly.
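A lighter-weight smoke test is to score a single toy sentence pair from Python; if the setup data is in place, this should return a dictionary of scores rather than raise an error. A minimal sketch, with invented sentences:
from nlgeval import compute_individual_metrics
# Arguments are (list of reference strings, hypothesis string).
scores = compute_individual_metrics(["a man is riding a bike"], "a man rides a bicycle")
print(sorted(scores))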
Usage
nlg-eval offers both command-line and Python API interfaces:
- Standalone Command Line: Run evaluations by providing the hypothesis and reference files:
nlg-eval --hypothesis=examples/hyp.txt --references=examples/ref1.txt --references=examples/ref2.txt
- Python API: Use the functional API for a single sentence or corpus, or employ the object-oriented API for repeated calls within a script, as sketched below. This flexibility accommodates both simple and advanced use cases.
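The functional calls (compute_metrics for files and compute_individual_metrics for single sentences) are sketched in the sections above. For repeated calls, the object-oriented API loads the models once and reuses them; a minimal sketch, with invented example sentences:
from nlgeval import NLGEval
# Loading the models is the expensive step, so do it once and reuse the object.
nlgeval = NLGEval()
# Score one hypothesis against its list of reference strings.
scores = nlgeval.compute_individual_metrics(["a man is riding a bike"], "a man rides a bicycle")
print(scores)
# The constructor also accepts options for skipping slower metrics; see the
# project README for the exact flags.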
Practical Considerations
The CIDEr metric's behavior changes with the idf parameter: by default, IDF statistics are computed from the supplied references, so scores on very small reference sets can be unreliable. For a more stable evaluation with limited examples, adjust this setting per the guidelines in the project documentation.
Licensing and Conduct
The project complies with the Microsoft Open Source Code of Conduct, ensuring a welcoming and inclusive environment for all contributors.
In summary, nlg-eval is a robust tool for evaluating the quality of natural language generation, making it indispensable for researchers and developers working with machine-generated text. Its diverse metrics and user-friendly setup make it accessible for tackling various evaluation challenges in NLG projects.