Introduction to Prodigy-OpenAI-Recipes
Prodigy-OpenAI-Recipes is a project that provides a set of example recipes showing how to efficiently create high-quality datasets by combining zero-shot and few-shot learning with minimal annotation effort. The recipes use large language models (LLMs) from OpenAI to generate initial predictions, then employ Prodigy, a data annotation tool, to refine and curate those predictions. This streamlined process lets users quickly compile a gold-standard dataset that can be used to train supervised models tailored to specific needs or use cases.
Project Background
The Prodigy-OpenAI-Recipes repository is no longer actively maintained, as the recipes have been integrated into the Prodigy toolset. Future upgrades are planned in conjunction with spacy-llm, which will improve the prompts and add support for multiple LLM providers. The focus has thus shifted to maintaining the recipes as part of the spaCy and Prodigy frameworks.
Setup and Installation
To use these recipes, users need to install Prodigy along with a few additional Python dependencies. Setup includes generating an API key from OpenAI to interact with the LLMs. The required environment variables, such as the OpenAI organization ID and secret API key, are typically set in a .env file so that OpenAI's services can be accessed seamlessly.
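As a minimal sketch of this setup step, the snippet below parses KEY=VALUE pairs from a .env file into the process environment using only the standard library (in practice a library such as python-dotenv does this; the variable names shown in the comment are the ones documented by the repository):

```python
import os

def load_dotenv(path=".env"):
    """Minimal .env loader: copy KEY=VALUE lines into os.environ,
    skipping blank lines and comments. Existing variables win."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip().strip('"'))

# Example .env contents:
# PRODIGY_OPENAI_ORG = "org-..."
# PRODIGY_OPENAI_KEY = "sk-..."
```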
Named-Entity Recognition (NER)
The NER section of the project leverages LLMs to automatically predict entities in a text, which can then be verified or adjusted in Prodigy. The ner.openai.correct recipe handles entity prediction with GPT-3 and allows manual correction, so a high-quality dataset is produced with minimal human intervention. Users can specify options to control the language, input segmentation, model type, and batch size, among other parameters.
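To illustrate the general idea, the sketch below parses an LLM completion of the form "Label: phrase1, phrase2" (one line per label) into character-offset entity spans over the original text. The actual recipe's prompt template and parser differ in detail; this is a simplified, hypothetical version:

```python
def parse_ner_response(text, response, labels):
    """Turn an LLM completion like 'Person: Alice, Bob' into
    character-offset spans over the original text."""
    allowed = {label.lower() for label in labels}
    spans = []
    for line in response.splitlines():
        if ":" not in line:
            continue
        label, _, phrases = line.partition(":")
        label = label.strip()
        if label.lower() not in allowed:
            continue  # ignore labels the user did not ask for
        for phrase in phrases.split(","):
            phrase = phrase.strip()
            if not phrase:
                continue
            start = text.find(phrase)
            if start != -1:  # keep only phrases that occur verbatim
                spans.append({"start": start, "end": start + len(phrase),
                              "label": label})
    return spans
```

Spans in this shape can be rendered directly by Prodigy's annotation UI for the annotator to accept or correct.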
Interactive Tuning with NER
Users can iteratively refine the predictions from the LLM by flagging incorrect examples within the Prodigy UI, which will then be adjusted in future predictions. This interactive process enriches the quality and precision of the generated dataset, ensuring that any systematic errors can be addressed and rectified efficiently.
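One way this feedback loop can work, sketched below under the assumption that flagged-and-corrected examples are replayed to the model as few-shot demonstrations (the real recipe's prompt template is more elaborate):

```python
def build_prompt(task_description, examples, text):
    """Assemble a few-shot prompt: corrected examples from earlier
    annotation rounds are included as demonstrations before the
    new input, steering the model away from repeated mistakes."""
    parts = [task_description, ""]
    for ex in examples:
        parts.append(f"Text: {ex['text']}")
        parts.append(f"Answer: {ex['answer']}")
        parts.append("")
    parts.append(f"Text: {text}")
    parts.append("Answer:")
    return "\n".join(parts)
```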
Text Categorization (Textcat)
The text categorization recipes allow quick classification of text into defined categories with the help of an LLM, along with an explanation for each chosen label. This is achieved through the textcat.openai.correct recipe, which supports both binary and multi-class categorization, using prompts that guide the language model's prediction.
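A hypothetical parser for such a completion might look like the sketch below, assuming the model is prompted to answer with an "answer:" line followed by a "reason:" line (the recipe's actual response format may differ):

```python
def parse_textcat_response(response):
    """Parse a completion of the form
    'answer: <label>\\nreason: <explanation>' into a dict."""
    result = {"answer": None, "reason": None}
    for line in response.splitlines():
        key, _, value = line.partition(":")
        key = key.strip().lower()
        if key in result:
            result[key] = value.strip()
    return result
```

Keeping the model's stated reason alongside the label gives annotators extra context when deciding whether to accept the classification.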
Textcat Interactive Tuning
As in the NER recipes, users can influence model predictions by selecting and flagging erroneous outputs. This real-time adjustment steers future predictions based on user feedback, and batch sizes can be tuned to optimize workflow speed.
Fetching Examples Up-Front
The project also provides functionality for fetching examples in batches up front, through the ner.openai.fetch and textcat.openai.fetch recipes. This is particularly useful for imbalanced datasets, as it allows users to surface rare but important examples without manually reviewing the entire dataset.
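The core of such a fetch step can be sketched as follows, with a pluggable `predict` callable standing in for the OpenAI call (the real recipes additionally handle rate limits and retries):

```python
import json

def fetch_examples(texts, predict, out_path, batch_size=10):
    """Run predictions over an input stream up front and cache them
    as JSONL, so a later annotation session can filter for the
    interesting (e.g. rare-class) examples."""
    with open(out_path, "w") as fh:
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            for text, pred in zip(batch, predict(batch)):
                fh.write(json.dumps({"text": text, "prediction": pred}) + "\n")
```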
Exporting and Training Models
Once the dataset is curated, users can export the annotations for further use, then train a text categorization model or an NER model with spaCy, a popular NLP library. The export can also be converted to spaCy's binary format to ease integration with spaCy's and Prodigy's training functionalities.
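As a rough sketch of the export step, the snippet below reads a Prodigy-style JSONL export (records with "text", "answer", and "spans" keys) and keeps only accepted examples as (text, entities) training tuples; in a real workflow, Prodigy's own commands handle this conversion, including emitting spaCy's binary format directly:

```python
import json

def jsonl_to_training_data(path):
    """Convert a Prodigy-style JSONL export into
    (text, {"entities": [(start, end, label), ...]}) tuples,
    keeping only examples the annotator accepted."""
    data = []
    with open(path) as fh:
        for line in fh:
            ex = json.loads(line)
            if ex.get("answer") != "accept":
                continue
            ents = [(s["start"], s["end"], s["label"])
                    for s in ex.get("spans", [])]
            data.append((ex["text"], {"entities": ents}))
    return data
```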
In summary, Prodigy-OpenAI-Recipes offers a practical approach to creating high-quality datasets using LLMs, helping to make the model training process faster and more efficient with limited manual input. This project underscores the synergy between advanced language models and data annotation tools like Prodigy to yield swift, accurate results for diverse NLP applications.