OpenGPT: A Comprehensive Overview
OpenGPT is a framework for creating instruction-based datasets and for training domain-specific conversational Large Language Models (LLMs). By leveraging OpenGPT, users can develop language models with a focus on domain expertise, making it an ideal tool for industries like healthcare where specialized knowledge is critical.
NHS-LLM: A Healthcare Model
One of the standout applications of OpenGPT is the NHS-LLM, a healthcare-focused conversational model. This model was trained using datasets created with OpenGPT and is designed to enhance medical communication. The datasets used to train NHS-LLM include:
- NHS UK Q/A: Contains 24,665 question and answer pairs, generated from the NHS UK website using prompt ID f53cf99826.
- NHS UK Conversations: Comprises 2,354 unique conversations, sourced from the NHS UK website and created with prompt ID f4df95ec69.
- Medical Task/Solution Pairs: Consists of 4,688 pairs created using GPT-4, guided by prompt ID 5755564c19.
Installation Guide
To begin using OpenGPT, you can install it through pip:
pip install opengpt
For those working with LLaMA models, additional requirements can be installed with:
pip install -r ./llama_train_requirements.txt
Tutorials and Learning Resources
OpenGPT provides comprehensive tutorials to assist users in creating mini conversational models, especially in the healthcare sector. A notable tutorial can be accessed via Google Colab, titled OpenGPT: The Making of Dum-E.
How to Create and Train a Model
Step 1: Collecting the Base Dataset
The process begins by collecting a foundational dataset within a specific domain. For instance, definitions of diseases can be gathered from trusted sources such as the NHS UK website. It is crucial that this dataset has a text column where each entry represents one definition.
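As a minimal sketch, a base dataset with the required text column could be assembled like this (the file name and sample definitions are illustrative, not part of any OpenGPT distribution):

```python
import csv

# Hypothetical disease definitions collected from a trusted source.
# Each entry in the "text" column holds exactly one definition.
definitions = [
    "Asthma is a common lung condition that causes occasional breathing difficulties.",
    "Diabetes is a lifelong condition that causes blood sugar levels to become too high.",
]

with open("base_dataset.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text"])
    writer.writeheader()
    for definition in definitions:
        writer.writerow({"text": definition})
```

Any tabular format works as long as each row of the text column carries one self-contained definition for the generation step to consume.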
Step 2: Choosing or Creating a Prompt
Next, users need to find or create a relevant prompt. This can be done by searching the prompt database or by utilizing the Prompt Creation Notebook. Once a suitable prompt is identified, users must edit the configuration file for dataset generation and execute the dataset generation process via a notebook.
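Conceptually, the generation step fills the chosen prompt with each entry of the base dataset. The sketch below illustrates that idea only; the template wording and the {context} placeholder are assumptions, not OpenGPT's actual prompt format, whose real prompts live in the prompt database and are selected by prompt ID:

```python
# Illustrative prompt template; not OpenGPT's real prompt format.
PROMPT_TEMPLATE = (
    "Based on the following medical text, write a question a patient might ask "
    "and a clear, accurate answer.\n\nText: {context}"
)

def build_requests(texts):
    """Fill the prompt template once per base-dataset entry."""
    return [PROMPT_TEMPLATE.format(context=t) for t in texts]

requests = build_requests([
    "Asthma is a common lung condition that causes occasional breathing difficulties.",
])
```

Each filled prompt is then sent to the generating model, and its responses are collected into the new instruction dataset.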
Step 3: Configuring the Training Setup
After datasets are prepared, the training configuration must be updated. This involves editing the train_config file to include the desired datasets.
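The idea can be sketched as registering the newly generated dataset files in the configuration; the key names below are assumptions for illustration, not the exact train_config schema:

```python
# Illustrative training configuration; key names are assumed, not OpenGPT's schema.
train_config = {
    "model_name": "base-model",          # hypothetical base checkpoint name
    "datasets": ["existing_dataset.csv"],
}

def add_datasets(config, paths):
    """Append newly generated dataset paths, skipping duplicates."""
    for path in paths:
        if path not in config["datasets"]:
            config["datasets"].append(path)
    return config

add_datasets(train_config, ["nhs_qa_generated.csv"])
```

In practice this is done by editing the train_config file directly so that the training scripts pick up every dataset you want included.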
Step 4: Training the Model
Finally, using the train notebook or relevant training scripts, users can train their LLMs on the newly created datasets.
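At a high level, training consumes the generated pairs as supervised examples. A minimal, framework-free sketch of that preparation step follows; the field names and role markers are illustrative assumptions, and the real training scripts handle tokenization and model specifics:

```python
# Illustrative formatting of Q/A pairs into supervised training strings.
# Field names and role markers are assumptions, not OpenGPT's actual format.
def to_training_examples(qa_pairs):
    """Format each Q/A pair as a single supervised training string."""
    examples = []
    for pair in qa_pairs:
        examples.append(
            f"<|user|> {pair['question']} <|assistant|> {pair['answer']}"
        )
    return examples

examples = to_training_examples([
    {"question": "What is asthma?",
     "answer": "A common lung condition that can cause breathing difficulties."},
])
```

The training scripts then fine-tune the chosen base model on these examples to produce the domain-specific conversational model.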
Additional Support and Information
For further inquiries or support, users are encouraged to visit Discourse, a platform for community-driven discussions and assistance.
OpenGPT stands as a powerful resource for anyone looking to venture into specialized conversational AI, offering tools and guidance to create robust and domain-specific language models.