h2o-wizardlm - Enhance LLM Fine-Tuning by Converting Documents into High-Complexity Q:A Pairs

Introduction to h2o-wizardlm

h2o-wizardlm is an open-source project for turning documents into question-and-answer pairs that can be used to fine-tune large language models (LLMs). It automatically creates complex instructions from existing instruction-tuned LLMs, with the goal of enabling truly open ChatGPT-like models. Because it builds on models and data licensed under Apache 2.0, it avoids terms-of-service conflicts of the kind associated with Vicuna or ShareGPT.

Key Features

Input Options: An instruction-tuned LLM is required, and seed prompts are optional. Support for entire document corpora is planned.
Output: It generates a set of sophisticated instruction prompts along with corresponding responses.

The project is inspired by academic research, detailed in this paper.

How It Works

h2o-wizardlm starts from a seed prompt, such as "What's trending in science & technology?", and automatically generates more complex prompts from it. One generated example describes a scenario in which an AI researcher must write an extensive study on AI in healthcare, drawing on academic sources, evaluating them critically, and proposing informed recommendations.
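The idea of growing a seed prompt into a more complex one can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the rewriting templates, the evolve_prompt helper, and the fake_llm stand-in are hypothetical and are not taken from wizardlm.py. In practice the llm argument would wrap a real instruction-tuned model.

```python
import random
from typing import Callable, List, Optional

# Hypothetical rewriting instructions; the project's actual templates may differ.
EVOLUTION_TEMPLATES = [
    "Rewrite the following prompt so that it requires multi-step reasoning:\n{prompt}",
    "Rewrite the following prompt by adding one more constraint:\n{prompt}",
    "Rewrite the following prompt so that it goes deeper into the topic:\n{prompt}",
]


def evolve_prompt(
    seed: str,
    llm: Callable[[str], str],
    rounds: int = 3,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return the seed followed by each successively rewritten prompt."""
    rng = rng or random.Random(0)
    history = [seed]
    for _ in range(rounds):
        template = rng.choice(EVOLUTION_TEMPLATES)
        candidate = llm(template.format(prompt=history[-1])).strip()
        if not candidate:
            # Stop early if the model returns nothing usable.
            break
        history.append(candidate)
    return history


def fake_llm(text: str) -> str:
    # Stand-in for a real model: returns the prompt part with one extra requirement.
    return text.split("\n", 1)[1] + " Cite your sources."


if __name__ == "__main__":
    for step in evolve_prompt("What's trending in science & technology?", fake_llm, rounds=2):
        print(step)
```

Passing the model as a plain callable keeps the loop independent of any particular inference library, so the same sketch works with a local pipeline or a remote endpoint.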

Getting Started

To use h2o-wizardlm, set up a Python 3.10 environment and install the dependencies with the following command:

pip install -r requirements.txt

Next, create a WizardLM dataset by setting the base model and the desired number of rows in wizardlm.py, and then running:

python wizardlm.py

This process generates a file named wizard_lm.uuid.json, where "uuid" is a randomly generated string. Sample output files are available for reference.
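The output naming scheme can be sketched as follows. This is an illustration only: the save_dataset helper is hypothetical, and both the use of uuid.uuid4() and the row layout are assumptions rather than details confirmed by the project.

```python
import json
import uuid
from pathlib import Path


def save_dataset(rows, out_dir="."):
    """Write rows to wizard_lm.<uuid>.json and return the resulting path."""
    # Assumption: a random UUID4 is used for the middle part of the file name.
    path = Path(out_dir) / f"wizard_lm.{uuid.uuid4()}.json"
    path.write_text(json.dumps(rows, indent=2), encoding="utf-8")
    return path


if __name__ == "__main__":
    # Assumption: each row pairs an instruction with a response.
    print(save_dataset([{"instruction": "example prompt", "output": "example response"}]))
```

A random name of this kind means repeated runs do not overwrite earlier datasets.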

Current Limitations

The system can be slow, even when using pipelines and batching.
It relies on a well-tuned instruction LLM for optimal prompt generation.
Prompt generation works well, but the corresponding responses are sometimes empty, particularly with the model junelee/wizard-vicuna-13b.
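One way to work around empty responses is to filter them out before fine-tuning. The sketch below assumes each generated row is a dictionary whose response is stored under an "output" key. That key name and the drop_empty_responses helper are assumptions for illustration, so check them against an actual output file.

```python
def drop_empty_responses(rows, response_key="output"):
    """Return (kept_rows, dropped_count), discarding rows with blank responses."""
    # Missing keys, None values, and whitespace-only strings all count as empty.
    kept = [row for row in rows if str(row.get(response_key) or "").strip()]
    return kept, len(rows) - len(kept)


if __name__ == "__main__":
    sample = [
        {"instruction": "Explain batching.", "output": "Batching groups inputs together."},
        {"instruction": "Explain pipelines.", "output": "   "},
        {"instruction": "Explain seeds."},
    ]
    kept, dropped = drop_empty_responses(sample)
    print(f"kept {len(kept)} rows, dropped {dropped}")
```

Reporting the number of dropped rows makes it easier to see how strongly a given base model is affected.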

Future Developments

The h2o-wizardlm team aims to enhance the project by:

Increasing the speed of the system.
Improving the quality of generated responses.
Introducing complexity control for prompts.
Developing capabilities to handle input in addition to instructions, supporting tasks like summarization or code generation.
Employing a complete Apache 2.0 workflow, using only openly licensed models and data such as OpenLLaMA and oasst1, and refining the system iteratively.

In summary, h2o-wizardlm generates high-complexity instruction data for LLM fine-tuning while relying on openly licensed models and data.