Introduction to h2o-wizardlm
h2o-wizardlm is an open-source project designed to turn documents into question-and-answer pairs for fine-tuning large language models (LLMs). It creates complex instructions automatically from existing instruction-tuned LLMs, with the goal of enabling truly open ChatGPT-like models. Because it builds on models and data licensed under Apache 2.0, it avoids terms-of-service violations of the kind associated with Vicuna or ShareGPT.
Key Features
- Input Options: The system requires an instruction-tuned LLM and can optionally take seed prompts. Support for entire document corpora is planned.
- Output: It generates a set of sophisticated instruction prompts along with corresponding responses.
The project is inspired by academic research, detailed in this paper.
How It Works
The h2o-wizardlm project takes a starting or seed prompt, such as "What's trending in science & technology?" It then auto-generates complex prompts. One example could involve a scenario where an AI researcher must craft an extensive study on AI in healthcare, using academic sources, conducting critical evaluations, and proposing informed recommendations.
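The evolution step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the rewrite templates, the evolve_prompt function, and the fake_llm stand-in are all hypothetical and are not taken from wizardlm.py. Any callable that maps a prompt string to generated text can play the role of the instruction-tuned LLM.

```python
from typing import Callable, List

# Hypothetical rewrite instructions; the project's actual templates may differ.
EVOLVE_TEMPLATES = [
    "Rewrite the following prompt so that it requires multi-step reasoning:\n{prompt}",
    "Add one concrete constraint or requirement to the following prompt:\n{prompt}",
]


def evolve_prompt(seed: str, generate: Callable[[str], str], rounds: int = 2) -> List[str]:
    """Return [seed, evolved_1, ...] by repeatedly asking the LLM for a harder rewrite."""
    history = [seed]
    for i in range(rounds):
        # Cycle through the templates, always rewriting the most recent prompt.
        template = EVOLVE_TEMPLATES[i % len(EVOLVE_TEMPLATES)]
        candidate = generate(template.format(prompt=history[-1])).strip()
        if not candidate:
            break  # stop early if the model returns nothing usable
        history.append(candidate)
    return history


def fake_llm(text: str) -> str:
    """Stand-in for a real model: echoes the prompt line with a marker appended."""
    return text.splitlines()[-1] + " (more complex)"
```

Calling evolve_prompt("What's trending in science & technology?", fake_llm) returns the seed followed by two rewritten prompts. With a real model, each round would ideally yield a more demanding instruction than the previous one.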
Getting Started
To use h2o-wizardlm, set up a Python 3.10 environment and install the dependencies with the following command:
pip install -r requirements.txt
Next, users can create a WizardLM dataset by modifying the base model and the desired number of rows in the wizardlm.py file, and then executing:
python wizardlm.py
This process generates a file named wizard_lm.uuid.json, where "uuid" is a randomly generated string. Sample output files are available for reference.
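Since the file name contains a random string, a small helper can locate the newest output. The sketch below assumes only the wizard_lm.*.json naming pattern mentioned above. The helper name load_latest_output is hypothetical, and nothing is assumed about the JSON structure beyond it being valid JSON.

```python
import json
from pathlib import Path
from typing import Any, Optional


def load_latest_output(directory: str = ".") -> Optional[Any]:
    """Load the most recently modified wizard_lm.*.json file, or None if none exists."""
    candidates = sorted(
        Path(directory).glob("wizard_lm.*.json"),
        key=lambda p: p.stat().st_mtime,  # oldest first, newest last
    )
    if not candidates:
        return None
    with candidates[-1].open(encoding="utf-8") as fh:
        return json.load(fh)
```

Inspecting the returned object (for example its type and length) is a reasonable first step before feeding it into a fine-tuning pipeline.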
Current Limitations
- The system can be slow, even when using pipelines and batching.
- It relies on a well-tuned instruction LLM for optimal prompt generation.
- Prompt generation works well, but the associated responses can occasionally be empty, particularly with the model junelee/wizard-vicuna-13b.
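One way to work around the empty-response limitation is to filter the generated records before fine-tuning. The sketch below assumes each record is a dictionary. The "output" key is a hypothetical field name and should be replaced with whatever key the generated file actually uses.

```python
from typing import Dict, Iterable, List


def drop_empty_responses(
    records: Iterable[Dict[str, str]],
    response_key: str = "output",  # hypothetical field name; adjust to the real schema
) -> List[Dict[str, str]]:
    """Keep only records whose response is present and not just whitespace."""
    return [r for r in records if str(r.get(response_key) or "").strip()]
```

Filtering after generation is simpler than retrying inside the generation loop, at the cost of ending up with fewer rows than requested.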
Future Developments
The h2o-wizardlm team aims to enhance the project by:
- Increasing the speed of the system.
- Improving the quality of generated responses.
- Introducing complexity control for prompts.
- Developing capabilities to handle input in addition to instructions, supporting tasks like summarization or code generation.
- Employing a complete Apache 2.0 workflow, using only Apache 2.0 models such as OpenLLaMA and data such as oasst1, to continuously refine the system.
In summary, h2o-wizardlm supports LLM fine-tuning by generating high-complexity instructional content while adhering to open-source principles.