Bonito Project Introduction
Overview
Bonito is an open-source project for converting unannotated text into task-specific training datasets. These datasets are used for instruction tuning, which lets models adapt to new tasks without task-specific annotated data, a setting known as zero-shot task adaptation.
Features
Bonito is built on top of Hugging Face's transformers and vllm libraries, which allow it to generate synthetic datasets efficiently. Its features include a user-friendly interface for dataset creation, a rich set of supported task types, and seamless integration with existing machine learning pipelines.
Key Components
- Paper: The foundational paper, titled "Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation," provides an in-depth explanation of the methodologies and is available on arXiv.
- Model: Bonito uses a model referred to as bonito-v1, which can be accessed and downloaded from Hugging Face.
- Demo: For a practical demonstration of what Bonito can achieve, the online demo "Bonito on Spaces" allows users to interact with the model and see results first-hand.
- Dataset: Bonito relies on the ctga-v1 dataset, hosted on Hugging Face, which is the conditional task generation data used to train the model.
- Code: The codebase used in the research paper is openly accessible for anyone interested in reproducing the reported experiments.
Latest Developments
- August 2024: A new version of the Bonito model was released, using Meta Llama 3.1 as the base model.
- June 2024: The Bonito paper was accepted to Findings of the Association for Computational Linguistics (ACL 2024).
Installation
Bonito can be installed in a Python environment. Begin by creating a conda environment with the required Python version, then install Bonito itself using pip.
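A minimal sketch of these steps is shown below; the Python version and the install-from-source command are assumptions, so check the project's README for the exact supported commands.

```bash
# Create and activate a dedicated environment (Python version is an assumption)
conda create -n bonito python=3.9
conda activate bonito

# Install Bonito with pip, here from a local clone of the repository
pip install -e .
```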
Usage
Once installed, Bonito can be used to generate synthetic datasets by importing the library, initializing the Bonito model, and specifying the dataset and parameters for task generation. Users can choose among various task types, aiding in the rapid creation of instruction tuning datasets tailored to specific needs.
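The sketch below illustrates this flow. It assumes the Bonito class and generate_tasks method exposed by the library, the BatsResearch/bonito-v1 checkpoint on Hugging Face, and an example unannotated dataset; adapt the dataset, column name, and sampling parameters to your own data.

```python
from bonito import Bonito
from vllm import SamplingParams
from datasets import load_dataset

# Initialize Bonito with the released model checkpoint
bonito = Bonito("BatsResearch/bonito-v1")

# Load a small sample of unannotated text (dataset and config names are illustrative)
unannotated_text = load_dataset(
    "BatsResearch/bonito-experiment",
    "unannotated_contract_nli",
)["train"].select(range(10))

# Sampling parameters are passed through to vllm
sampling_params = SamplingParams(max_tokens=256, top_p=0.95, temperature=0.5, n=1)

# Generate a synthetic instruction tuning dataset conditioned on the text
synthetic_dataset = bonito.generate_tasks(
    unannotated_text,
    context_col="input",   # column holding the unannotated passages
    task_type="nli",       # short form for natural language inference
    sampling_params=sampling_params,
)
```

The resulting synthetic_dataset can then be used to instruction tune a model for the chosen task type.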
Supported Task Types
Bonito supports a diverse array of task types, ranging from extractive question answering and multiple-choice questions to sentiment analysis and text generation. Users can specify task types using either their full names or their abbreviated forms.
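As an illustration, the same generation call can name the task type either way; the short code used here is an assumption, so consult the library's task type table for the authoritative mapping.

```python
# Full task name
synthetic_dataset = bonito.generate_tasks(
    unannotated_text,
    context_col="input",
    task_type="extractive question answering",
    sampling_params=sampling_params,
)

# Assumed abbreviated form of the same task type
synthetic_dataset = bonito.generate_tasks(
    unannotated_text,
    context_col="input",
    task_type="exqa",
    sampling_params=sampling_params,
)
```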
Tutorials
Bonito offers tutorials to help new users get started. One tutorial demonstrates using a quantized version of the model in Google Colab on a T4 instance. Another guides users through running Bonito on an A100 GPU, also via Google Colab.
Citation
Researchers using Bonito are encouraged to cite the original paper in their work to acknowledge the contribution of the developers and researchers behind Bonito.
By offering these comprehensive capabilities as part of its open-source library, Bonito stands out as an invaluable tool for researchers and developers aiming to streamline the creation of instruction-tuned datasets.