bonito by BatsResearch

Synthetic data generator for instruction tuning datasets

Created 2 years ago

823 stars

Top 43.0% on SourcePulse

3 Experts Love This Project

Maxime Labonne

Head of Post-Training at Liquid AI

Omar Sanseviero

DevRel at Google DeepMind

Casper Hansen

Author of AutoAWQ

Project Summary

Bonito is an open-source library that generates synthetic instruction tuning datasets from unannotated text without relying on GPT models. It targets researchers and developers who need custom datasets for zero-shot task adaptation and want a lightweight way to produce them.

How It Works

Bonito leverages Hugging Face Transformers and vLLM for efficient inference. It employs a conditional task generation approach, converting raw text into structured training data for specific NLP tasks. This method allows for the creation of diverse datasets without manual annotation, accelerating the development of instruction-tuned models.
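The conditional task generation idea can be illustrated with a small, model-free sketch: a prompt is built from a task type plus a raw passage, and the generated text is split into an instruction and a response. The marker strings, the `{{context}}` placeholder, and the helper names below are assumptions made for illustration and are not confirmed to match the library's internal format; `fake_generate` is a hypothetical stand-in for a real model call.

```python
from typing import Callable, Dict

# Hypothetical marker strings, chosen for illustration only.
TASK_TYPE_MARKER = "<|tasktype|>"
CONTEXT_MARKER = "<|context|>"
TASK_MARKER = "<|task|>"
SEPARATOR = "<|pipe|>"


def build_prompt(context: str, task_type: str) -> str:
    """Condition the generator on both the task type and the raw passage."""
    return (
        f"{TASK_TYPE_MARKER}\n{task_type.strip()}\n"
        f"{CONTEXT_MARKER}\n{context.strip()}\n"
        f"{TASK_MARKER}\n"
    )


def parse_generation(generation: str, context: str) -> Dict[str, str]:
    """Split generated text into an instruction ("input") and a response ("output")."""
    instruction, sep, response = generation.partition(SEPARATOR)
    if not sep:
        raise ValueError("generation is missing the instruction/response separator")
    return {
        # The generated instruction may refer back to the passage via a placeholder.
        "input": instruction.replace("{{context}}", context).strip(),
        "output": response.strip(),
    }


def make_example(
    context: str, task_type: str, generate: Callable[[str], str]
) -> Dict[str, str]:
    """Turn one unannotated passage into one instruction tuning example."""
    prompt = build_prompt(context, task_type)
    return parse_generation(generate(prompt), context)


def fake_generate(prompt: str) -> str:
    """Stand-in for a real model call; returns a canned NLI-style task."""
    return "{{context}}\nDoes this imply the library is free to use? <|pipe|> Yes"


example = make_example("Bonito is released under Apache 2.0.", "nli", fake_generate)
print(example)
```

Swapping `fake_generate` for a function that calls an actual language model would give the same flow end to end: the task type steers what kind of example is produced, and the passage supplies the content.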

Quick Start & Requirements

Install via pip: pip3 install bonito-llm
Requires Python. GPU with CUDA is recommended for performance.
Demo: Bonito on Spaces
Tutorial: Quantized Model on Colab, A100 GPU Tutorial
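After installation, a typical run might look like the sketch below. It assumes the package exposes a `Bonito` class with a `generate_tasks` method taking `context_col`, `task_type`, and `sampling_params`, and that a checkpoint named `BatsResearch/bonito-v1` is available; these details are recalled from the project's README and should be checked against the installed version. The wrapper function `generate_synthetic_tasks` and the sampling values are hypothetical choices for illustration.

```python
def generate_synthetic_tasks(texts, task_type="nli", model_name="BatsResearch/bonito-v1"):
    """Generate synthetic instruction tuning examples from raw passages.

    Assumes bonito-llm, vllm, and datasets are installed and a CUDA GPU is available.
    """
    # Imports are deferred so that defining this function does not require the GPU stack.
    from bonito import Bonito  # assumed entry point of the bonito-llm package
    from datasets import Dataset
    from vllm import SamplingParams

    # Wrap the raw passages in a Hugging Face dataset with a single text column.
    dataset = Dataset.from_dict({"input": list(texts)})

    # Load the generator model (assumed checkpoint name).
    bonito = Bonito(model_name)

    # Illustrative sampling settings; tune these for your own data.
    sampling_params = SamplingParams(max_tokens=256, top_p=0.95, temperature=0.5, n=1)

    # Assumed call signature; verify against the library's documentation.
    return bonito.generate_tasks(
        dataset,
        context_col="input",
        task_type=task_type,
        sampling_params=sampling_params,
    )
```

Calling `generate_synthetic_tasks(["Some unannotated passage."])` on a GPU machine would, under these assumptions, return a dataset of generated instruction and response pairs.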

Highlighted Details

Supports 16 task types, including NLI, QA, summarization, and text generation.
Built on vLLM for high-throughput, low-latency inference.
Offers a quantized version for accessibility on less powerful hardware.
Recent model update uses Meta Llama 3.1 as the base.

Maintenance & Community

Accepted to ACL Findings 2024.
Active development with recent model updates.
Project paper available: Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation

Licensing & Compatibility

Apache 2.0 License.
Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The library is described as "lightweight," so it may be less suited to very large-scale or complex generation pipelines than fuller-featured frameworks. The quality of the synthetic data depends on the base model and on the sampling parameters used during generation.
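To make the dependence on sampling parameters concrete, here is a minimal, library-independent sketch of how temperature and top-p (nucleus) filtering reshape a next-token distribution. The logits and parameter values are made up for illustration, and the function names are hypothetical; real inference engines implement these steps internally.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize so the surviving tokens form a valid distribution.
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.5)
flat = softmax_with_temperature(logits, 1.5)

print("temperature 0.5:", [round(p, 3) for p in sharp])
print("temperature 1.5:", [round(p, 3) for p in flat])
print("top-p 0.8 on the flatter distribution keeps tokens:", sorted(top_p_filter(flat, 0.8)))
```

Lower temperatures concentrate probability on the top candidate, which tends to give more conservative and repetitive generations, while higher temperatures and larger top-p values admit more candidates and more variety, along with more risk of off-target outputs.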

Health Check

Last Commit

7 months ago

Responsiveness

1 day

Star History

5 stars in the last 30 days