bonito by BatsResearch

Synthetic data generator for instruction tuning datasets

Created 1 year ago
816 stars

Top 43.4% on SourcePulse

Project Summary

Bonito is an open-source library for generating synthetic instruction tuning datasets from unannotated text without relying on GPT models. It targets researchers and developers who need custom datasets for zero-shot task adaptation and want a lightweight, efficient way to build them.

How It Works

Bonito builds on Hugging Face Transformers and vLLM for efficient inference. It uses conditional task generation: given a passage of raw text and a target task type, the model generates a task-specific instruction and response, turning unannotated text into structured training data. This produces diverse datasets without manual annotation, which speeds up the development of instruction-tuned models.
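The conditional task generation flow can be sketched in plain Python. This is a minimal illustration and not the library's implementation: the marker strings, the `TrainingExample` type, and the `fake_model` stand-in are all hypothetical names chosen for this sketch, and they may not match the prompt format the Bonito model was trained on.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical marker strings, assumed for this sketch only.
TASK_TYPE_MARK = "<|tasktype|>"
CONTEXT_MARK = "<|context|>"
TASK_MARK = "<|task|>"
PAIR_SEP = "<|pipe|>"
CONTEXT_SLOT = "{{context}}"


@dataclass
class TrainingExample:
    instruction: str
    response: str


def build_prompt(context: str, task_type: str) -> str:
    """Condition the generator on both the task type and the raw passage."""
    return (
        f"{TASK_TYPE_MARK}\n{task_type.strip()}\n"
        f"{CONTEXT_MARK}\n{context.strip()}\n"
        f"{TASK_MARK}\n"
    )


def parse_generation(generation: str, context: str) -> Optional[TrainingExample]:
    """Split a generation into (instruction, response); drop malformed output."""
    parts = generation.split(PAIR_SEP)
    if len(parts) != 2:
        return None
    instruction, response = (part.strip() for part in parts)
    if not instruction or not response:
        return None
    # Substitute the original passage wherever the generator left a slot for it.
    instruction = instruction.replace(CONTEXT_SLOT, context.strip())
    return TrainingExample(instruction, response)


def fake_model(prompt: str) -> str:
    """Stand-in for the real generator so the sketch runs without a GPU."""
    return (
        "{{context}}\nQuestion: What does the passage describe? "
        "<|pipe|> A synthetic data library."
    )


passage = "Bonito converts unannotated text into instruction tuning data."
prompt = build_prompt(passage, "qa")
example = parse_generation(fake_model(prompt), passage)
```

In a real pipeline the `fake_model` call would be replaced by batched inference over every passage in the corpus, and examples for which `parse_generation` returns `None` would be discarded.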

Quick Start & Requirements

Highlighted Details

  • Supports 16 task types, including NLI, QA, summarization, and text generation.
  • Built on vLLM for high-throughput, low-latency inference.
  • Offers a quantized version for accessibility on less powerful hardware.
  • Recent model update uses Meta Llama 3.1 as the base.

Maintenance & Community

Licensing & Compatibility

  • Apache 2.0 License.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The library is described as "lightweight," so it may be less suited to very large-scale or complex generation pipelines than fuller-featured frameworks. The quality of the synthetic data depends on the base model and on the sampling parameters used at generation time.
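To make the dependence on sampling parameters concrete, the sketch below shows how two common ones, temperature and top-p (nucleus) sampling, reshape a next-token distribution. It is a standalone illustration in plain Python and not code from Bonito or vLLM; the function names and the toy logits are assumptions made for this example.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Softmax over logits / temperature; lower values sharpen the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass  # renormalise the surviving tokens
    return filtered


def sample(logits, temperature=1.0, top_p=1.0, rng=None):
    """Draw one token index after temperature scaling and top-p filtering."""
    rng = rng or random.Random()
    probs = top_p_filter(apply_temperature(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.0]  # toy values, not taken from any model
flat = apply_temperature(toy_logits, 1.0)
sharp = apply_temperature(toy_logits, 0.5)
nucleus = top_p_filter(flat, 0.5)
```

With these toy logits, lowering the temperature moves more probability onto the top token, and a small `top_p` removes the tail entirely. Conservative settings tend to give more uniform generations, while looser settings give more varied but potentially noisier ones.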

Health Check

  • Last Commit: 6 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 6 stars in the last 30 days
