bonito by BatsResearch

Synthetic data generator for instruction tuning datasets

created 1 year ago
783 stars

Top 45.6% on sourcepulse

Project Summary

Bonito is an open-source library for generating synthetic instruction tuning datasets from unannotated text without calling GPT models. It targets researchers and developers who need custom datasets for zero-shot task adaptation and want a lightweight, efficient tool.

How It Works

Bonito uses Hugging Face Transformers and vLLM for efficient inference. It performs conditional task generation: given raw text and a target task type, it produces structured training examples for that NLP task. Because no manual annotation is needed, diverse datasets for instruction tuning can be built quickly.
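
The conditional task generation step can be pictured with a small, self-contained sketch. It does not call Bonito or vLLM. The special tokens, the `{{context}}` placeholder, and the helper names `build_prompt` and `parse_generation` are assumptions made for illustration, and the "generation" is hand-written rather than real model output.

```python
# Illustrative sketch only: the special tokens below are assumptions,
# not a confirmed description of Bonito's prompt format.
TASK_TOKEN = "<|tasktype|>"
CONTEXT_TOKEN = "<|context|>"
GEN_TOKEN = "<|task|>"
PAIR_SEP = "<|pipe|>"


def build_prompt(context: str, task_type: str) -> str:
    """Condition the generator on a task type and a raw passage."""
    return (
        f"{TASK_TOKEN}\n{task_type.strip()}\n"
        f"{CONTEXT_TOKEN}\n{context.strip()}\n"
        f"{GEN_TOKEN}\n"
    )


def parse_generation(generation: str, context: str):
    """Split a generation into an (input, output) training pair.

    Returns None unless the text contains exactly one separator,
    so malformed samples can be dropped.
    """
    parts = generation.split(PAIR_SEP)
    if len(parts) != 2:
        return None
    instruction, answer = (part.strip() for part in parts)
    # Assumed convention: the model emits a placeholder instead of
    # copying the passage, and the placeholder is filled in afterwards.
    instruction = instruction.replace("{{context}}", context.strip())
    return {"input": instruction, "output": answer}


# Hypothetical generation, hand-written for this example.
passage = "The tenant must give 30 days notice before moving out."
fake_generation = (
    "{{context}}\nHypothesis: The tenant can leave without notice.\n"
    "Does the passage entail the hypothesis? <|pipe|> No"
)

prompt = build_prompt(passage, "nli")
pair = parse_generation(fake_generation, passage)
```

Each parsed pair is one instruction tuning example: the `input` field holds the passage plus the generated instruction, and the `output` field holds the generated answer.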

Quick Start & Requirements

Highlighted Details

  • Supports 16 task types, including natural language inference (NLI), question answering (QA), summarization, and text generation.
  • Built on vLLM for high-throughput, low-latency inference.
  • Offers a quantized version for accessibility on less powerful hardware.
  • Recent model update uses Meta Llama 3.1 as the base.

Maintenance & Community

Licensing & Compatibility

  • Apache 2.0 License.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The project describes itself as "lightweight," which may mean it is less suited to very large-scale or complex generation pipelines than fuller-featured frameworks. The quality of the synthetic data depends on the base model and on the sampling parameters.
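
To make the dependence on sampling parameters concrete, here is a minimal sketch of two common decoding controls, temperature and nucleus (top-p) filtering, applied to a toy four-token distribution. vLLM's `SamplingParams` exposes `temperature` and `top_p` settings, but the code below is a generic illustration written for this summary, not code from Bonito or vLLM, and the logits are made up.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits / temperature; lower values sharpen the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p,
    zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for a four-token vocabulary
sharp = apply_temperature(logits, 0.5)   # concentrates mass on the top token
flat = apply_temperature(logits, 1.5)    # spreads mass more evenly
nucleus = top_p_filter(apply_temperature(logits, 1.0), 0.9)  # drops the tail
```

Lower temperature and smaller top-p tend to give more conservative, repetitive generations, while higher values give more varied but noisier ones, which is one reason synthetic data quality shifts with these settings.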

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 14 stars in the last 90 days
