bonito by BatsResearch

Synthetic data generator for instruction tuning datasets

Created 1 year ago
791 stars

Top 44.4% on SourcePulse

Project Summary

Bonito is an open-source library for generating synthetic instruction tuning datasets from unannotated text, eliminating the need for GPT models. It targets researchers and developers looking to create custom datasets for zero-shot task adaptation, offering a lightweight and efficient solution.

How It Works

Bonito leverages Hugging Face Transformers and vLLM for efficient inference. It employs a conditional task generation approach, converting raw text into structured training data for specific NLP tasks. This method allows for the creation of diverse datasets without manual annotation, accelerating the development of instruction-tuned models.
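The conditional task generation step described above can be sketched as follows. This is a minimal illustration, not Bonito's actual code: the marker strings, the `{{context}}` placeholder, and the helper names (`build_prompt`, `parse_generation`, `make_example`, `fake_generate`) are all assumptions made for this example, and the model call is replaced by a stand-in function so the sketch runs without a GPU.

```python
from typing import Callable, Dict

# Hypothetical marker strings, assumed for illustration only.
TASK_TYPE_TAG = "<|tasktype|>"
CONTEXT_TAG = "<|context|>"
TASK_TAG = "<|task|>"
PIPE_TAG = "<|pipe|>"


def build_prompt(context: str, task_type: str) -> str:
    """Condition the generator on a task type and an unannotated passage."""
    return (
        f"{TASK_TYPE_TAG}\n{task_type.strip()}\n"
        f"{CONTEXT_TAG}\n{context.strip()}\n"
        f"{TASK_TAG}\n"
    )


def parse_generation(generation: str, context: str) -> Dict[str, str]:
    """Split a raw generation into an (input, output) training pair."""
    if PIPE_TAG not in generation:
        raise ValueError("generation has no input/output separator")
    instruction, answer = generation.split(PIPE_TAG, 1)
    # Assumed convention: the generator refers to the passage via a placeholder.
    instruction = instruction.replace("{{context}}", context)
    return {"input": instruction.strip(), "output": answer.strip()}


def make_example(
    context: str, task_type: str, generate: Callable[[str], str]
) -> Dict[str, str]:
    """Turn one raw passage into one instruction tuning example."""
    prompt = build_prompt(context, task_type)
    return parse_generation(generate(prompt), context)


def fake_generate(prompt: str) -> str:
    """Stand-in for a model call; returns a fixed, hand-written generation."""
    return "{{context}}\nQuestion: What does the passage describe? <|pipe|> A library."


example = make_example("Bonito is a library.", "qa", fake_generate)
print(example)
```

In a real pipeline, `fake_generate` would be replaced by a call into an inference engine, and the same `make_example` loop would run over every passage in the unannotated corpus.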

Quick Start & Requirements

Highlighted Details

  • Supports 17 diverse task types including NLI, QA, summarization, and text generation.
  • Built on vLLM for high-throughput, low-latency inference.
  • Offers a quantized version for accessibility on less powerful hardware.
  • Recent model update uses Meta Llama 3.1 as the base.

Maintenance & Community

Licensing & Compatibility

  • Apache 2.0 License.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The project describes itself as "lightweight," which may indicate limits when handling very large-scale or complex generation workloads compared with more full-featured frameworks. The quality of the synthetic data depends on the base model and on the sampling parameters used during generation.
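To make the role of sampling parameters concrete, the sketch below shows how two common ones, temperature and top-p, reshape a next-token distribution. It is a generic illustration and not Bonito's or vLLM's implementation; the token names, logit values, and function names are made up for this example.

```python
import math
from typing import Dict


def apply_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Convert logits to probabilities, sharpened (T < 1) or flattened (T > 1)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}


def top_p_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    # Renormalize so the surviving tokens form a valid distribution.
    return {tok: p / total for tok, p in kept.items()}


logits = {"a": 2.0, "b": 1.0, "c": 0.0}  # made-up values
cold = apply_temperature(logits, 0.5)
hot = apply_temperature(logits, 2.0)
print(cold["a"] > hot["a"])  # lower temperature concentrates mass on the top token
print(sorted(top_p_filter(apply_temperature(logits, 1.0), 0.9)))
```

Lower temperature and smaller top-p make generations more repetitive but more predictable; higher values increase diversity at the risk of noisier examples, which is one reason the resulting dataset quality varies with these settings.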

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 4 stars in the last 30 days
