datasetGPT  by radi-cho

CLI tool for generating textual/conversational datasets using LLMs

created 2 years ago
301 stars

Top 89.6% on sourcepulse

Project Summary

datasetGPT is a command-line interface and Python library for generating textual and conversational datasets using Large Language Models (LLMs). It targets researchers and developers needing to create training data for AI content detectors, analyze LLM behavior, or automate LLM-driven text generation tasks. The tool simplifies the process of querying multiple LLM backends with varying parameters to produce diverse datasets.

How It Works

The tool leverages a flexible backend system, supporting APIs from OpenAI (GPT-3, GPT-4), Cohere, and Petals. Users define prompts and specify LLM parameters like temperature and max length. For conversational datasets, it simulates dialogues between two configurable agents, allowing for control over agent roles, conversation length, and interruption strategies (fixed length or end phrase). The CLI and Python library generate datasets by systematically exploring combinations of specified parameters and backends.
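
The two interruption strategies described above can be illustrated with a small, library-independent sketch. It does not use datasetGPT's actual API: `simulate_conversation`, `scripted`, and the agent callables are hypothetical names introduced here, and the agents are scripted stand-ins for real LLM calls. The loop alternates between two agents and stops either after a fixed number of turns or as soon as a reply contains the end phrase.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for an LLM-backed agent: it receives the dialogue
# so far and returns its next utterance.
Agent = Callable[[List[str]], str]


def simulate_conversation(
    agent1: Agent,
    agent2: Agent,
    max_turns: int,
    end_phrase: Optional[str] = None,
) -> List[Dict[str, str]]:
    """Alternate between two agents until a stop condition is met.

    Stops after `max_turns` utterances (fixed-length interruption) or, if
    `end_phrase` is given, right after the first reply that contains it.
    """
    history: List[str] = []
    utterances: List[Dict[str, str]] = []
    for turn in range(max_turns):
        name, agent = ("agent1", agent1) if turn % 2 == 0 else ("agent2", agent2)
        reply = agent(history)
        history.append(reply)
        utterances.append({"agent": name, "text": reply})
        if end_phrase is not None and end_phrase in reply:
            break
    return utterances


def scripted(lines: List[str]) -> Agent:
    """Build a fake agent that replays fixed lines (assumption: no real LLM)."""
    remaining = iter(lines)
    return lambda history: next(remaining)


if __name__ == "__main__":
    customer = scripted(["Hi, my order is late.", "Thanks. Goodbye!"])
    support = scripted(["Sorry about that, I will check.", "Anything else?"])
    for utterance in simulate_conversation(customer, support, max_turns=6, end_phrase="Goodbye"):
        print(f'{utterance["agent"]}: {utterance["text"]}')
```

In a real setup each agent would wrap a backend call with its own role prompt; the stop logic would stay the same.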

Quick Start & Requirements

  • Install via pip: pip install datasetGPT
  • Additional packages for specific backends: pip install openai cohere petals
  • Requires API keys for services like OpenAI and Cohere.
  • See CLI reference and Python examples in the README for detailed usage.

Highlighted Details

  • Supports generation of both raw text and multi-turn conversations.
  • Allows parallel inference across multiple LLM backends (OpenAI, Cohere, Petals).
  • Enables systematic exploration of parameter space (temperature, length, custom options) for dataset diversity.
  • Outputs datasets in structured JSON format, suitable for further processing.
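
As a rough illustration of the parameter-space exploration and JSON output listed above, the sketch below expands every combination of backend, temperature, and maximum length into one record per sample. It is not datasetGPT's implementation: `fake_backend`, `generate_dataset`, the backend names, and the record fields are all assumptions made for this example, and the stub returns a placeholder string instead of calling a real API.

```python
import itertools
import json
from typing import Dict, List


def fake_backend(backend: str, prompt: str, temperature: float, max_length: int) -> str:
    """Hypothetical stub for an LLM call; returns a placeholder completion.

    Truncation here is by characters, only to keep the example simple.
    """
    return f"[{backend} t={temperature}] {prompt}"[:max_length]


def generate_dataset(
    prompt: str,
    backends: List[str],
    temperatures: List[float],
    max_lengths: List[int],
    num_samples: int = 1,
) -> List[Dict[str, object]]:
    """Produce one record per (backend, temperature, max_length, sample)."""
    records: List[Dict[str, object]] = []
    grid = itertools.product(backends, temperatures, max_lengths)
    for backend, temperature, max_length in grid:
        for sample_id in range(num_samples):
            records.append(
                {
                    "backend": backend,
                    "temperature": temperature,
                    "max_length": max_length,
                    "sample_id": sample_id,
                    "prompt": prompt,
                    "text": fake_backend(backend, prompt, temperature, max_length),
                }
            )
    return records


if __name__ == "__main__":
    dataset = generate_dataset(
        prompt="Write a product review.",
        backends=["backend-a", "backend-b"],  # placeholder names, not real model IDs
        temperatures=[0.2, 0.9],
        max_lengths=[50],
        num_samples=2,
    )
    print(len(dataset), "records")
    print(json.dumps(dataset[0], indent=2))
```

Storing the generation parameters next to each text keeps the dataset self-describing, which helps when comparing outputs across backends or temperatures later.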

Maintenance & Community

The project is under active development, with contributions welcomed. Planned features include dataset transformations and support for more LLM backends.

Licensing & Compatibility

The tool itself is distributed freely without downstream use restrictions. However, users must adhere to the terms of service of the backend LLM APIs they utilize.

Limitations & Caveats

Planned features such as dataset transformations are not yet implemented. Which models can be used depends on the availability and API access offered by each LLM provider.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 6 stars in the last 90 days
