datasetGPT by radi-cho

CLI tool for generating textual/conversational datasets using LLMs

Created 2 years ago
299 stars

Top 89.0% on SourcePulse

Project Summary

datasetGPT is a command-line interface and Python library for generating textual and conversational datasets using Large Language Models (LLMs). It targets researchers and developers needing to create training data for AI content detectors, analyze LLM behavior, or automate LLM-driven text generation tasks. The tool simplifies the process of querying multiple LLM backends with varying parameters to produce diverse datasets.

How It Works

The tool uses a flexible backend system that supports APIs from OpenAI (GPT-3, GPT-4), Cohere, and Petals. Users define prompts and specify LLM parameters such as temperature and maximum length. For conversational datasets, it simulates dialogues between two configurable agents, with control over agent roles, conversation length, and interruption strategy (fixed length or end phrase). The CLI and Python library then generate datasets by systematically exploring the combinations of specified parameters and backends.
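As a concrete illustration, the Python entry point might be used roughly as follows. The class names (ConversationsGeneratorConfig, ConversationsGenerator, DatasetWriter) follow the project's documented API, but the exact keyword arguments shown here are a sketch and should be checked against the README.

    from datasetGPT import (
        ConversationsGenerator,
        ConversationsGeneratorConfig,
        DatasetWriter,
    )

    # Two agents plus a small parameter grid to explore.
    # Keyword names are assumptions based on the README's description.
    config = ConversationsGeneratorConfig(
        agent1="You are a support agent for a software company.",
        agent2="You are a customer with a billing question.",
        interruption="length",    # stop each dialogue after a fixed number of turns
        lengths=[4, 6],           # generate conversations of both lengths
        temperatures=[0.2, 0.7],  # ...at both temperatures
        num_samples=2,            # samples per parameter combination
    )

    generator = ConversationsGenerator(config)
    writer = DatasetWriter("conversations_dataset")  # output location (assumed argument)

    # Each yielded item is one generated conversation, saved as JSON.
    for conversation in generator:
        writer.save_intermediate_result(conversation)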

Quick Start & Requirements

  • Install via pip: pip install datasetGPT
  • Additional packages for specific backends: pip install openai cohere petals
  • Requires API keys for services like OpenAI and Cohere (see the environment-variable sketch after this list).
  • See CLI reference and Python examples in the README for detailed usage.
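
For example, assuming the OpenAI and Cohere backends read the providers' standard environment variables (OPENAI_API_KEY and COHERE_API_KEY; consult the README for the exact mechanism), keys could be supplied before generation like this:

    import os

    # Placeholder credentials; replace with your own keys.
    # The variable names are an assumption based on provider conventions.
    os.environ["OPENAI_API_KEY"] = "sk-..."
    os.environ["COHERE_API_KEY"] = "your-cohere-key"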

Highlighted Details

  • Supports generation of both raw text and multi-turn conversations.
  • Allows parallel inference across multiple LLM backends (OpenAI, Cohere, Petals).
  • Enables systematic exploration of the parameter space (temperature, length, custom options) for dataset diversity (see the text-generation sketch after this list).
  • Outputs datasets in structured JSON format, suitable for further processing.
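
To illustrate the parameter sweep for plain text generation, here is a hedged sketch mirroring the conversational API above; the TextsGeneratorConfig/TextsGenerator names and the "provider|model" backend strings are assumptions and should be verified against the README.

    from datasetGPT import TextsGenerator, TextsGeneratorConfig, DatasetWriter

    # Sweep temperatures and a custom prompt option across two backends.
    # Class names and the backend string format are assumptions.
    config = TextsGeneratorConfig(
        prompt="Write a short product description for {item}.",
        backends=["openai|text-davinci-003", "cohere|medium"],
        temperatures=[0.0, 0.5, 1.0],
        max_lengths=[100],
        options=[("item", "a mechanical keyboard"), ("item", "a standing desk")],
        num_samples=1,
    )

    writer = DatasetWriter("texts_dataset")
    for text in TextsGenerator(config):
        writer.save_intermediate_result(text)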

Maintenance & Community

Contributions are welcomed, and the README lists planned features such as dataset transformations and support for additional LLM backends; however, repository activity has been minimal for the past two years (see Health Check below).

Licensing & Compatibility

The tool itself is distributed freely without downstream use restrictions. However, users must adhere to the terms of service of the backend LLM APIs they utilize.

Limitations & Caveats

Planned features such as dataset transformations are not yet implemented, and the repository has seen little recent activity. Support for specific LLM models depends on the availability and API access provided by the respective providers.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Stars gained (30d): 0
