datasetGPT by radi-cho

CLI tool for generating textual/conversational datasets using LLMs

Created 2 years ago

298 stars

Top 89.2% on SourcePulse

Project Summary

datasetGPT is a command-line interface and Python library for generating textual and conversational datasets using Large Language Models (LLMs). It targets researchers and developers needing to create training data for AI content detectors, analyze LLM behavior, or automate LLM-driven text generation tasks. The tool simplifies the process of querying multiple LLM backends with varying parameters to produce diverse datasets.

How It Works

The tool leverages a flexible backend system, supporting APIs from OpenAI (GPT-3, GPT-4), Cohere, and Petals. Users define prompts and specify LLM parameters like temperature and max length. For conversational datasets, it simulates dialogues between two configurable agents, allowing for control over agent roles, conversation length, and interruption strategies (fixed length or end phrase). The CLI and Python library generate datasets by systematically exploring combinations of specified parameters and backends.
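
The two-agent dialogue loop described above can be illustrated with a minimal sketch. This is not datasetGPT's own code: `simulate_conversation`, the `Agent` type, and the stub agents are hypothetical names introduced for illustration, and the stubs return canned strings instead of calling an LLM backend. The sketch only shows how the two interruption strategies (a fixed number of turns, or an end phrase) might stop a conversation.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical type: an agent maps the conversation so far to its next utterance.
Turn = Tuple[str, str]
Agent = Callable[[List[Turn]], str]


def simulate_conversation(
    agent1: Agent,
    agent2: Agent,
    max_turns: Optional[int] = None,
    end_phrase: Optional[str] = None,
) -> List[Turn]:
    """Alternate between two agents until a stopping condition is met."""
    if max_turns is None and end_phrase is None:
        raise ValueError("Provide max_turns, end_phrase, or both.")

    history: List[Turn] = []
    speakers = [("agent1", agent1), ("agent2", agent2)]
    turn = 0
    # Fixed-length interruption: stop once max_turns utterances were produced.
    while max_turns is None or turn < max_turns:
        name, agent = speakers[turn % 2]
        utterance = agent(history)
        history.append((name, utterance))
        turn += 1
        # End-phrase interruption: stop as soon as an utterance contains the phrase.
        if end_phrase is not None and end_phrase in utterance:
            break
    return history


# Stub agents standing in for LLM-backed agents (assumption: no real API calls).
def asker(history: List[Turn]) -> str:
    return f"Question {len(history) // 2 + 1}?"


def responder(history: List[Turn]) -> str:
    return "Goodbye" if len(history) >= 5 else "Answer."


fixed_length = simulate_conversation(asker, responder, max_turns=4)
until_phrase = simulate_conversation(asker, responder, end_phrase="Goodbye")
```

In a real setup each stub would be replaced by a function that sends the history to a configured backend; the stopping logic would stay the same.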

Quick Start & Requirements

Install via pip: pip install datasetGPT
Additional packages for specific backends: pip install openai cohere petals
Requires API keys for services like OpenAI and Cohere.
See CLI reference and Python examples in the README for detailed usage.

Highlighted Details

Supports generation of both raw text and multi-turn conversations.
Allows parallel inference across multiple LLM backends (OpenAI, Cohere, Petals).
Enables systematic exploration of parameter space (temperature, length, custom options) for dataset diversity.
Outputs datasets in structured JSON format, suitable for further processing.
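
To make the parameter-space exploration and JSON output concrete, here is a small sketch under stated assumptions. `expand_grid`, `generate_dataset`, and `fake_generate` are hypothetical helpers, the backend names are placeholders, and the record layout is an assumed schema rather than datasetGPT's actual output format. It enumerates every combination of backend, temperature, and length, then serializes one record per combination.

```python
import itertools
import json
from typing import Any, Callable, Dict, List, Sequence


def expand_grid(options: Dict[str, Sequence[Any]]) -> List[Dict[str, Any]]:
    """Return one config dict per combination of the given option values."""
    keys = list(options)
    value_lists = [options[key] for key in keys]
    return [dict(zip(keys, values)) for values in itertools.product(*value_lists)]


def generate_dataset(
    prompt: str,
    options: Dict[str, Sequence[Any]],
    generate: Callable[[str, Dict[str, Any]], str],
) -> List[Dict[str, Any]]:
    """Call `generate` once per config and collect structured records."""
    records = []
    for config in expand_grid(options):
        records.append({"prompt": prompt, **config, "text": generate(prompt, config)})
    return records


# Placeholder for a real LLM call (assumption: no network access here).
def fake_generate(prompt: str, config: Dict[str, Any]) -> str:
    return f"[{config['backend']} T={config['temperature']}] {prompt}"


options = {
    "backend": ["backend-a", "backend-b"],  # placeholder backend names
    "temperature": [0.2, 0.8],
    "max_length": [50],
}
dataset = generate_dataset("Write a haiku.", options, fake_generate)
serialized = json.dumps(dataset, indent=2)  # JSON text ready to write to a file
```

Two backends, two temperatures, and one length yield four records. The grid grows multiplicatively with each added option, which matters for API cost when real backends are used.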

Maintenance & Community

The project is under active development, with contributions welcomed. Planned features include dataset transformations and support for more LLM backends.

Licensing & Compatibility

The tool itself is distributed freely without downstream use restrictions. However, users must adhere to the terms of service of the backend LLM APIs they utilize.

Limitations & Caveats

Planned features such as dataset transformations are not yet implemented. Which models can be used depends on the availability and API access offered by each LLM provider.
