CLI tool for generating textual/conversational datasets using LLMs
datasetGPT is a command-line interface and Python library for generating textual and conversational datasets using Large Language Models (LLMs). It targets researchers and developers needing to create training data for AI content detectors, analyze LLM behavior, or automate LLM-driven text generation tasks. The tool simplifies the process of querying multiple LLM backends with varying parameters to produce diverse datasets.
How It Works
The tool uses a pluggable backend system that supports APIs from OpenAI (GPT-3, GPT-4), Cohere, and Petals. Users define prompts and specify LLM parameters such as temperature and maximum length. For conversational datasets, it simulates a dialogue between two configurable agents, with control over agent roles, conversation length, and the interruption strategy (stop after a fixed number of turns, or stop when an end phrase appears). Both the CLI and the Python library generate datasets by running every combination of the specified parameters and backends.
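The two mechanisms described above, sweeping parameter combinations and stopping a two-agent dialogue, can be illustrated with a minimal sketch. This is not datasetGPT's actual code or API: the names `expand_grid`, `simulate_dialogue`, and `Agent`, and the stub agents that stand in for real LLM calls, are all hypothetical and chosen only for illustration.

```python
from itertools import product
from typing import Callable, Dict, Iterator, List, Optional

# Hypothetical agent type: receives the conversation so far and
# returns the next utterance. A real agent would call an LLM backend.
Agent = Callable[[List[str]], str]


def expand_grid(options: Dict[str, list]) -> Iterator[dict]:
    """Yield one configuration per combination of the listed option values."""
    keys = list(options)
    for values in product(*(options[k] for k in keys)):
        yield dict(zip(keys, values))


def simulate_dialogue(
    agent1: Agent,
    agent2: Agent,
    max_turns: int,
    end_phrase: Optional[str] = None,
) -> List[str]:
    """Alternate between two agents until a stopping rule fires.

    Stops after max_turns utterances (fixed length), or earlier if
    end_phrase is given and appears in an utterance (end phrase).
    """
    history: List[str] = []
    for turn in range(max_turns):
        speaker = agent1 if turn % 2 == 0 else agent2
        utterance = speaker(history)
        history.append(utterance)
        if end_phrase is not None and end_phrase in utterance:
            break
    return history


if __name__ == "__main__":
    # Assumed example values; not taken from the project's documentation.
    grid = {
        "backend": ["openai", "cohere"],
        "temperature": [0.2, 0.9],
        "max_length": [50],
    }
    for config in expand_grid(grid):
        print(config)

    # Stub agents standing in for LLM-backed ones.
    asker = lambda history: f"question {len(history)}"
    replier = lambda history: "Goodbye" if len(history) >= 3 else "answer"
    print(simulate_dialogue(asker, replier, max_turns=10, end_phrase="Goodbye"))
```

In this sketch, the grid produces one configuration per backend and temperature pair, and each configuration would drive one generation run. The dialogue loop treats the turn limit as an upper bound even when an end phrase is set, so a conversation cannot run indefinitely if the phrase never appears.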
Quick Start & Requirements
pip install datasetGPT
pip install openai cohere petals
Maintenance & Community
The project is under active development, with contributions welcomed. Planned features include dataset transformations and support for more LLM backends.
Licensing & Compatibility
The tool itself is distributed freely without downstream use restrictions. However, users must adhere to the terms of service of the backend LLM APIs they utilize.
Limitations & Caveats
Some planned features, such as dataset transformations, are not yet implemented. Support for a given LLM depends on its provider making the model available and granting API access.