easy-dataset  by ConardLi

Dataset tool for LLM fine-tuning

Created 11 months ago
13,421 stars

Top 3.7% on SourcePulse

GitHubView on GitHub
Project Summary

This project provides a specialized application for creating fine-tuning datasets for Large Language Models (LLMs). It targets users who need to transform domain-specific knowledge into structured training data for LLM APIs, offering an intuitive interface for document processing, question generation, and data export.

How It Works

Easy Dataset leverages intelligent document processing to split uploaded Markdown files into meaningful segments. It then uses LLM APIs to generate questions from these segments and subsequently generate comprehensive answers. The application supports flexible editing of all generated content and offers multiple export formats like Alpaca and ShareGPT in JSON or JSONL.

Quick Start & Requirements

  • Install: Download client (Windows, macOS, Linux) or use npm/pnpm with Node.js 18.x+.
  • Build with Docker: docker build -t easy-dataset . then docker run -d -p 1717:1717 -v {YOUR_LOCAL_DB_PATH}:/app/local-db --name easy-dataset easy-dataset.
  • Documentation: https://rncly5jvpme.feishu.cn/docx/IRuad1eUIo8qLoxxwAGcZvqJnDb

Highlighted Details

  • Intelligent document splitting and smart question generation from text segments.
  • Supports editing of questions, answers, and datasets at any stage.
  • Exports datasets in Alpaca and ShareGPT formats (JSON, JSONL).
  • Compatible with all OpenAI-format compatible LLM APIs.

Maintenance & Community

  • Actively maintained by ConardLi.
  • Community contributions are welcomed via pull requests.

Licensing & Compatibility

  • Licensed under the Apache License 2.0.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

The project relies on external LLM APIs for question and answer generation, meaning the quality and cost are dependent on the chosen LLM provider.

Health Check
Last Commit

18 hours ago

Responsiveness

1 day

Pull Requests (30d)
3
Issues (30d)
9
Star History
495 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen Chip Huyen(Author of "AI Engineering", "Designing Machine Learning Systems"), Daniel Han Daniel Han(Cofounder of Unsloth), and
1 more.

synthetic-data-kit by meta-llama

0.6%
2k
Synthetic data CLI tool for LLM fine-tuning
Created 11 months ago
Updated 4 months ago
Starred by Edward Sun Edward Sun(Research Scientist at Meta Superintelligence Lab), Malte Pietsch Malte Pietsch(Cofounder of deepset), and
5 more.

question_generation by patil-suraj

0%
1k
Question generation study using transformers
Created 5 years ago
Updated 1 year ago
Starred by Alex Atallah Alex Atallah(Cofounder of OpenRouter, OpenSea), Shyamal Anadkat Shyamal Anadkat(Research Scientist at OpenAI), and
1 more.

gpt-llm-trainer by mshumer

0.0%
4k
LLM fine-tuning pipeline
Created 2 years ago
Updated 9 months ago
Feedback? Help us improve.