easy-dataset  by ConardLi

Dataset tool for LLM fine-tuning

Created 6 months ago
10,781 stars

Top 4.7% on SourcePulse

GitHubView on GitHub
1 Expert Loves This Project
Project Summary

This project provides a specialized application for creating fine-tuning datasets for Large Language Models (LLMs). It targets users who need to transform domain-specific knowledge into structured training data for LLM APIs, offering an intuitive interface for document processing, question generation, and data export.

How It Works

Easy Dataset leverages intelligent document processing to split uploaded Markdown files into meaningful segments. It then uses LLM APIs to generate questions from these segments and subsequently generate comprehensive answers. The application supports flexible editing of all generated content and offers multiple export formats like Alpaca and ShareGPT in JSON or JSONL.

Quick Start & Requirements

  • Install: Download client (Windows, macOS, Linux) or use npm/pnpm with Node.js 18.x+.
  • Build with Docker: docker build -t easy-dataset . then docker run -d -p 1717:1717 -v {YOUR_LOCAL_DB_PATH}:/app/local-db --name easy-dataset easy-dataset.
  • Documentation: https://rncly5jvpme.feishu.cn/docx/IRuad1eUIo8qLoxxwAGcZvqJnDb

Highlighted Details

  • Intelligent document splitting and smart question generation from text segments.
  • Supports editing of questions, answers, and datasets at any stage.
  • Exports datasets in Alpaca and ShareGPT formats (JSON, JSONL).
  • Compatible with all OpenAI-format compatible LLM APIs.

Maintenance & Community

  • Actively maintained by ConardLi.
  • Community contributions are welcomed via pull requests.

Licensing & Compatibility

  • Licensed under the Apache License 2.0.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

The project relies on external LLM APIs for question and answer generation, meaning the quality and cost are dependent on the chosen LLM provider.

Health Check
Last Commit

1 week ago

Responsiveness

1 day

Pull Requests (30d)
4
Issues (30d)
46
Star History
600 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen Chip Huyen(Author of "AI Engineering", "Designing Machine Learning Systems"), Daniel Han Daniel Han(Cofounder of Unsloth), and
1 more.

synthetic-data-kit by meta-llama

0.8%
1k
Synthetic data CLI tool for LLM fine-tuning
Created 5 months ago
Updated 1 month ago
Feedback? Help us improve.