easy-dataset  by ConardLi

Dataset tool for LLM fine-tuning

created 5 months ago
9,820 stars

Top 5.2% on sourcepulse

GitHubView on GitHub
Project Summary

This project provides a specialized application for creating fine-tuning datasets for Large Language Models (LLMs). It targets users who need to transform domain-specific knowledge into structured training data for LLM APIs, offering an intuitive interface for document processing, question generation, and data export.

How It Works

Easy Dataset leverages intelligent document processing to split uploaded Markdown files into meaningful segments. It then uses LLM APIs to generate questions from these segments and subsequently generate comprehensive answers. The application supports flexible editing of all generated content and offers multiple export formats like Alpaca and ShareGPT in JSON or JSONL.

Quick Start & Requirements

  • Install: Download client (Windows, macOS, Linux) or use npm/pnpm with Node.js 18.x+.
  • Build with Docker: docker build -t easy-dataset . then docker run -d -p 1717:1717 -v {YOUR_LOCAL_DB_PATH}:/app/local-db --name easy-dataset easy-dataset.
  • Documentation: https://rncly5jvpme.feishu.cn/docx/IRuad1eUIo8qLoxxwAGcZvqJnDb

Highlighted Details

  • Intelligent document splitting and smart question generation from text segments.
  • Supports editing of questions, answers, and datasets at any stage.
  • Exports datasets in Alpaca and ShareGPT formats (JSON, JSONL).
  • Compatible with all OpenAI-format compatible LLM APIs.

Maintenance & Community

  • Actively maintained by ConardLi.
  • Community contributions are welcomed via pull requests.

Licensing & Compatibility

  • Licensed under the Apache License 2.0.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

The project relies on external LLM APIs for question and answer generation, meaning the quality and cost are dependent on the chosen LLM provider.

Health Check
Last commit

2 weeks ago

Responsiveness

1 day

Pull Requests (30d)
11
Issues (30d)
47
Star History
3,787 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen Chip Huyen(Author of AI Engineering, Designing Machine Learning Systems), Daniel Han Daniel Han(Cofounder of Unsloth), and
1 more.

synthetic-data-kit by meta-llama

1.6%
1k
Synthetic data CLI tool for LLM fine-tuning
created 4 months ago
updated 1 week ago
Starred by Tobi Lutke Tobi Lutke(Cofounder of Shopify), Chip Huyen Chip Huyen(Author of AI Engineering, Designing Machine Learning Systems), and
4 more.

reor by reorproject

0.2%
8k
Local AI personal knowledge management app
created 1 year ago
updated 2 months ago
Starred by Chip Huyen Chip Huyen(Author of AI Engineering, Designing Machine Learning Systems) and Elie Bursztein Elie Bursztein(Cybersecurity Lead at Google DeepMind).

LightRAG by HKUDS

1.0%
19k
RAG framework for fast, simple retrieval-augmented generation
created 10 months ago
updated 21 hours ago
Starred by Andrej Karpathy Andrej Karpathy(Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Anton Troynikov Anton Troynikov(Cofounder of Chroma), and
20 more.

llama_index by run-llama

0.3%
43k
Data framework for building LLM-powered agents
created 2 years ago
updated 21 hours ago
Feedback? Help us improve.