easy-dataset by ConardLi

Dataset tool for LLM fine-tuning

Created 10 months ago

12,738 stars

Top 3.9% on SourcePulse

2 Experts Love This Project

chiphuyen

Author of "AI Engineering", "Designing Machine Learning Systems"

hiyouga

Author of LLaMA-Factory

Project Summary

This project provides a specialized application for creating fine-tuning datasets for Large Language Models (LLMs). It targets users who need to transform domain-specific knowledge into structured training data for LLM APIs, offering an intuitive interface for document processing, question generation, and data export.

How It Works

Easy Dataset leverages intelligent document processing to split uploaded Markdown files into meaningful segments. It then uses LLM APIs to generate questions from these segments and subsequently generate comprehensive answers. The application supports flexible editing of all generated content and offers multiple export formats like Alpaca and ShareGPT in JSON or JSONL.

Quick Start & Requirements

Install: Download client (Windows, macOS, Linux) or use npm/pnpm with Node.js 18.x+.
Build with Docker: docker build -t easy-dataset . then docker run -d -p 1717:1717 -v {YOUR_LOCAL_DB_PATH}:/app/local-db --name easy-dataset easy-dataset.
Documentation: https://rncly5jvpme.feishu.cn/docx/IRuad1eUIo8qLoxxwAGcZvqJnDb

Highlighted Details

Intelligent document splitting and smart question generation from text segments.
Supports editing of questions, answers, and datasets at any stage.
Exports datasets in Alpaca and ShareGPT formats (JSON, JSONL).
Compatible with all OpenAI-format compatible LLM APIs.

Maintenance & Community

Actively maintained by ConardLi.
Community contributions are welcomed via pull requests.

Licensing & Compatibility

Licensed under the Apache License 2.0.
Compatible with commercial use and closed-source linking.

Limitations & Caveats

The project relies on external LLM APIs for question and answer generation, meaning the quality and cost are dependent on the chosen LLM provider.

Health Check

Last Commit

1 day ago

Responsiveness

1 day

Pull Requests (30d)

2

Issues (30d)

23

Star History

362 stars in the last 30 days

Explore Similar Projects

Starred by

Jeff Hammerbacher

Jeff Hammerbacher(Cofounder of Cloudera).

GenRead by wyu97

Research paper code for context generation using LLMs

Created 3 years ago

Updated 2 years ago

ChatKBQA by LHRLAB

Research paper resources for knowledge base question answering

Created 2 years ago

Updated 3 months ago

MarkEverythingDown by RoffyS

Markdown conversion tool for LLMs

Created 10 months ago

Updated 8 months ago

gpt-oracle-trainer by mshumer

Tool for chatbot creation via documentation Q&A

Created 2 years ago

Updated 2 years ago

Starred by

Omar Sanseviero

Omar Sanseviero(DevRel at Google DeepMind),

Philipp Schmid

Philipp Schmid(DevRel at Google DeepMind), and

2 more.

textbook_quality by VikParuchuri

Synthetic data generator for LLM pretraining

Created 2 years ago

Updated 2 years ago

LangChain.js-LLM-Template by IroncladDev

LangChain template for custom AI model training

Created 2 years ago

Updated 2 years ago

Starred by

Krrish Dholakia

Krrish Dholakia(Cofounder of LiteLLM),

Omar Sanseviero

Omar Sanseviero(DevRel at Google DeepMind), and

6 more.

auto-evaluator by rlancemartin

Evaluation tool for LLM QA chains

Created 2 years ago

Updated 2 years ago

Starred by

Chip Huyen

Chip Huyen(Author of "AI Engineering", "Designing Machine Learning Systems"),

Daniel Han

Daniel Han(Cofounder of Unsloth), and

1 more.

synthetic-data-kit by meta-llama

Synthetic data CLI tool for LLM fine-tuning

Created 9 months ago

Updated 2 months ago

Starred by

Shizhe Diao

Shizhe Diao(Author of LMFlow; Research Scientist at NVIDIA),

Elvis Saravia

Elvis Saravia(Founder of DAIR.AI), and

1 more.

researchgpt by mukulpatnaik

LLM-based research assistant for conversing with PDFs

Created 2 years ago

Updated 2 years ago

Starred by

Edward Sun

Edward Sun(Research Scientist at Meta Superintelligence Lab),

Malte Pietsch

Malte Pietsch(Cofounder of deepset), and

5 more.

question_generation by patil-suraj

Question generation study using transformers

Created 5 years ago

Updated 1 year ago

Starred by

Alex Atallah

Alex Atallah(Cofounder of OpenRouter, OpenSea),

Shyamal Anadkat

Shyamal Anadkat(Research Scientist at OpenAI), and

1 more.

gpt-llm-trainer by mshumer

LLM fine-tuning pipeline

Created 2 years ago

Updated 8 months ago

Starred by

Elvis Saravia

Elvis Saravia(Founder of DAIR.AI).

rag-from-scratch by langchain-ai

RAG tutorial for expanding LLM knowledge via external data

Created 1 year ago

Updated 6 months ago

Feedback? Help us improve.