DatasetLoom by 599yongyang

Platform for intelligent multimodal LLM training data construction

Created 9 months ago

268 stars

Top 95.9% on SourcePulse

1 Expert Loves This Project

Yaowei Zheng

Author of LLaMA-Factory

Project Summary

DatasetLoom addresses the challenge of constructing high-quality multimodal datasets for training large language models. Aimed at AI engineers and researchers, the platform covers the full data pipeline, from document parsing and annotation to AI-driven evaluation and export. Its RAG capabilities ground generated SFT and DPO training data in source documents, making that data more professional, accurate, and traceable.

How It Works

Built on a monorepo architecture (Next.js, NestJS, Turborepo), DatasetLoom is designed to be decoupled, maintainable, and extensible. Its pipeline combines multimodal data ingestion (images, PDFs, etc.), intelligent document chunking, image annotation, and AI-powered automatic scoring and comparison of model outputs. A key differentiator is RAG-enhanced dialogue generation: responses are grounded in real documents retrieved from a Qdrant vector database, producing specialized, verifiable SFT/DPO datasets.
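
To make the chunk-then-retrieve idea concrete, here is a minimal sketch under stated assumptions. It is not DatasetLoom's code: the names chunkWords, termCounts, cosine, and topK are hypothetical, and a bag-of-words cosine similarity stands in for the embedding search that the platform delegates to Qdrant.

```typescript
// Hypothetical sketch: none of these names come from DatasetLoom's codebase.
interface Chunk {
  id: number;
  text: string;
}

// Split text into windows of `size` words; consecutive windows share `overlap` words.
function chunkWords(text: string, size: number, overlap: number): Chunk[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("require size > 0 and 0 <= overlap < size");
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: Chunk[] = [];
  const step = size - overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push({ id: chunks.length, text: words.slice(start, start + size).join(" ") });
    if (start + size >= words.length) break; // last window reached the end
  }
  return chunks;
}

// Toy "embedding": lowercase term counts (an assumption; a real setup would use an embedding model).
function termCounts(text: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const token of text.toLowerCase().split(/[^a-z0-9]+/)) {
    if (token.length === 0) continue;
    counts.set(token, (counts.get(token) ?? 0) + 1);
  }
  return counts;
}

// Cosine similarity between two sparse term-count vectors.
function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((count, term) => {
    dot += count * (b.get(term) ?? 0);
    normA += count * count;
  });
  b.forEach((count) => {
    normB += count * count;
  });
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Return the k chunks most similar to the query, best first.
function topK(query: string, chunks: Chunk[], k: number): Chunk[] {
  const q = termCounts(query);
  return chunks
    .map((chunk) => ({ chunk, score: cosine(q, termCounts(chunk.text)) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((entry) => entry.chunk);
}
```

In a RAG setup, the chunks returned by a lookup like topK would be placed in the generation prompt, and keeping their ids alongside the generated answer is one way to make each training example traceable to its source text.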

Quick Start & Requirements

Installation:
1. Clone the repository.
2. Install dependencies: pnpm install
3. Configure .env.
4. Initialize the database: pnpm --filter=api prisma:migrate
5. Start the app: pnpm run dev (development) or docker compose up -d --build (production).
Prerequisites: Node.js with pnpm, PostgreSQL, and Qdrant (included via Docker Compose).
Local endpoints (once the services are running): API Docs: http://localhost:3088/api-docs, Qdrant UI: http://localhost:6333/dashboard.

Highlighted Details

Supports diverse multimodal data formats (images, PDF, Word, TXT).
Features an AI auto-scoring system for evaluating model output quality.
Integrates RAG with Qdrant for generating grounded, professional dialogue datasets.
Enables multi-user collaboration with role-based permissions.
Exports datasets in JSON, CSV, and HuggingFace Dataset formats.
Accommodates various embedding models (OpenAI, Hugging Face, local).
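
As an illustration of how auto-scored outputs could become exportable DPO data, the sketch below picks the highest- and lowest-scored candidates as a preference pair and serializes the result to JSON and CSV. Everything here is an assumption for illustration: the prompt/chosen/rejected field names follow a common DPO dataset convention rather than DatasetLoom's documented export schema, and ScoredOutput, toDpoRecord, toJson, and toCsv are hypothetical names.

```typescript
// Hypothetical shapes: field names are assumptions, not DatasetLoom's export schema.
interface ScoredOutput {
  text: string;
  score: number; // e.g. a score assigned by an AI judge; the scale is arbitrary here
}

interface DpoRecord {
  prompt: string;
  chosen: string;
  rejected: string;
}

// Build one preference pair from scored candidates, or null if no preference exists.
function toDpoRecord(prompt: string, outputs: ScoredOutput[]): DpoRecord | null {
  const sorted = outputs.slice().sort((a, b) => b.score - a.score);
  const best = sorted[0];
  const worst = sorted[sorted.length - 1];
  // Fewer than two candidates, or a tie, gives no usable preference signal.
  if (!best || !worst || sorted.length < 2 || best.score === worst.score) {
    return null;
  }
  return { prompt, chosen: best.text, rejected: worst.text };
}

// Quote a CSV field only when it contains a quote, comma, or line break.
function csvField(value: string): string {
  return /[",\n\r]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

function toCsv(records: DpoRecord[]): string {
  const header = "prompt,chosen,rejected";
  const rows = records.map((r) =>
    [r.prompt, r.chosen, r.rejected].map(csvField).join(",")
  );
  return [header, ...rows].join("\n");
}

function toJson(records: DpoRecord[]): string {
  return JSON.stringify(records, null, 2);
}
```

Returning null for ties is a deliberate choice in this sketch: a pair whose two sides scored equally carries no preference, so it is safer to drop it than to export an arbitrary ordering.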

Maintenance & Community

The project provides a contribution guide and welcomes Issues and Pull Requests via its GitHub repository. Community interaction appears centered on GitHub, with no dedicated Discord or Slack channels mentioned.

Licensing & Compatibility

DatasetLoom is released under the MIT License, which permits use, modification, distribution, and commercial application, provided the copyright and license notice is retained.

Limitations & Caveats

The platform requires familiarity with a modern web stack (Next.js, NestJS) and containerization (Docker). Setting up the development environment involves managing multiple services (PostgreSQL, Qdrant). The RAG and AI scoring components may need substantial configuration and, depending on the chosen models, may incur API costs or require significant hardware resources.

Health Check

Last Commit: 2 months ago

Responsiveness: Inactive

Star History

7 stars in the last 30 days