DatasetLoom by 599yongyang

Platform for intelligent multimodal LLM training data construction

Created 8 months ago
255 stars

Top 98.8% on SourcePulse

Summary

DatasetLoom addresses the challenge of constructing high-quality multimodal datasets for training large language models. Aimed at AI engineers and researchers, it covers the full data pipeline, from document parsing and annotation to AI-driven evaluation and export. The goal is more professional, accurate, and traceable SFT (supervised fine-tuning) and DPO (direct preference optimization) training data, with retrieval-augmented generation (RAG) used to ground generated content in source documents.

How It Works

DatasetLoom is organized as a monorepo (Next.js, NestJS, Turborepo), which keeps the frontend and API decoupled, maintainable, and extensible. Its pipeline combines multimodal data ingestion (images, PDFs, etc.), intelligent document chunking, image annotation, and AI-powered automatic scoring and comparison of model outputs. A key differentiator is RAG-enhanced dialogue generation: the platform uses the Qdrant vector database to ground responses in real documents, producing specialized, verifiable SFT/DPO datasets.
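
To make the chunking step concrete, here is a minimal TypeScript sketch of fixed-size chunking with overlap. It is not taken from DatasetLoom's code: the Chunk shape, the chunkText function, and the size and overlap parameters are all hypothetical, and the platform's "intelligent" chunking may well split on document structure instead. The sketch keeps character offsets on each chunk, which is one simple way to keep generated data traceable to its source.

```typescript
// Hypothetical sketch: fixed-size chunking with overlap.
// Names and parameters are assumptions, not DatasetLoom's actual API.
interface Chunk {
  index: number; // position of the chunk in the sequence
  start: number; // inclusive character offset into the source text
  end: number;   // exclusive character offset into the source text
  text: string;
}

function chunkText(text: string, size: number, overlap: number): Chunk[] {
  if (size <= 0) {
    throw new RangeError("size must be positive");
  }
  if (overlap < 0 || overlap >= size) {
    throw new RangeError("overlap must be in the range [0, size)");
  }
  const chunks: Chunk[] = [];
  const step = size - overlap; // how far the window advances each time
  for (let start = 0; start < text.length; start += step) {
    const end = Math.min(start + size, text.length);
    chunks.push({ index: chunks.length, start, end, text: text.slice(start, end) });
    if (end === text.length) {
      break; // the last window already reached the end of the text
    }
  }
  return chunks;
}

// Windows of 4 characters that share 1 character with their neighbor.
const exampleChunks = chunkText("abcdefghij", 4, 1);
// exampleChunks.map((c) => c.text) is ["abcd", "defg", "ghij"]
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighboring chunk, which tends to help retrieval at the cost of some duplicated storage.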

Quick Start & Requirements

  • Installation: Clone the repository, install dependencies using pnpm install, configure .env, initialize the database with pnpm --filter=api prisma:migrate, and run pnpm run dev for development or docker compose up -d --build for production.
  • Prerequisites: Node.js with pnpm, PostgreSQL, and Qdrant (included via Docker Compose).
  • Local endpoints (available once the services are running): API docs at http://localhost:3088/api-docs, Qdrant UI at http://localhost:6333/dashboard.
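
Once the stack is up, a quick reachability check against the two local URLs above can confirm that the API and Qdrant are serving. The following TypeScript sketch is only an illustration: probeServices, FetchLike, and ProbeResult are hypothetical names, and the fetch implementation is passed in as a parameter so the function does not assume any particular runtime.

```typescript
// Hypothetical sketch: probe the local endpoints listed in the Quick Start.
// The helper names are assumptions; only the two URLs come from the project summary.
type FetchLike = (url: string) => Promise<{ ok: boolean; status: number }>;

interface ProbeResult {
  name: string;
  url: string;
  reachable: boolean;
  status: number | null; // null when the request itself failed
}

const SERVICES: { name: string; url: string }[] = [
  { name: "API docs", url: "http://localhost:3088/api-docs" },
  { name: "Qdrant UI", url: "http://localhost:6333/dashboard" },
];

async function probeServices(fetchImpl: FetchLike): Promise<ProbeResult[]> {
  return Promise.all(
    SERVICES.map(async (svc): Promise<ProbeResult> => {
      try {
        const res = await fetchImpl(svc.url);
        return { name: svc.name, url: svc.url, reachable: res.ok, status: res.status };
      } catch (err) {
        // Connection refused, DNS failure, and similar errors end up here.
        return { name: svc.name, url: svc.url, reachable: false, status: null };
      }
    })
  );
}
```

On a runtime with a global fetch (for example Node.js 18 or later), calling probeServices(fetch) and printing the results should show both entries as reachable when the development or Docker Compose setup is running.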

Highlighted Details

  • Supports diverse multimodal data formats (images, PDF, Word, TXT).
  • Features an AI auto-scoring system for evaluating model output quality.
  • Integrates RAG with Qdrant for generating grounded, professional dialogue datasets.
  • Enables multi-user collaboration with role-based permissions.
  • Exports datasets in JSON, CSV, and HuggingFace Dataset formats.
  • Accommodates various embedding models (OpenAI, Hugging Face, local).
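
As an illustration of how auto-scored outputs could become preference data, the TypeScript sketch below turns a set of scored candidate answers into a DPO-style record and serializes it to CSV. Everything here is an assumption made for the example: the prompt/chosen/rejected field names follow a common convention for DPO data rather than DatasetLoom's confirmed export schema, and ScoredAnswer, toDpoRecord, and toCsv are hypothetical helpers.

```typescript
// Hypothetical sketch: build a DPO preference pair from scored answers.
// Field and function names are assumptions, not DatasetLoom's export schema.
interface ScoredAnswer {
  text: string;
  score: number; // e.g. assigned by an AI judge; higher is better
}

interface DpoRecord {
  prompt: string;
  chosen: string;
  rejected: string;
  sourceChunkIds: string[]; // which document chunks grounded the answers
}

function toDpoRecord(
  prompt: string,
  answers: ScoredAnswer[],
  sourceChunkIds: string[]
): DpoRecord | null {
  // Sort a copy so the caller's array is left untouched.
  const ranked = [...answers].sort((a, b) => b.score - a.score);
  const best = ranked[0];
  const worst = ranked[ranked.length - 1];
  // Need two answers with different scores to express a preference.
  if (!best || !worst || ranked.length < 2 || best.score === worst.score) {
    return null;
  }
  return { prompt, chosen: best.text, rejected: worst.text, sourceChunkIds };
}

// Quote a CSV field only when it contains a quote, comma, or line break.
function csvField(value: string): string {
  return /[",\n\r]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

function toCsv(records: DpoRecord[]): string {
  const header = ["prompt", "chosen", "rejected", "sourceChunkIds"].join(",");
  const rows = records.map((r) =>
    [r.prompt, r.chosen, r.rejected, r.sourceChunkIds.join(";")].map(csvField).join(",")
  );
  return [header, ...rows].join("\n");
}

const examplePair = toDpoRecord(
  "What port serves the API docs?",
  [
    { text: "Port 3088.", score: 0.9 },
    { text: "Port 80, probably.", score: 0.2 },
  ],
  ["doc-1#0"]
);
const exampleCsv = examplePair ? toCsv([examplePair]) : "";
const exampleJson = JSON.stringify(examplePair ? [examplePair] : []);
```

Carrying sourceChunkIds through to the exported record is one way to preserve the traceability the platform emphasizes; a tie in scores yields no record because it carries no preference signal.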

Maintenance & Community

The project provides a contribution guide and welcomes Issues and Pull Requests via its GitHub repository. Community interaction appears centered on GitHub, with no dedicated Discord or Slack channels mentioned.

Licensing & Compatibility

DatasetLoom is released under the MIT License, which permits use, modification, distribution, and commercial application, provided the copyright and license notice are retained.

Limitations & Caveats

The platform requires familiarity with a modern web stack (Next.js, NestJS) and containerization (Docker). Setting up the development environment involves managing multiple services (PostgreSQL, Qdrant). The RAG and AI scoring components may need substantial configuration, and depending on the chosen models they can incur API costs or require significant hardware resources.

Health Check

  • Last commit: 4 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 2
  • Star history: 13 stars in the last 30 days
