599yongyang
Platform for intelligent multimodal LLM training data construction
Summary
DatasetLoom addresses the challenge of constructing high-quality multimodal datasets for training large language models. Aimed at AI engineers and researchers, the platform covers the full data pipeline, from document parsing and annotation to AI-driven evaluation and export. The goal is more accurate and traceable training data for supervised fine-tuning (SFT) and Direct Preference Optimization (DPO), with RAG used to ground generated content in source documents.
How It Works
Built on a monorepo architecture (Next.js, NestJS, Turborepo), DatasetLoom aims for a decoupled, maintainable, and extensible codebase. Its core approach combines multimodal data ingestion (images, PDFs, etc.), intelligent document chunking, image annotation, and AI-powered automatic scoring and comparison of model outputs. A key differentiator is its RAG-enhanced dialogue generation, which uses a Qdrant vector database to ground responses in real documents and so produce specialized, verifiable SFT/DPO datasets.
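To make the scoring-and-comparison step concrete, here is a minimal sketch of how scored model outputs could be turned into a DPO preference pair that stays traceable to its source chunks. It is not taken from the DatasetLoom codebase: the type names, field names (`prompt`/`chosen`/`rejected` follow a common DPO convention), and both helper functions are assumptions made for illustration, and the project's real export schema may differ.

```typescript
// Hypothetical shapes for illustration only; not DatasetLoom's actual schema.
interface SourceRef {
  documentId: string; // document the grounding chunk came from
  chunkIndex: number; // position of the chunk within that document
}

interface ScoredAnswer {
  text: string;
  score: number; // e.g. a score assigned by an AI judge; higher is better
}

interface DpoRecord {
  prompt: string;
  chosen: string;
  rejected: string;
  sources: SourceRef[]; // kept so each pair can be traced back to documents
}

// Build one preference pair from the highest- and lowest-scored answers.
// Returns null when no meaningful preference exists (fewer than two answers,
// or the best and worst answers have the same score).
function toDpoRecord(
  prompt: string,
  answers: ScoredAnswer[],
  sources: SourceRef[],
): DpoRecord | null {
  const sorted = [...answers].sort((a, b) => b.score - a.score);
  const best = sorted[0];
  const worst = sorted[sorted.length - 1];
  if (sorted.length < 2 || !best || !worst || best.score === worst.score) {
    return null;
  }
  return { prompt, chosen: best.text, rejected: worst.text, sources };
}

// Serialize records as JSON Lines, one record per line.
function toJsonl(records: object[]): string {
  return records.map((r) => JSON.stringify(r)).join("\n");
}

// Usage example with made-up data.
const example = toDpoRecord(
  "What does the warranty cover?",
  [
    { text: "It covers parts for two years.", score: 0.9 },
    { text: "It covers everything forever.", score: 0.2 },
  ],
  [{ documentId: "doc-1", chunkIndex: 3 }],
);
if (example) {
  console.log(toJsonl([example]));
}
```

Keeping the `sources` array on every record is one simple way to preserve traceability: a reviewer can follow any pair back to the chunks that grounded it.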
Quick Start & Requirements
Install and run: pnpm install, configure .env, initialize the database with pnpm --filter=api prisma:migrate, and run pnpm run dev for development or docker compose up -d --build for production.
Requirements: pnpm, PostgreSQL, and Qdrant (included via Docker Compose).
Links: API docs: http://localhost:3088/api-docs, Qdrant UI: http://localhost:6333/dashboard.
Highlighted Details
Maintenance & Community
The project provides a contribution guide and welcomes Issues and Pull Requests via its GitHub repository. Community interaction appears centered on GitHub, with no dedicated Discord or Slack channels mentioned.
Licensing & Compatibility
DatasetLoom is released under the MIT License, which permits use, modification, distribution, and commercial application, provided the copyright and license notice are retained.
Limitations & Caveats
The platform requires familiarity with a modern web stack (Next.js, NestJS) and containerization (Docker). Setting up the development environment involves managing multiple services (PostgreSQL, Qdrant). While feature-rich, the RAG and AI scoring components may necessitate substantial configuration effort and potentially incur costs or require significant hardware resources depending on the chosen models.