dingo by MigoXLab

AI data quality evaluation tool

Created 8 months ago
464 stars

Top 65.4% on SourcePulse

View on GitHub
Project Summary

Dingo is a comprehensive AI data quality evaluation tool for LLM and multimodal datasets, aimed at researchers and engineers. It automates the detection of data quality issues through a flexible system of built-in and custom rules plus model-based assessments, improving dataset reliability across the pre-training, fine-tuning, and evaluation stages.

How It Works

Dingo employs a hybrid approach that combines rule-based checks with LLM-driven evaluations. Rule-based checks apply over 20 heuristic rules for common issues such as completeness and format, while LLM evaluations use models (OpenAI, Kimi, or locally hosted) with customizable prompts to assess quality dimensions such as helpfulness, harmlessness, and relevance. This dual approach supports both automated, deterministic checks and nuanced, context-aware quality assessments.
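
To make the split concrete, the sketch below illustrates the idea of pairing a deterministic rule with an LLM-backed check. The function and class names are invented for this illustration only; they are not dingo's API.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Issue:
    rule: str
    detail: str

def rule_completeness(text: str) -> List[Issue]:
    # Deterministic heuristic: flag empty or obviously truncated records.
    issues = []
    if not text.strip():
        issues.append(Issue("completeness", "record is empty"))
    elif text.rstrip().endswith((",", "...")):
        issues.append(Issue("completeness", "record looks truncated"))
    return issues

def llm_relevance(text: str, call_llm: Callable[[str], str]) -> List[Issue]:
    # Context-aware check: ask an LLM (OpenAI, Kimi, or a local model)
    # to grade the record with a customizable prompt, then parse its verdict.
    prompt = f"Reply 'pass' if the text is relevant and helpful, otherwise explain:\n{text}"
    verdict = call_llm(prompt)
    if verdict.strip().lower().startswith("pass"):
        return []
    return [Issue("relevance", verdict.strip())]

def evaluate(record: str, call_llm: Callable[[str], str]) -> List[Issue]:
    # Cheap deterministic rules run first; LLM checks add the nuanced judgments.
    return rule_completeness(record) + llm_relevance(record, call_llm)
```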

Quick Start & Requirements

  • Install via pip: pip install dingo-python (a minimal invocation sketch follows this list)
  • Requires Python 3.7+
  • LLM-based evaluations require a provider API key and access to the chosen model (e.g., an OpenAI API key).
  • Local demo and Colab notebooks are available for quick testing.
  • Official documentation and demos are linked within the README.
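
As referenced above, a minimal invocation might look like the sketch below. It follows the usage pattern shown in the project README, but the module paths and argument names here are written from memory as assumptions and should be verified against the current documentation.

```python
from dingo.io import InputArgs        # module paths are assumptions; check the README
from dingo.exec import Executor

input_args = InputArgs(
    eval_group="sft",                 # pre-configured rule group (e.g., SFT, RAG)
    input_path="tatsu-lab/alpaca",    # local file or Hugging Face dataset id
    data_format="plaintext",
    save_data=True,                   # write the evaluation report to disk
)

executor = Executor.exec_map["local"](input_args)  # "local" runs on this machine
summary = executor.execute()
print(summary)
```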

Highlighted Details

  • Supports text and image data modalities across various dataset types (pre-training, SFT, evaluation).
  • Offers pre-configured rule groups for common use cases like SFT, RAG, and hallucination detection.
  • Integrates with platforms like OpenCompass and includes an experimental Model Context Protocol (MCP) server for tools like Cursor.
  • Provides detailed evaluation reports, including summary statistics and per-item issue tracking; a hypothetical report shape is sketched after this list.
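
As noted in the last bullet, reports pair dataset-level statistics with per-item records. The shape below is hypothetical (the field names are illustrative, not dingo's actual schema) but shows the kind of information such a report carries.

```python
import json

# Illustrative summary only; the real report schema may use different field names.
summary = {
    "total": 1000,                      # records evaluated
    "num_good": 920,
    "num_bad": 80,
    "score": 92.0,                      # overall quality score, percent
    "type_ratio": {                     # share of records flagged per quality dimension
        "QUALITY_BAD_COMPLETENESS": 0.03,
        "QUALITY_BAD_RELEVANCE": 0.05,
    },
}

# Per-item tracking: each flagged record carries the checks it failed and why.
bad_item = {
    "data_id": "417",
    "content": "The answer is",
    "type_list": ["QUALITY_BAD_COMPLETENESS"],
    "reason_list": ["record looks truncated"],
}

print(json.dumps(summary, indent=2))
print(json.dumps(bad_item, indent=2))
```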

Maintenance & Community

The project is actively maintained by the Dingo Contributors. Community engagement is encouraged via Discord and WeChat. Contribution guidelines are provided.

Licensing & Compatibility

Licensed under Apache 2.0. Dependencies like fasttext use the MIT License, which is compatible. This license permits commercial use and integration with closed-source projects.

Limitations & Caveats

The current rules and model prompts focus on common data quality problems; specialized needs may require developing custom rules. Future plans include expanding to audio/video modalities and small-model-based evaluation.
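
For such specialized needs, a custom rule can be registered alongside the built-ins. The sketch below follows the registration pattern described in the project's documentation, but the decorator, module paths, and result fields are written from memory as assumptions and should be checked before use.

```python
from dingo.model import Model                      # assumed module paths; verify in the docs
from dingo.model.modelres import ModelRes
from dingo.model.rule.base import BaseRule
from dingo.io import MetaData

@Model.rule_register("QUALITY_BAD_RELEVANCE", ["default"])   # assumed registration decorator
class RuleNoPlaceholderText(BaseRule):
    """Flag records that still contain template placeholders."""

    @classmethod
    def eval(cls, input_data: MetaData) -> ModelRes:
        res = ModelRes()
        text = input_data.content.lower()
        if "todo:" in text or "lorem ipsum" in text:
            res.error_status = True                # result field names are assumptions
            res.type = "QUALITY_BAD_RELEVANCE"
            res.name = cls.__name__
            res.reason = ["placeholder text found"]
        return res
```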

Health Check

  • Last Commit: 5 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 33
  • Issues (30d): 8
  • Star History: 118 stars in the last 30 days

Explore Similar Projects

Starred by Morgan Funtowicz (Head of ML Optimizations at Hugging Face), Luis Capelo (Cofounder of Lightning AI), and 7 more.

lighteval by huggingface
2.6% · 2k stars
LLM evaluation toolkit for multiple backends
Created 1 year ago · Updated 1 day ago
Starred by Omar Sanseviero (DevRel at Google DeepMind), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 7 more.

argilla by argilla-io
0.2% · 5k stars
Collaboration tool for building high-quality AI datasets
Created 4 years ago · Updated 3 days ago
Starred by Han Wang (Cofounder of Mintlify), John Resig (Author of jQuery; Chief Software Architect at Khan Academy), and 6 more.

evidently by evidentlyai
0.3% · 7k stars
Open-source framework for ML/LLM observability
Created 4 years ago · Updated 19 hours ago
Starred by Luis Capelo (Cofounder of Lightning AI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 6 more.

opik by comet-ml
1.7% · 14k stars
Open-source LLM evaluation framework for RAG, agents, and more
Created 2 years ago · Updated 17 hours ago