dingo by MigoXLab

AI data quality evaluation tool

Created 8 months ago
464 stars

Top 65.4% on SourcePulse

View on GitHub
Project Summary

Dingo is a comprehensive AI data quality evaluation tool for LLM and multimodal datasets, aimed at researchers and engineers. It automates the detection of data quality issues through a flexible system of built-in and custom rules plus model-based assessments, improving dataset reliability across the pre-training, fine-tuning, and evaluation stages.

How It Works

Dingo employs a hybrid approach that combines rule-based checks with LLM-driven evaluations. Rule-based checks apply over 20 heuristic rules for common issues such as completeness and format, while LLM evaluations use models (OpenAI, Kimi, or locally hosted) with customizable prompts to assess quality dimensions such as helpfulness, harmlessness, and relevance. This dual approach supports both automated, deterministic checks and nuanced, context-aware quality assessments.
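
To make the split concrete, the sketch below illustrates the idea of pairing a deterministic rule with an LLM-backed check. The function and class names are invented for this illustration only; they are not dingo's API.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Issue:
    rule: str
    detail: str

def rule_completeness(text: str) -> List[Issue]:
    # Deterministic heuristic: flag empty or obviously truncated records.
    issues = []
    if not text.strip():
        issues.append(Issue("completeness", "record is empty"))
    elif text.rstrip().endswith((",", "...")):
        issues.append(Issue("completeness", "record looks truncated"))
    return issues

def llm_relevance(text: str, call_llm: Callable[[str], str]) -> List[Issue]:
    # Context-aware check: ask an LLM (OpenAI, Kimi, or a local model)
    # to grade the record with a customizable prompt, then parse its verdict.
    prompt = f"Reply 'pass' if the text is relevant and helpful, otherwise explain:\n{text}"
    verdict = call_llm(prompt)
    if verdict.strip().lower().startswith("pass"):
        return []
    return [Issue("relevance", verdict.strip())]

def evaluate(record: str, call_llm: Callable[[str], str]) -> List[Issue]:
    # Cheap deterministic rules run first; LLM checks add the nuanced judgments.
    return rule_completeness(record) + llm_relevance(record, call_llm)
```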

Quick Start & Requirements

  • Install via pip: pip install dingo-python (a minimal invocation sketch follows this list)
  • Requires Python 3.7+
  • LLM-based evaluations require a provider API key and access to the chosen model (e.g., an OpenAI API key).
  • Local demo and Colab notebooks are available for quick testing.
  • Official documentation and demos are linked within the README.
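
As referenced above, a minimal invocation might look like the sketch below. It follows the usage pattern shown in the project README, but the module paths and argument names here are written from memory as assumptions and should be verified against the current documentation.

```python
from dingo.io import InputArgs        # module paths are assumptions; check the README
from dingo.exec import Executor

input_args = InputArgs(
    eval_group="sft",                 # pre-configured rule group (e.g., SFT, RAG)
    input_path="tatsu-lab/alpaca",    # local file or Hugging Face dataset id
    data_format="plaintext",
    save_data=True,                   # write the evaluation report to disk
)

executor = Executor.exec_map["local"](input_args)  # "local" runs on this machine
summary = executor.execute()
print(summary)
```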

Highlighted Details

  • Supports text and image data modalities across various dataset types (pre-training, SFT, evaluation).
  • Offers pre-configured rule groups for common use cases like SFT, RAG, and hallucination detection.
  • Integrates with platforms like OpenCompass and includes an experimental Model Context Protocol (MCP) server for tools like Cursor.
  • Provides detailed evaluation reports, including summary statistics and per-item issue tracking; a hypothetical report shape is sketched after this list.
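
As noted in the last bullet, reports pair dataset-level statistics with per-item records. The shape below is hypothetical (the field names are illustrative, not dingo's actual schema) but shows the kind of information such a report carries.

```python
import json

# Illustrative summary only; the real report schema may use different field names.
summary = {
    "total": 1000,                      # records evaluated
    "num_good": 920,
    "num_bad": 80,
    "score": 92.0,                      # overall quality score, percent
    "type_ratio": {                     # share of records flagged per quality dimension
        "QUALITY_BAD_COMPLETENESS": 0.03,
        "QUALITY_BAD_RELEVANCE": 0.05,
    },
}

# Per-item tracking: each flagged record carries the checks it failed and why.
bad_item = {
    "data_id": "417",
    "content": "The answer is",
    "type_list": ["QUALITY_BAD_COMPLETENESS"],
    "reason_list": ["record looks truncated"],
}

print(json.dumps(summary, indent=2))
print(json.dumps(bad_item, indent=2))
```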

Maintenance & Community

The project is actively maintained by the Dingo Contributors. Community engagement is encouraged via Discord and WeChat. Contribution guidelines are provided.

Licensing & Compatibility

Licensed under Apache 2.0. Dependencies like fasttext use the MIT License, which is compatible. This license permits commercial use and integration with closed-source projects.

Limitations & Caveats

The current rules and model prompts focus on common data quality problems; specialized needs may require developing custom rules. Future plans include expanding to audio/video modalities and small-model-based evaluation.
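
For such specialized needs, a custom rule can be registered alongside the built-ins. The sketch below follows the registration pattern described in the project's documentation, but the decorator, module paths, and result fields are written from memory as assumptions and should be checked before use.

```python
from dingo.model import Model                      # assumed module paths; verify in the docs
from dingo.model.modelres import ModelRes
from dingo.model.rule.base import BaseRule
from dingo.io import MetaData

@Model.rule_register("QUALITY_BAD_RELEVANCE", ["default"])   # assumed registration decorator
class RuleNoPlaceholderText(BaseRule):
    """Flag records that still contain template placeholders."""

    @classmethod
    def eval(cls, input_data: MetaData) -> ModelRes:
        res = ModelRes()
        text = input_data.content.lower()
        if "todo:" in text or "lorem ipsum" in text:
            res.error_status = True                # result field names are assumptions
            res.type = "QUALITY_BAD_RELEVANCE"
            res.name = cls.__name__
            res.reason = ["placeholder text found"]
        return res
```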

Health Check

  • Last Commit: 5 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 33
  • Issues (30d): 8
  • Star History: 118 stars in the last 30 days

Explore Similar Projects

Starred by Morgan Funtowicz (Head of ML Optimizations at Hugging Face), Luis Capelo (Cofounder of Lightning AI), and 7 more.

lighteval by huggingface
2.6% · 2k stars
LLM evaluation toolkit for multiple backends
Created 1 year ago · Updated 1 day ago
Starred by Omar Sanseviero (DevRel at Google DeepMind), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 7 more.

argilla by argilla-io
0.2% · 5k stars
Collaboration tool for building high-quality AI datasets
Created 4 years ago · Updated 3 days ago
Starred by Han Wang (Cofounder of Mintlify), John Resig (Author of jQuery; Chief Software Architect at Khan Academy), and 6 more.

evidently by evidentlyai
0.3% · 7k stars
Open-source framework for ML/LLM observability
Created 4 years ago · Updated 19 hours ago
Starred by Luis Capelo (Cofounder of Lightning AI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 6 more.

opik by comet-ml
1.7% · 14k stars
Open-source LLM evaluation framework for RAG, agents, and more
Created 2 years ago · Updated 17 hours ago