data-is-better-together  by huggingface

Datasets for community-driven AI model training and evaluation

created 1 year ago
260 stars

Top 98.2% on sourcepulse

Project Summary

This initiative lets the open-source community collaboratively build datasets for training and evaluating AI models. It is aimed at researchers and developers who want high-quality, community-vetted datasets, and it offers curated datasets along with guides and tools for data creation and annotation.

How It Works

The project has two main components: community efforts and cookbook efforts. Community efforts are hands-on projects guided by Hugging Face, such as prompt ranking and image preference annotation, which rely on community participation to create large-scale datasets. Cookbook efforts are standalone guides and tools that let users independently build domain-specific datasets or preference datasets for alignment methods such as DPO, ORPO, and KTO.
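As an illustration of how a prompt-ranking effort might turn many individual ratings into a ranked dataset, the sketch below averages ratings per prompt and drops prompts with too few votes. The record shape (`prompt`, `annotator`, `rating`), the `min_votes` threshold, and the function name are assumptions made for this example, not the project's actual schema or pipeline.

```python
from collections import defaultdict
from statistics import mean


def aggregate_ratings(annotations, min_votes=2):
    """Average the ratings for each prompt and rank prompts by that average.

    `annotations` is a list of dicts with hypothetical keys
    "prompt", "annotator", and "rating". Prompts with fewer than
    `min_votes` ratings are dropped as insufficiently vetted.
    """
    ratings_by_prompt = defaultdict(list)
    for record in annotations:
        ratings_by_prompt[record["prompt"]].append(record["rating"])

    ranked = [
        {"prompt": prompt, "avg_rating": mean(ratings), "num_votes": len(ratings)}
        for prompt, ratings in ratings_by_prompt.items()
        if len(ratings) >= min_votes
    ]
    # Highest-rated prompts first.
    ranked.sort(key=lambda row: row["avg_rating"], reverse=True)
    return ranked


# Toy annotations; real community data would come from an annotation tool.
annotations = [
    {"prompt": "Explain recursion.", "annotator": "a", "rating": 5},
    {"prompt": "Explain recursion.", "annotator": "b", "rating": 4},
    {"prompt": "hi", "annotator": "a", "rating": 1},
    {"prompt": "hi", "annotator": "c", "rating": 2},
    {"prompt": "Only rated once", "annotator": "b", "rating": 3},
]
ranked = aggregate_ratings(annotations)
```

Requiring more than one vote per prompt is one simple way to reduce the influence of a single annotator; a real effort may use a different agreement measure.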

Quick Start & Requirements

  • Community Datasets: Available on the Hugging Face Hub (e.g., data-is-better-together/10k_prompts_ranked, data-is-better-together/open-image-preferences-v1-binarized).
  • Cookbook Tools: Instructions and guides are provided within individual project READMEs (e.g., cookbook-efforts/domain-specific-datasets/README.md).
  • Dependencies: Primarily relies on Hugging Face libraries and tools. Specific requirements vary per project.
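A minimal sketch of fetching one of the community datasets listed above, assuming the Hugging Face `datasets` library is installed (`pip install datasets`) and that the dataset exposes a `train` split (not confirmed here; check the dataset card). The helper names `hub_url` and `load_community_dataset` are hypothetical and exist only for this example.

```python
COMMUNITY_DATASETS = [
    "data-is-better-together/10k_prompts_ranked",
    "data-is-better-together/open-image-preferences-v1-binarized",
]


def hub_url(repo_id):
    """Return the Hugging Face Hub page for a dataset repository."""
    return f"https://huggingface.co/datasets/{repo_id}"


def load_community_dataset(repo_id, split="train"):
    """Download a dataset split from the Hub.

    The import is deferred so this module can be read and tested without
    `datasets` installed; the "train" split name is an assumption.
    """
    from datasets import load_dataset

    return load_dataset(repo_id, split=split)


if __name__ == "__main__":
    for repo_id in COMMUNITY_DATASETS:
        print(hub_url(repo_id))
    # Uncomment to download (requires network access):
    # ds = load_community_dataset(COMMUNITY_DATASETS[0])
    # print(ds.column_names)
```

Printing `column_names` before doing anything else is a cheap way to confirm the schema, since the fields differ between datasets.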

Highlighted Details

  • Released data-is-better-together/10k_prompts_ranked, built with input from over 385 contributors.
  • Developed multilingual benchmarks (MPEP) by translating high-quality prompts into Dutch, Russian, and Spanish.
  • Generated 10K text-to-image preference pairs for evaluating image generation models.
  • Cookbook efforts support the creation of domain-specific datasets and preference data (DPO, ORPO, KTO).
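To make the preference formats mentioned above concrete, here is a hedged sketch that converts scored completions into two record shapes: `prompt`/`chosen`/`rejected` pairs, the layout commonly used for DPO and ORPO training, and `prompt`/`completion`/`label` rows, the layout commonly used for KTO. The input shape (a list of `(text, score)` tuples), the `threshold` parameter, and both function names are assumptions for illustration, not the cookbook's own code.

```python
from itertools import combinations


def to_dpo_pairs(prompt, rated_completions):
    """Build pairwise preference records from (text, score) tuples.

    Every pair of completions with different scores yields one record
    in which the higher-scored text is "chosen". Ties are skipped
    because they carry no preference signal.
    """
    pairs = []
    for (text_a, score_a), (text_b, score_b) in combinations(rated_completions, 2):
        if score_a == score_b:
            continue
        if score_a > score_b:
            chosen, rejected = text_a, text_b
        else:
            chosen, rejected = text_b, text_a
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


def to_kto_rows(prompt, rated_completions, threshold):
    """Build unpaired records: each completion is labeled desirable
    (True) when its score is at or above `threshold`, else False."""
    return [
        {"prompt": prompt, "completion": text, "label": score >= threshold}
        for text, score in rated_completions
    ]


rated = [("Detailed answer", 5), ("Vague answer", 2), ("Also vague", 2)]
dpo_pairs = to_dpo_pairs("Explain recursion.", rated)
kto_rows = to_kto_rows("Explain recursion.", rated, threshold=4)
```

The pairwise form needs at least two completions per prompt with different scores, while the unpaired form works with a single rated completion, which is one reason a project might collect data for both.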

Maintenance & Community

  • A collaboration between Hugging Face, Argilla, and the open-source ML community.
  • Community participation is encouraged via Hugging Face Discord and project-specific READMEs.

Licensing & Compatibility

  • Datasets are typically released under permissive licenses allowing for broad use. Specific licenses should be checked on the Hugging Face Hub for each dataset.

Limitations & Caveats

  • Some community projects are still in progress or have limited language support for translations. Cookbook efforts are designed for standalone use, with varying levels of community guidance.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 90 days
