data-is-better-together by huggingface

Datasets for community-driven AI model training and evaluation

Created 2 years ago

273 stars

Top 94.3% on SourcePulse

1 Expert Loves This Project

Omar Sanseviero (osanseviero), DevRel at Google DeepMind

Project Summary

This initiative brings the open-source community together to build datasets for training and evaluating AI models. It targets researchers and developers who want high-quality, community-vetted datasets, and it offers curated resources and tools for data creation and annotation.

How It Works

The project comprises two main components: community efforts and cookbook efforts. Community efforts are hands-on projects guided by Hugging Face, such as prompt ranking and image preference annotation, that rely on community participation to create large-scale datasets. Cookbook efforts provide standalone guides and tools that let users independently build domain-specific datasets or preference datasets for methods such as DPO, ORPO, and KTO.
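To make the preference-data formats more concrete, here is a minimal sketch of how rated responses could be turned into a paired record (the shape typically used for DPO and ORPO) and into unpaired labeled records (the shape typically used for KTO). The class names, field names, helper functions, and the rating threshold are illustrative assumptions, not code taken from the project.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PreferencePair:
    """Paired record: one preferred and one dispreferred response (hypothetical schema)."""
    prompt: str
    chosen: str
    rejected: str


@dataclass
class UnpairedExample:
    """Unpaired record: a single response with a desirable/undesirable label (hypothetical schema)."""
    prompt: str
    completion: str
    label: bool


def to_preference_pair(
    prompt: str, rated: List[Tuple[str, float]]
) -> Optional[PreferencePair]:
    """Pair the highest-rated response with the lowest-rated one.

    Returns None when there are fewer than two responses or when all
    ratings are equal, because no preference can be derived.
    """
    if len(rated) < 2:
        return None
    ordered = sorted(rated, key=lambda item: item[1], reverse=True)
    best, worst = ordered[0], ordered[-1]
    if best[1] == worst[1]:
        return None
    return PreferencePair(prompt=prompt, chosen=best[0], rejected=worst[0])


def to_unpaired_examples(
    prompt: str, rated: List[Tuple[str, float]], threshold: float
) -> List[UnpairedExample]:
    """Label each response as desirable when its rating meets the threshold."""
    return [
        UnpairedExample(prompt=prompt, completion=response, label=score >= threshold)
        for response, score in rated
    ]
```

The paired form needs at least two responses with different ratings per prompt, while the unpaired form can use every rated response on its own, which is one reason a project might choose one format over the other.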

Quick Start & Requirements

Community Datasets: Available on the Hugging Face Hub (e.g., data-is-better-together/10k_prompts_ranked, data-is-better-together/open-image-preferences-v1-binarized).
Cookbook Tools: Instructions and guides are provided within individual project READMEs (e.g., cookbook-efforts/domain-specific-datasets/README.md).
Dependencies: Primarily relies on Hugging Face libraries and tools. Specific requirements vary per project.
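As a rough sketch of how the community datasets listed above might be pulled from the Hub, the snippet below wraps `load_dataset` from the Hugging Face `datasets` library. The helper name `load_community_dataset` is hypothetical, and the default split name `"train"` is an assumption; check each dataset card for the splits it actually provides.

```python
# Repository IDs as listed in this summary.
COMMUNITY_DATASETS = (
    "data-is-better-together/10k_prompts_ranked",
    "data-is-better-together/open-image-preferences-v1-binarized",
)


def load_community_dataset(repo_id: str, split: str = "train"):
    """Download one split of a Hub dataset (hypothetical helper).

    The import is kept inside the function so the module can be imported
    without the `datasets` package installed (pip install datasets).
    The "train" default is an assumption, not confirmed by the source.
    """
    from datasets import load_dataset

    return load_dataset(repo_id, split=split)


# Example usage (requires network access and the `datasets` package):
#   prompts = load_community_dataset(COMMUNITY_DATASETS[0])
#   print(prompts)
```

Inspecting the printed dataset object before use is a quick way to confirm the available columns, since this summary does not document them.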

Highlighted Details

Successfully created and released data-is-better-together/10k_prompts_ranked with over 385 contributors.
Developed multilingual benchmarks (MPEP) by translating high-quality prompts into Dutch, Russian, and Spanish.
Generated 10K text-to-image preference pairs for evaluating image generation models.
Cookbook efforts aim to facilitate the creation of domain-specific datasets and preference data (DPO, ORPO, KTO).

Maintenance & Community

A collaboration between Hugging Face, Argilla, and the open-source ML community.
Community participation is encouraged via Hugging Face Discord and project-specific READMEs.

Licensing & Compatibility

Datasets are typically released under permissive licenses allowing for broad use. Specific licenses should be checked on the Hugging Face Hub for each dataset.

Limitations & Caveats

Some community projects are still in progress or have limited language support for translations. Cookbook efforts are designed for standalone use, with varying levels of community guidance.

Health Check

Last Commit: 1 month ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0

Star History

1 star in the last 30 days

Explore Similar Projects

Open-Qwen2VL by Victorwz
Multimodal LLM pre-training and fine-tuning
Created 1 year ago. Updated 10 months ago.
Starred by Wing Lian (Founder of Axolotl AI).

awesome-synthetic-datasets by davanstrien
Curated list of synthetic text/vision datasets and generation tools
Created 2 years ago. Updated 6 months ago.
Starred by Nathan Lambert (Research Scientist at AI2).

RLAIF-V by RLHF-V
Framework for aligning MLLMs using open-source AI feedback
Created 2 years ago. Updated 1 year ago.

FlagEval by flageval-baai
Evaluation toolkit for large AI foundation models
Created 3 years ago. Updated 1 year ago.

MedTrinity-25M by UCSC-VLAA
Large-scale multimodal dataset for medicine research
Created 1 year ago. Updated 1 year ago.

Awesome-CV-Foundational-Models by awaisrauf
Vision-language survey paper with curated list of foundational CV models
Created 3 years ago. Updated 1 year ago.

Awesome_Matching_Pretraining_Transfering by Paranioar
Curated paper list for multimodal AI research
Created 5 years ago. Updated 9 months ago.
Starred by Phil Wang (Prolific Research Paper Implementer).

molmo by allenai
Multimodal open language model code, training, and evaluation
Created 1 year ago. Updated 1 year ago.
Starred by Jesse Clark (Cofounder of Marqo) and Elvis Saravia (Founder of DAIR.AI).

TextBox by RUCAIBox
Text generation library with pre-trained language models
Created 5 years ago. Updated 3 years ago.

Awesome-LLMs-Datasets by lmmlzn
LLM datasets survey for pre-training, fine-tuning, preference, evaluation, and NLP
Created 2 years ago. Updated 4 months ago.
Starred by Jesse Clark (Cofounder of Marqo), Luca Soldaini (Research Scientist at Ai2), and 5 more.

llm-datasets by mlabonne
Curated datasets/tools for LLM post-training
Created 2 years ago. Updated 2 months ago.
Starred by Omar Sanseviero (DevRel at Google DeepMind), Maxime Labonne (Head of Post-Training at Liquid AI), and 9 more.

argilla by argilla-io
Collaboration tool for building high-quality AI datasets
Created 5 years ago. Updated 1 week ago.
