blaiszik
Curated AI/ML datasets for chemistry and materials science research
Top 99.8% on SourcePulse
This repository is a community-curated catalog of datasets for training machine learning and AI foundation models in materials science and chemistry. Aimed at researchers and developers, it aggregates experimental, computational, and literature-mined data, and it prioritizes open-access resources to support reproducible research and AI-driven discovery in these fields.
How It Works
The project is an organized list that categorizes datasets by domain, type, quality, and size. It emphasizes open-access resources and relies on community contributions for expansion and refinement. The goal of this collaborative approach is a centralized, current, community-vetted resource for a rapidly evolving research landscape.
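The categorization described above can be pictured as a small record type plus a filter. The following is a minimal sketch only: the `DatasetEntry` fields, the `filter_entries` helper, and the example entries are all hypothetical and are not taken from the repository, which is a human-readable list rather than a Python package. The quality attribute is omitted because the source does not say how it is expressed.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class DatasetEntry:
    """Hypothetical catalog record; field names are assumptions."""
    name: str
    domain: str        # e.g. "materials" or "chemistry"
    data_type: str     # "experimental", "computational", or "literature-mined"
    n_records: int     # rough size of the dataset
    open_access: bool  # whether the data is openly accessible


def filter_entries(
    entries: Iterable[DatasetEntry],
    domain: Optional[str] = None,
    data_type: Optional[str] = None,
    min_records: int = 0,
    open_only: bool = True,
) -> List[DatasetEntry]:
    """Return entries matching every criterion that was supplied."""
    selected = []
    for entry in entries:
        if domain is not None and entry.domain != domain:
            continue
        if data_type is not None and entry.data_type != data_type:
            continue
        if entry.n_records < min_records:
            continue
        if open_only and not entry.open_access:
            continue
        selected.append(entry)
    return selected


# Made-up placeholder entries, not real datasets from the catalog.
CATALOG = [
    DatasetEntry("example-dft-set", "materials", "computational", 100_000, True),
    DatasetEntry("example-reaction-set", "chemistry", "experimental", 5_000, True),
    DatasetEntry("example-mined-set", "chemistry", "literature-mined", 20_000, False),
]

chemistry_open = filter_entries(CATALOG, domain="chemistry")
print([entry.name for entry in chemistry_open])
```

Defaulting `open_only` to `True` mirrors the catalog's stated preference for open-access resources; pass `open_only=False` to see every entry.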
Quick Start & Requirements
This repository is a curated list and does not require installation or execution as a software package. Users can browse the categorized lists of datasets and follow provided links to access and download data from their original sources. Each dataset has its own specific requirements and access methods.
Highlighted Details
Maintenance & Community
The repository is community-driven, actively soliciting contributions for new datasets or metadata improvements via pull requests or issues. Acknowledgements highlight significant contributions from major research institutions and open data communities, indicating a collaborative ecosystem.
Licensing & Compatibility
The catalog itself is licensed under the MIT License, but each listed dataset retains its own license. Users are explicitly advised to check the source's license before incorporating any data into their projects, since that license, not the catalog's, determines compatibility and whether commercial use is permitted.
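One way to make that per-dataset license check routine is to record each source's license as you collect data and compare it against an allowlist you have agreed on for your project. The sketch below is an assumption-laden illustration: the `is_license_acceptable` helper, the allowlist contents, and the dataset names are hypothetical, and the repository provides no such tooling. The license strings are SPDX identifiers.

```python
from typing import Dict, Iterable, Optional

# Hypothetical allowlist for a project that needs commercial-use rights.
# Choose your own set after reading each license.
ALLOWED_LICENSES = {"MIT", "CC0-1.0", "CC-BY-4.0"}


def is_license_acceptable(
    license_id: Optional[str],
    allowed: Iterable[str] = ALLOWED_LICENSES,
) -> bool:
    """Return True only if the license is known and on the allowlist.

    A missing or unrecognized license is treated as not acceptable,
    so it has to be reviewed by hand before the data is used.
    """
    if license_id is None:
        return False
    return license_id.strip() in set(allowed)


# Made-up placeholder records: dataset name -> license noted at its source.
recorded_licenses: Dict[str, Optional[str]] = {
    "example-dft-set": "CC-BY-4.0",
    "example-reaction-set": "CC-BY-NC-4.0",  # non-commercial clause
    "example-mined-set": None,               # license not found yet
}

needs_review = [
    name
    for name, license_id in recorded_licenses.items()
    if not is_license_acceptable(license_id)
]
print(needs_review)
```

Treating an unknown license as a failure is a deliberately conservative choice: it surfaces datasets whose terms still need to be read at the original source.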
Limitations & Caveats
As a catalog, this repository does not host or distribute the datasets directly; users must obtain them from external sources, each with its own terms and accessibility. The utility and accuracy of the listed data depend entirely on the original providers, and the catalog's comprehensiveness relies on ongoing community engagement.