awesome-matchem-datasets  by blaiszik

Curated AI/ML datasets for chemistry and materials science research

Created 7 months ago
251 stars

Top 99.8% on SourcePulse

Project Summary

This repository serves as a comprehensive, community-curated catalog of datasets for training machine learning and AI foundation models in materials science and chemistry. It targets researchers and developers by aggregating experimental, computational, and literature-mined data, prioritizing open-access resources to foster reproducible research and accelerate AI-driven discovery in these fields.

How It Works

The project is an organized list that categorizes datasets by domain, type, quality, and size. It emphasizes open-access resources and relies on community contributions for expansion and refinement, with the aim of keeping a centralized, up-to-date, community-vetted resource in a rapidly evolving research landscape.

Quick Start & Requirements

This repository is a curated list and does not require installation or execution as a software package. Users can browse the categorized lists of datasets and follow provided links to access and download data from their original sources. Each dataset has its own specific requirements and access methods.

Highlighted Details

  • Vast collection spanning Computational (DFT, MD), Experimental, LLM Training, Literature-mined, and Engineering (CFD, PDE) domains.
  • Features numerous large-scale datasets, some containing billions of data points or millions of calculations (e.g., QCML, OMol25, ChemPile).
  • Regularly updated with new datasets, as evidenced by changelog entries from mid-2025 onwards.
  • Detailed metadata for each dataset, including domain, size, type, and format, aids targeted discovery.
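
To illustrate how per-dataset metadata can support targeted discovery, here is a minimal Python sketch that filters catalog rows by domain, type, and format. It rests on assumptions: the `DatasetEntry` fields, the `find_datasets` helper, and the placeholder rows are all hypothetical and do not reflect the repository's actual schema or contents.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class DatasetEntry:
    """One catalog row; field names are illustrative, not the repository's schema."""
    name: str
    domain: str      # e.g. "Computational", "Experimental"
    data_type: str   # e.g. "DFT", "MD"
    fmt: str         # e.g. "JSON", "HDF5"
    url: str         # link to the original source


def find_datasets(
    entries: Iterable[DatasetEntry],
    domain: Optional[str] = None,
    data_type: Optional[str] = None,
    fmt: Optional[str] = None,
) -> List[DatasetEntry]:
    """Return entries matching every filter that is set (case-insensitive)."""

    def matches(value: str, wanted: Optional[str]) -> bool:
        # A filter left as None places no constraint on that field.
        return wanted is None or value.lower() == wanted.lower()

    return [
        e for e in entries
        if matches(e.domain, domain)
        and matches(e.data_type, data_type)
        and matches(e.fmt, fmt)
    ]


# Placeholder rows only; they do not describe real datasets.
CATALOG = [
    DatasetEntry("placeholder-a", "Computational", "DFT", "JSON", "https://example.org/a"),
    DatasetEntry("placeholder-b", "Computational", "MD", "HDF5", "https://example.org/b"),
    DatasetEntry("placeholder-c", "Experimental", "Spectra", "CSV", "https://example.org/c"),
]

if __name__ == "__main__":
    for entry in find_datasets(CATALOG, domain="computational"):
        print(entry.name, entry.url)
```

In practice the rows would have to be transcribed or parsed from the catalog itself; the sketch only shows the filtering step.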

Maintenance & Community

The repository is community-driven, actively soliciting contributions for new datasets or metadata improvements via pull requests or issues. Acknowledgements highlight significant contributions from major research institutions and open data communities, indicating a collaborative ecosystem.

Licensing & Compatibility

The catalog itself is licensed under the MIT License, but each listed dataset retains its own license. Users are explicitly advised to check the source's license before incorporating any data into their projects, particularly where compatibility or commercial use matters.
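
One way to make the per-dataset license check routine is to record each dataset's license as it is collected and flag anything outside a project-specific allow-list. The sketch below assumes the licenses have already been looked up by hand at each source. The `ALLOWED_LICENSES` set, the `split_by_license` helper, and the example rows are hypothetical and are not a recommendation about which licenses suit any particular use.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical project policy, written as SPDX-style identifiers.
ALLOWED_LICENSES: Set[str] = {"MIT", "CC-BY-4.0", "CC0-1.0"}


def split_by_license(
    dataset_licenses: Dict[str, str],
    allowed: Set[str] = ALLOWED_LICENSES,
) -> Tuple[List[str], List[str]]:
    """Split dataset names into (cleared, needs_review) by license identifier."""
    cleared: List[str] = []
    needs_review: List[str] = []
    for name, license_id in dataset_licenses.items():
        if license_id in allowed:
            cleared.append(name)
        else:
            # Unknown or unlisted licenses are flagged rather than assumed safe.
            needs_review.append(name)
    return sorted(cleared), sorted(needs_review)


# Placeholder rows only; licenses must be confirmed at each dataset's source.
EXAMPLE = {
    "placeholder-a": "CC-BY-4.0",
    "placeholder-b": "CC-BY-NC-4.0",
    "placeholder-c": "unknown",
}

if __name__ == "__main__":
    cleared, needs_review = split_by_license(EXAMPLE)
    print("cleared:", cleared)
    print("needs review:", needs_review)
```

Anything in the second list still needs a manual read of the original license text; the helper only organizes that work.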

Limitations & Caveats

As a catalog, this repository does not host or distribute the datasets directly; users must obtain them from external sources, each with its own terms and accessibility. The utility and accuracy of the listed data depend entirely on the original providers, and the catalog's comprehensiveness relies on ongoing community engagement.

Health Check

  • Last Commit: 4 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 6
  • Issues (30d): 3
  • Star History: 19 stars in the last 30 days
