Corpus for NLP/text mining research on scientific papers
S2ORC provides a comprehensive corpus of scientific papers for NLP and text mining research. It is designed for researchers and developers working with large-scale scientific literature, offering full-text access and structured metadata to facilitate advanced analysis and model training.
How It Works
S2ORC is derived from the Semantic Scholar dataset, focusing on machine-readable full text extracted from paper PDFs. The corpus is processed and made available through the Semantic Scholar Public API, allowing users to access continuously updated versions and bulk dataset downloads. This approach ensures data freshness and scalability.
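The bulk-download path described above can be sketched roughly as follows. This is a minimal illustration, not the project's own script. The base URL, the `x-api-key` header, the `latest` release alias, and the `files` key in the response are assumptions about the Semantic Scholar Datasets API and should be checked against its current documentation. The helper names (`dataset_url`, `shard_filename`, `list_shard_urls`, `download_shards`) are hypothetical, and the standard library's `urllib` stands in for `requests` and `wget` to keep the sketch dependency-free.

```python
import json
import os
import urllib.request
from typing import List
from urllib.parse import urlparse

# Assumed base URL of the Semantic Scholar Datasets API (not stated in the text above).
API_BASE = "https://api.semanticscholar.org/datasets/v1"


def dataset_url(release_id: str, dataset_name: str = "s2orc") -> str:
    """Build the URL that lists the download links for one dataset in a release."""
    return f"{API_BASE}/release/{release_id}/dataset/{dataset_name}"


def shard_filename(url: str) -> str:
    """Derive a local file name from a download link, dropping any query string."""
    return os.path.basename(urlparse(url).path)


def list_shard_urls(api_key: str, release_id: str = "latest") -> List[str]:
    """Fetch the download links for the s2orc shards (performs a network call)."""
    request = urllib.request.Request(
        dataset_url(release_id),
        headers={"x-api-key": api_key},  # assumed header name for the API key
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    # Assumption: the response JSON carries the links under a "files" key.
    return payload["files"]


def download_shards(api_key: str, target_dir: str) -> None:
    """Download every shard of the chosen release into target_dir."""
    os.makedirs(target_dir, exist_ok=True)
    for url in list_shard_urls(api_key):
        destination = os.path.join(target_dir, shard_filename(url))
        urllib.request.urlretrieve(url, destination)
```

Calling `download_shards("YOUR_API_KEY", "./s2orc")` would then fetch the shards of the latest release, assuming the key is valid and the endpoints behave as described. The URL and file-name helpers are kept separate from the network calls so they can be checked offline.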
Quick Start & Requirements
Uses requests and wget to download data shards.
Requirements: requests, wget, tqdm.
Highlighted Details
Maintenance & Community
S2ORC is now maintained by the Semantic Scholar API product team. For support, bug reports, or feature requests, search the existing issues on the s2-folks GitHub repo or open a new one. Contact is primarily via email.
Licensing & Compatibility
Released under the Open Data Commons Attribution License (ODC-By 1.0). Users must comply with the license terms, which permit commercial use but require attribution.
Limitations & Caveats
Original S2ORC dataset files are no longer available for download. The current version is a reimplementation within the Semantic Scholar data pipeline, so low-level details may differ from older research releases. Users should confirm that their intended use is permitted under the ODC-By 1.0 license.