s2orc by allenai

Corpus for NLP/text mining research on scientific papers

Created 6 years ago
979 stars

Top 37.8% on SourcePulse

Project Summary

S2ORC provides a comprehensive corpus of scientific papers for NLP and text mining research. It is designed for researchers and developers working with large-scale scientific literature, offering full-text access and structured metadata to facilitate advanced analysis and model training.

How It Works

S2ORC is derived from the Semantic Scholar dataset and focuses on machine-readable full text extracted from paper PDFs. The corpus is processed and distributed through the Semantic Scholar Public API, which gives users continuously updated versions as bulk dataset downloads. Distributing through the API keeps the data current without relying on separate static research releases.

Quick Start & Requirements

  • Access: Through the Semantic Scholar Public API; an API key is required.
  • Download: A Python script uses requests to query the API and wget to download the data shards.
  • Prerequisites: Python 3, requests, wget, tqdm.
  • Documentation: Semantic Scholar Public API
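
The access and download steps above can be sketched as follows. This is a minimal illustration, not the project's own script: it uses the standard library's urllib in place of requests and wget so that it has no dependencies, and the endpoint path, the x-api-key header, and the "files" response key are assumptions that should be checked against the Semantic Scholar Public API documentation. The function names are hypothetical.

```python
import json
import urllib.request

# Assumed base URL of the datasets endpoints; verify against the official docs.
DATASETS_BASE = "https://api.semanticscholar.org/datasets/v1"


def shard_listing_request(api_key, release_id="latest", dataset="s2orc"):
    """Build the authenticated request that asks for a dataset's shard list."""
    url = f"{DATASETS_BASE}/release/{release_id}/dataset/{dataset}"
    # Assumption: the API key is passed in an "x-api-key" header.
    return urllib.request.Request(url, headers={"x-api-key": api_key})


def parse_shard_urls(payload):
    """Extract shard download URLs from the JSON response body."""
    # Assumption: the response carries the URLs under a "files" key.
    return list(json.loads(payload).get("files", []))


def list_shard_urls(api_key, release_id="latest", opener=urllib.request.urlopen):
    """Fetch the shard list; `opener` is injectable so it can be faked offline."""
    request = shard_listing_request(api_key, release_id)
    with opener(request, timeout=30) as response:
        return parse_shard_urls(response.read())


def download_shard(url, dest_path, opener=urllib.request.urlopen, chunk_size=1 << 20):
    """Stream one shard to disk in chunks so large files never sit in memory."""
    with opener(url, timeout=60) as response, open(dest_path, "wb") as out:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
    return dest_path
```

A typical run would call list_shard_urls with the API key, then pass each returned URL to download_shard with a local file name of the caller's choosing.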

Highlighted Details

  • Provides access to over 12 million full-text scientific papers.
  • Continuously updated with new publications.
  • Includes structured metadata such as title, authors, year, venue, and abstract.
  • Original research versions are deprecated; API access is recommended.
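
As an illustration of working with the structured metadata listed above, the sketch below reads records from a downloaded shard and keeps only the named fields. It assumes that a shard is a gzip-compressed JSON Lines file and that the fields appear as top-level keys with exactly these names; neither assumption is confirmed by this page, so the real schema should be checked in the API documentation. All names in the code are hypothetical.

```python
import gzip
import json

# Hypothetical field names taken from the list above; the real schema may differ.
METADATA_FIELDS = ("title", "authors", "year", "venue", "abstract")


def iter_records(shard_path):
    """Yield one dict per non-empty line of a gzip-compressed JSON Lines file."""
    with gzip.open(shard_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def pick_metadata(record, fields=METADATA_FIELDS):
    """Return only the requested fields, using None for any that are absent."""
    return {name: record.get(name) for name in fields}
```

Reading lazily with a generator keeps memory use flat regardless of shard size, which matters for a corpus of this scale.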

Maintenance & Community

S2ORC is now maintained by the Semantic Scholar API product team. For support, bug reports, or feature requests, users are directed to search and create issues on the s2-folks GitHub repo. Contact is primarily via email.

Licensing & Compatibility

Released under the Open Data Commons Attribution License (ODC-By 1.0). Users must comply with the license terms, which permit commercial use but require attribution.

Limitations & Caveats

Original S2ORC dataset files are no longer available for download. The current version is a reimplementation within the Semantic Scholar data pipeline, so its output may differ in low-level details from the older research releases. Users should confirm that their intended use is permitted under the ODC-By 1.0 license.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 2
  • Star history: 10 stars in the last 30 days
