s2orc by allenai

Corpus for NLP/text mining research on scientific papers

Created 6 years ago

1,072 stars

Top 34.6% on SourcePulse

3 Experts Love This Project

Luca Soldaini

Research Scientist at Ai2

Shizhe Diao

Author of LMFlow; Research Scientist at NVIDIA

Jeff Hammerbacher

Cofounder of Cloudera

Project Summary

S2ORC provides a comprehensive corpus of scientific papers for NLP and text mining research. It is designed for researchers and developers working with large-scale scientific literature, offering full-text access and structured metadata to facilitate advanced analysis and model training.

How It Works

S2ORC is derived from the Semantic Scholar dataset and focuses on machine-readable full text extracted from paper PDFs. The corpus is distributed through the Semantic Scholar Public API, which serves continuously updated versions as bulk dataset downloads, so users work from current data rather than a fixed snapshot.

Quick Start & Requirements

Access: Via the Semantic Scholar Public API; an API key is required.
Download: A Python script uses requests and wget to download the data shards.
Prerequisites: Python 3, requests, wget, tqdm.
Documentation: Semantic Scholar Public API
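
The download flow described above can be sketched roughly as follows. This is a minimal sketch, not the project's own script: the base URL, the `release/latest/dataset/s2orc` path, the `x-api-key` header, and the `files` key in the response are assumptions about the Semantic Scholar Datasets API that should be checked against the official documentation. The helper names (`shard_listing_url`, `list_shard_urls`, `download_shards`) are hypothetical.

```python
import os
from typing import Callable, Dict, List, Optional
from urllib.parse import urlsplit

# Assumption: base URL of the Semantic Scholar Datasets API.
DATASETS_API = "https://api.semanticscholar.org/datasets/v1"

JsonFetcher = Callable[[str, Dict[str, str]], dict]


def shard_listing_url(release_id: str = "latest", dataset: str = "s2orc") -> str:
    """Build the URL assumed to list the download links for one dataset."""
    return f"{DATASETS_API}/release/{release_id}/dataset/{dataset}"


def auth_headers(api_key: str) -> Dict[str, str]:
    """Assumption: the API key is passed in an `x-api-key` header."""
    return {"x-api-key": api_key}


def shard_filename(url: str, index: int) -> str:
    """Derive a local file name from a shard URL, ignoring any query string."""
    name = os.path.basename(urlsplit(url).path)
    return name or f"shard_{index:05d}.gz"


def _fetch_with_requests(url: str, headers: Dict[str, str]) -> dict:
    import requests  # third-party, imported lazily

    response = requests.get(url, headers=headers, timeout=60)
    response.raise_for_status()
    return response.json()


def list_shard_urls(api_key: str, fetch_json: Optional[JsonFetcher] = None) -> List[str]:
    """Return the shard URLs, assuming the response lists them under `files`."""
    fetch = fetch_json or _fetch_with_requests
    payload = fetch(shard_listing_url(), auth_headers(api_key))
    return list(payload.get("files", []))


def download_shards(api_key: str, out_dir: str) -> List[str]:
    """Download every shard into out_dir and return the local paths."""
    import wget  # third-party, imported lazily
    from tqdm import tqdm  # third-party, imported lazily

    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for index, url in enumerate(tqdm(list_shard_urls(api_key))):
        target = os.path.join(out_dir, shard_filename(url, index))
        wget.download(url, out=target)
        paths.append(target)
    return paths
```

Calling `download_shards("YOUR_API_KEY", "./s2orc")` would then fetch all shards, which can take a long time for a corpus of this size. The `fetch_json` parameter exists only so the listing step can be exercised without network access.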

Highlighted Details

Provides access to over 12 million full-text scientific papers.
Continuously updated with new publications.
Includes structured metadata such as title, authors, year, venue, and abstract.
Original research versions are deprecated; API access is recommended.
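
As an illustration of working with the metadata listed above, the sketch below reads records from one downloaded shard and keeps only those fields. It assumes each shard is a gzip-compressed file with one JSON object per line, and that the keys match the field names given here (`title`, `authors`, `year`, `venue`, `abstract`). Neither assumption is confirmed by this page, and the real schema may nest or rename these fields.

```python
import gzip
import json
from typing import Dict, Iterator

# Assumption: record keys mirror the metadata names listed in this summary.
METADATA_FIELDS = ("title", "authors", "year", "venue", "abstract")


def iter_records(shard_path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a gzipped shard."""
    with gzip.open(shard_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def pick_metadata(record: dict) -> Dict[str, object]:
    """Keep only the listed metadata fields; missing ones become None."""
    return {field: record.get(field) for field in METADATA_FIELDS}
```

Streaming line by line keeps memory use flat, which matters when a shard holds many full-text papers.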

Maintenance & Community

S2ORC is now maintained by the Semantic Scholar API product team. For support, bug reports, or feature requests, users should search existing issues on the s2-folks GitHub repo and open a new one if needed. Direct contact is primarily via email.

Licensing & Compatibility

Released under the Open Data Commons Attribution License (ODC-By 1.0). Users must comply with the license terms, which permit commercial use but require attribution.

Limitations & Caveats

Original S2ORC dataset files are no longer available for download. The current version is a reimplementation within the Semantic Scholar data pipeline, so it may differ in low-level implementation details from older research releases. Users should confirm that their intended use is permitted under the ODC-By 1.0 license.

Health Check

Last Commit

2 years ago

Responsiveness

Inactive

Star History

8 stars in the last 30 days