s2orc by allenai

Corpus for NLP/text mining research on scientific papers

created 5 years ago
955 stars

Top 39.3% on sourcepulse

Project Summary

S2ORC provides a comprehensive corpus of scientific papers for NLP and text mining research. It is designed for researchers and developers working with large-scale scientific literature, offering full-text access and structured metadata to facilitate advanced analysis and model training.

How It Works

S2ORC is derived from the Semantic Scholar dataset and focuses on machine-readable full text extracted from paper PDFs. The corpus is processed and distributed through the Semantic Scholar Public API, which gives users access to continuously updated versions as bulk dataset downloads. Because the data is served from the live pipeline rather than from static archives, users work with a current version of the corpus.

Quick Start & Requirements

  • Access: Through the Semantic Scholar Public API; an API key is required.
  • Download: A Python script uses requests and wget to download the data shards.
  • Prerequisites: Python 3 with the requests, wget, and tqdm packages.
  • Documentation: Semantic Scholar Public API
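
The download step above can be sketched roughly as follows. This is a minimal illustration, not the project's own script: it uses only the Python standard library (urllib) in place of requests and wget, and the base URL, the "latest" release alias, the x-api-key header, and the "files" key in the response are assumptions that should be checked against the Semantic Scholar Public API documentation. The helper names (shard_filename, plan_downloads, fetch_listing, download_all) are hypothetical.

```python
import json
import os
import urllib.request
from typing import Dict, List, Tuple
from urllib.parse import urlparse

# Assumption: base URL of the Semantic Scholar Datasets API; verify in the API docs.
BASE_URL = "https://api.semanticscholar.org/datasets/v1"


def shard_filename(url: str, index: int) -> str:
    """Derive a local file name from a shard URL, ignoring any query string."""
    name = os.path.basename(urlparse(url).path)
    # Fall back to a numbered name when the URL path has no file component.
    return name or f"shard_{index:04d}.gz"


def plan_downloads(listing: Dict, out_dir: str) -> List[Tuple[str, str]]:
    """Pair each shard URL in the listing with a local target path."""
    # Assumption: the listing stores shard URLs under a "files" key.
    urls = listing.get("files", [])
    return [(url, os.path.join(out_dir, shard_filename(url, i)))
            for i, url in enumerate(urls)]


def fetch_listing(api_key: str, dataset: str = "s2orc",
                  release: str = "latest") -> Dict:
    """Request the shard listing for one dataset (performs a network call)."""
    request = urllib.request.Request(
        f"{BASE_URL}/release/{release}/dataset/{dataset}",
        headers={"x-api-key": api_key},  # assumed header name for the API key
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def download_all(api_key: str, out_dir: str) -> None:
    """Download every shard in the listing into out_dir, one file at a time."""
    os.makedirs(out_dir, exist_ok=True)
    for url, path in plan_downloads(fetch_listing(api_key), out_dir):
        urllib.request.urlretrieve(url, path)
```

Calling download_all("YOUR_API_KEY", "s2orc_shards") would start the transfer. The planning step is kept separate from the network calls so that the URL-to-path mapping can be inspected before a long download begins.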

Highlighted Details

  • Provides access to over 12 million full-text scientific papers.
  • Continuously updated with new publications.
  • Includes structured metadata such as title, authors, year, venue, and abstract.
  • Original research versions are deprecated; API access is recommended.
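
To illustrate how the metadata listed above might be read from a downloaded shard, here is a minimal sketch. It assumes that each shard is a gzip-compressed file with one JSON record per line, and that the metadata fields appear as top-level keys with the names used above; both are assumptions about the file layout, not a description of the actual schema, so check a real record before relying on them. The function names iter_records and pick_metadata are hypothetical.

```python
import gzip
import json
from typing import Dict, Iterator, Sequence

# Assumption: field names mirror the metadata listed above and are top-level keys.
DEFAULT_FIELDS = ("title", "authors", "year", "venue", "abstract")


def iter_records(path: str) -> Iterator[Dict]:
    """Yield one parsed JSON record per non-empty line of a gzipped shard."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def pick_metadata(record: Dict,
                  fields: Sequence[str] = DEFAULT_FIELDS) -> Dict:
    """Return only the requested fields; missing fields map to None."""
    return {name: record.get(name) for name in fields}
```

Streaming the file line by line keeps memory use flat regardless of shard size, which matters for a corpus of this scale.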

Maintenance & Community

S2ORC is now maintained by the Semantic Scholar API product team. For support, bug reports, or feature requests, users should search existing issues or open a new one on the s2-folks GitHub repo. Contact is primarily via email.

Licensing & Compatibility

Released under the Open Data Commons Attribution License (ODC-By 1.0). Users must comply with the license terms, which permit commercial use but require attribution.

Limitations & Caveats

Original S2ORC dataset files are no longer available for download. The current version is a reimplementation within the Semantic Scholar data pipeline, so there may be low-level implementation differences compared to older research releases. Users should confirm that their intended use is permitted under the ODC-By 1.0 license.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 40 stars in the last 90 days
