corus by natasha

Python SDK for Russian text corpora

created 6 years ago
300 stars

Top 89.7% on sourcepulse

Project Summary

This library provides Python functions for loading and parsing a wide variety of Russian-language text corpora, totaling over 350GB. It is aimed at researchers and developers working on Russian NLP tasks who need convenient access to diverse datasets for training and evaluation.

How It Works

Corus acts as a unified interface to numerous Russian text datasets. The datasets themselves are downloaded separately; corus then abstracts away the decompression and parsing of the different file formats (CSV, JSON, XML, TXT, etc.) and data structures. Each supported corpus has a dedicated loader function (e.g., load_lenta, load_wiki) that yields records one at a time. The fields depend on the corpus; a news record, for example, typically contains text, title, topic, and tags. This makes iteration and processing straightforward.
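
The loader pattern described above can be illustrated with a minimal sketch. This is not corus's actual implementation: the names NewsRecord, load_news_csv, write_sample and demo are hypothetical, and the column layout is an assumption chosen to match the fields mentioned above. It shows how a generator can decompress and parse a gzip-compressed CSV lazily, yielding one record at a time.

```python
import csv
import gzip
import os
import tempfile
from collections import namedtuple

# Hypothetical record type; the field names are illustrative, not corus's own classes.
NewsRecord = namedtuple("NewsRecord", ["title", "text", "topic", "tags"])


def load_news_csv(path):
    """Lazily yield one NewsRecord per row of a gzip-compressed CSV file."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as source:
        for row in csv.DictReader(source):
            yield NewsRecord(row["title"], row["text"], row["topic"], row["tags"])


def write_sample(path):
    """Write a tiny two-row corpus so the sketch runs without a real download."""
    rows = [
        {"title": "Заголовок 1", "text": "Первый текст.", "topic": "Наука", "tags": "Космос"},
        {"title": "Заголовок 2", "text": "Второй текст.", "topic": "Спорт", "tags": "Футбол"},
    ]
    with gzip.open(path, mode="wt", encoding="utf-8", newline="") as target:
        writer = csv.DictWriter(target, fieldnames=NewsRecord._fields)
        writer.writeheader()
        writer.writerows(rows)


def demo():
    """Build the sample file in a temporary directory and load it back."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "sample-news.csv.gz")
        write_sample(path)
        return list(load_news_csv(path))
```

Because the loader is a generator, only the current row is held in memory, which matters for corpora that are many gigabytes in size.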

Quick Start & Requirements

  • Install: pip install corus
  • Prerequisites: Python 3.5+ or PyPy 3.
  • Datasets are often large and require manual download via provided wget commands or other methods.
  • Documentation: https://natasha.github.io/corus/
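
Since loaders yield records lazily, a quick way to inspect a freshly downloaded corpus is to take only the first few records. The sketch below is an illustration under stated assumptions: preview and fake_loader are hypothetical helpers, and the commented-out corus call assumes that load_lenta accepts the path of a manually downloaded file (the file name shown is a placeholder).

```python
from itertools import islice


def preview(records, limit=3):
    """Return at most `limit` records without reading the rest of the iterator."""
    return list(islice(records, limit))


def fake_loader():
    """Hypothetical stand-in for a corpus loader: yields records one by one."""
    for index in range(1000000):
        yield {"id": index, "text": "document {}".format(index)}


first_records = preview(fake_loader(), limit=2)

# With corus installed and a corpus downloaded, usage might look like this
# (assumption: the loader takes the path of the downloaded file):
# from corus import load_lenta
# first_records = preview(load_lenta("lenta-ru-news.csv.gz"))
```

Only the requested number of records is ever produced, so the preview stays fast even for very large files.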

Highlighted Details

  • Supports over 20 distinct Russian corpora, including news archives (Lenta.ru, RIA Novosti), fiction (Lib.rus.ec), social media (Twitter), Wikipedia, and specialized datasets for Named Entity Recognition (NER) and morphological analysis.
  • Provides loaders for datasets suited to various NLP tasks, such as text classification, NER, and training word embeddings.
  • Includes datasets with manual annotations for NER (factRuEval-2016, WiNER) and morphological analysis (Universal Dependencies).
  • Offers access to large-scale datasets like Taiga (489.62 GB) and Omnia Russica.

Maintenance & Community

Licensing & Compatibility

  • The corus library itself appears to be under a permissive license, but the underlying datasets have varying licenses. Users must adhere to the specific license terms of each corpus they download and use. Suitability for commercial use depends on the individual dataset licenses.

Limitations & Caveats

  • Many datasets are very large and require significant disk space and download time.
  • Some datasets require manual download steps beyond simple wget commands.
  • Documentation and dataset descriptions are primarily in Russian.
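
Given the disk-space caveat above, it can help to check free space before starting a large download. The following is a minimal sketch using only the standard library; has_room_for and its safety_factor parameter are hypothetical, and the required size must be taken from the dataset's own documentation.

```python
import shutil


def has_room_for(directory, required_bytes, safety_factor=1.1):
    """Return True if `directory` has enough free space for a download.

    `safety_factor` is an assumed margin for temporary and decompressed files.
    """
    free_bytes = shutil.disk_usage(directory).free
    return free_bytes >= required_bytes * safety_factor


# Example: check the current directory for a hypothetical 2 GB archive.
enough_space = has_room_for(".", 2 * 1024 ** 3)
```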

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 4 stars in the last 90 days
