corus by natasha

Python SDK for Russian text corpora

Created 6 years ago
302 stars

Top 88.4% on SourcePulse

Project Summary

This library provides Python functions for loading and parsing a wide range of Russian-language text corpora, totaling over 350 GB. It is aimed at researchers and developers working on Russian NLP tasks who need convenient access to diverse datasets for training and evaluation.

How It Works

Corus acts as a unified interface to numerous Russian text datasets. It abstracts away the complexities of decompressing and parsing different file formats (CSV, JSON, XML, TXT, etc.) and data structures; the data files themselves are downloaded separately. Each supported corpus has a dedicated loader function (e.g., load_lenta, load_wiki) that yields records, typically containing fields such as text, title, topic, and tags, so a corpus can be processed with a plain loop.
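
The loader pattern described above can be sketched with the standard library alone. This is a minimal illustration, not the corus implementation: `NewsRecord` and `load_news` are hypothetical names, and the CSV column names are assumed to match the fields listed in the summary.

```python
import csv
import gzip
from typing import Iterator, NamedTuple


class NewsRecord(NamedTuple):
    # Hypothetical record type; the fields mirror those named in the summary.
    title: str
    text: str
    topic: str
    tags: str


def load_news(path: str) -> Iterator[NewsRecord]:
    """Lazily yield one record per row of a gzip-compressed CSV file.

    Assumes the file has a header row with title, text, topic, and tags columns.
    """
    # Text mode lets the csv module read decoded lines straight from the archive.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as file:
        for row in csv.DictReader(file):
            yield NewsRecord(row["title"], row["text"], row["topic"], row["tags"])
```

Because the function is a generator, nothing is read until the caller starts iterating, and only one row is held in memory at a time.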

Quick Start & Requirements

  • Install: pip install corus
  • Prerequisites: Python 3.5+ or PyPy 3.
  • Datasets are often large and require manual download via provided wget commands or other methods.
  • Documentation: https://natasha.github.io/corus/
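
For readers who prefer to script the manual download step in Python rather than run wget, the following is one possible sketch. The `fetch` helper is hypothetical and not part of corus, and the URL and destination path are supplied by the caller; the per-corpus download addresses are the ones given in the project documentation.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch(url: str, dest: Path) -> Path:
    """Download url to dest, skipping the download if dest already exists."""
    if dest.exists():
        return dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted download
    # is never mistaken for a complete file on the next run.
    partial = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        # Copy in chunks instead of reading the whole archive into memory.
        shutil.copyfileobj(response, out)
    partial.replace(dest)
    return dest
```

The existence check makes the helper safe to call at the top of a script: large archives are fetched once and reused on later runs.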

Highlighted Details

  • Supports over 20 distinct Russian corpora, including news archives (Lenta.ru, RIA Novosti), fiction (Lib.rus.ec), social media (Twitter), Wikipedia, and specialized datasets for Named Entity Recognition (NER) and morphological analysis.
  • Provides loaders for various NLP tasks, such as text classification, NER, and word embeddings.
  • Includes datasets with manual annotations for NER (factRuEval-2016, WiNER) and morphological analysis (Universal Dependencies).
  • Offers access to large-scale datasets like Taiga (489.62 GB) and Omnia Russica.

Licensing & Compatibility

  • The corus library itself appears to be under a permissive license, but the underlying datasets have varying licenses. Users must adhere to the specific license terms for each corpus they download and use. Compatibility for commercial use depends on the individual dataset licenses.

Limitations & Caveats

  • Many datasets are very large and require significant disk space and download time.
  • Some datasets require manual download steps beyond simple wget commands.
  • Documentation and dataset descriptions are primarily in Russian.
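
Given the dataset sizes noted above, it can help to inspect only the first records of a corpus before committing to a full pass. The sketch below assumes records are produced lazily by an iterator, as a generator-style loader would do; `Record` and `topic_counts` are hypothetical names used for illustration.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, NamedTuple


class Record(NamedTuple):
    # Hypothetical stand-in for a record yielded by a corpus loader.
    text: str
    topic: str


def topic_counts(records: Iterable[Record], limit: int) -> Counter:
    """Count topics among the first `limit` records without reading the rest."""
    # islice stops pulling from the underlying iterator after `limit` items.
    return Counter(record.topic for record in islice(records, limit))
```

Any iterable of objects with a `topic` attribute can be passed in, so the same helper works for a quick sanity check on a sample or, with a large `limit`, on a whole corpus.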

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 0 stars in the last 30 days
