Python SDK for Russian text corpora
This library provides Python functions for loading and parsing a wide variety of Russian language text corpora, totaling over 350GB. It is designed for researchers and developers working with Russian NLP tasks, offering convenient access to diverse datasets for training and evaluation.
How It Works
Corus acts as a unified interface to numerous Russian text datasets. It abstracts away the complexities of downloading, decompressing, and parsing different file formats (CSV, JSON, XML, TXT, etc.) and data structures. Each supported corpus has a dedicated loader function (e.g., load_lenta, load_wiki) that yields records, typically containing text, title, topic, and tags, enabling straightforward iteration and processing.
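The loader pattern described above can be illustrated with a minimal, self-contained sketch. It is not the Corus implementation: `NewsRecord`, `load_news`, and `preview` are hypothetical names, and the gzip-compressed CSV layout with `title`, `text`, `topic`, and `tags` columns is an assumption made only for illustration.

```python
import csv
import gzip
from dataclasses import dataclass
from itertools import islice
from typing import Iterator, List


@dataclass
class NewsRecord:
    # Hypothetical record type; the fields mirror those named in the text.
    title: str
    text: str
    topic: str
    tags: str


def load_news(path: str) -> Iterator[NewsRecord]:
    """Hypothetical loader: lazily yield one record per CSV row.

    Assumes a gzip-compressed CSV file with a header row containing
    the columns title, text, topic, and tags.
    """
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        for row in csv.DictReader(handle):
            yield NewsRecord(
                title=row["title"],
                text=row["text"],
                topic=row["topic"],
                tags=row["tags"],
            )


def preview(path: str, limit: int = 3) -> List[NewsRecord]:
    """Return the first `limit` records without reading the whole file."""
    return list(islice(load_news(path), limit))
```

Because the loader is a generator, a large archive is read one row at a time rather than loaded into memory. With Corus itself, the analogous call would presumably pass the path of a downloaded archive to a loader such as load_lenta; the project's own documentation is the authority on exact signatures and record fields.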
Quick Start & Requirements
pip install corus
wget commands or other methods.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
The corus library itself appears to be under a permissive license, but the underlying datasets have varying licenses. Users must adhere to the specific license terms for each corpus they download and use. Whether commercial use is permitted depends on the individual dataset licenses.
Limitations & Caveats
wget commands.