corus by natasha

Python SDK for Russian text corpora

Created 6 years ago

309 stars

Top 87.1% on SourcePulse

Project Summary

This library provides Python functions for loading and parsing a wide variety of Russian-language text corpora, totaling over 350 GB. It is aimed at researchers and developers working on Russian NLP tasks who need convenient access to diverse datasets for training and evaluation.

How It Works

Corus acts as a unified interface to numerous Russian text datasets. It abstracts away the complexities of downloading, decompressing, and parsing different file formats (CSV, JSON, XML, TXT, etc.) and data structures. Each supported corpus has a dedicated loader function (e.g., load_lenta, load_wiki) that yields records, typically containing text, title, topic, and tags, enabling straightforward iteration and processing.
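The loader pattern described above can be sketched with the standard library alone. This is a minimal illustration, not the corus implementation: `NewsRecord` and `load_news` are hypothetical names, and the column layout of the input file (title, text, topic, tags) is an assumption based on the fields named in the paragraph above.

```python
import csv
import gzip
from collections import namedtuple

# Hypothetical record type; the fields mirror those named in the text above.
NewsRecord = namedtuple("NewsRecord", ["title", "text", "topic", "tags"])


def load_news(path):
    """Lazily yield one NewsRecord per row of a gzip-compressed CSV file.

    Assumes the CSV has a header row with the columns
    title, text, topic and tags (an assumption for this sketch).
    """
    # Text mode with newline="" lets the csv module handle quoted newlines.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as file:
        for row in csv.DictReader(file):
            yield NewsRecord(
                title=row["title"],
                text=row["text"],
                topic=row["topic"],
                tags=row["tags"],
            )
```

Because the function is a generator, the file is decompressed and parsed row by row, so a caller can iterate over a multi-gigabyte archive without holding it in memory.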

Quick Start & Requirements

Install: pip install corus
Prerequisites: Python 3.5+ or PyPy 3.
Datasets are not bundled with the package. Most are large and must be downloaded separately, using the provided wget commands or other methods.
Documentation: https://natasha.github.io/corus/
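Once a dataset has been downloaded, it helps to inspect a few records before committing to a full pass. The sketch below is one way to do that. `preview` and `preview_lenta` are hypothetical helpers, and the sketch assumes that `load_lenta` takes the path of the downloaded archive and returns an iterator of records. The file name in the usage note is likewise an assumption.

```python
from itertools import islice


def preview(records, n=3):
    """Return the first n items of a (possibly very large) record iterator."""
    return list(islice(records, n))


def preview_lenta(path, n=3):
    """Preview the first n Lenta.ru records from a downloaded archive.

    Assumes corus is installed and that load_lenta(path) yields records.
    """
    # Imported lazily so the helper above works without corus installed.
    from corus import load_lenta

    return preview(load_lenta(path), n)
```

A call such as `preview_lenta("lenta-ru-news.csv.gz")` would then return a short list of records. `islice` stops after `n` items, so the rest of the archive is never read.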

Highlighted Details

Supports over 20 distinct Russian corpora, including news archives (Lenta.ru, RIA Novosti), fiction (Lib.rus.ec), social media (Twitter), Wikipedia, and specialized datasets for Named Entity Recognition (NER) and morphological analysis.
Provides loaders for datasets suited to a range of NLP tasks, such as text classification, NER, and training word embeddings.
Includes datasets with manual annotations for NER (factRuEval-2016, WiNER) and morphological analysis (Universal Dependencies).
Offers access to large-scale datasets like Taiga (489.62 GB) and Omnia Russica.

Maintenance & Community

Developed by the "natasha" team.
Support channels: Telegram (https://t.me/natural_language_processing) and GitHub Issues (https://github.com/natasha/corus/issues).
Clear contribution guidelines for adding new sources.

Licensing & Compatibility

The corus library itself appears to be under a permissive license, but the underlying datasets have varying licenses. Users must adhere to the specific license terms for each corpus they download and use. Compatibility for commercial use depends on the individual dataset licenses.

Limitations & Caveats

Many datasets are very large and require significant disk space and download time.
Some datasets require manual download steps beyond simple wget commands.
Documentation and dataset descriptions are primarily in Russian.
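Given the disk space caveat above, a quick check before starting a large download can save a failed transfer. This is a minimal sketch using only the standard library. `has_room` is a hypothetical helper, and the 10% safety margin is an arbitrary assumption.

```python
import shutil


def has_room(directory, required_bytes, headroom=0.1):
    """Return True if `directory` has room for required_bytes plus a margin.

    headroom is the extra fraction of required_bytes kept free
    (0.1 means 10%, an arbitrary default for this sketch).
    """
    free = shutil.disk_usage(directory).free
    return free >= required_bytes * (1 + headroom)
```

The compressed archive size is only a lower bound on what is needed if the data must also be unpacked, so it is safer to pass the larger of the two figures.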

Health Check

Last Commit: 2 years ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0

Star History

1 star in the last 30 days
