Chinese-Names-Corpus by wainshine

Corpus for Chinese names and name generation

Created 9 years ago

4,314 stars

Top 11.2% on SourcePulse

Project Summary

This repository provides a comprehensive corpus of Chinese names, designed for natural language processing tasks such as Chinese word segmentation and named entity recognition. It also includes name generation capabilities and datasets for English and Japanese names, catering to researchers and developers working with multilingual name data.

How It Works

The project leverages big data and NLP techniques, processing massive text datasets to extract and clean name entities. It builds a large-scale Chinese name knowledge graph with over 56 million entries, enriched with attributes like gender, age, and sentiment. The corpus is derived from extensive data cleaning of billions of names, aiming for high accuracy in NLP applications.

Quick Start & Requirements

Install: No specific installation instructions are provided; data is likely accessed directly from the repository.
Prerequisites: Standard Python environment for data processing.
Resources: Datasets range from thousands to millions of entries, requiring sufficient disk space.
Links: GitHub Repository
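
Since no installation steps are given, a minimal sketch of reading one of the datasets in plain Python might look like the following. It assumes the file is UTF-8 text with one name per line; the file name in the usage comment and the helpers `parse_names` and `load_names` are hypothetical, and the real files may carry header lines or a different layout that needs extra handling.

```python
from pathlib import Path


def parse_names(lines):
    """Return unique, non-empty names in first-seen order.

    Assumption: each element of `lines` holds exactly one name.
    """
    seen = set()
    names = []
    for line in lines:
        name = line.strip()
        if name and name not in seen:
            seen.add(name)
            names.append(name)
    return names


def load_names(path, encoding="utf-8-sig"):
    """Read a one-name-per-line text file (assumed layout).

    "utf-8-sig" also accepts plain UTF-8 and drops a leading BOM if present.
    """
    with Path(path).open(encoding=encoding) as handle:
        return parse_names(handle)


# Hypothetical usage; replace the path with a file from the repository:
# names = load_names("chinese_names.txt")
# print(len(names))
```

Separating parsing from file access keeps the cleaning step easy to check against a small in-memory sample before running it on a multi-million-entry file.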

Highlighted Details

Contains 1.2 million Chinese names, 250,000 ancient Chinese names, and 1,000 Chinese family names.
Includes 5,000+ Chinese relationship terms and 480,000 translated English names.
Features 180,000 Japanese names extracted from Wikipedia.
Offers a name generation tool and a Chinese idiom dictionary.
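
As an illustration of how the family-name list could support a task such as named entity recognition, the sketch below splits a full name into surname and given name by longest-prefix match. This is a generic technique rather than anything the repository is stated to provide; `split_name`, the `max_surname_len` parameter, and the tiny sample surname set are all assumptions made for the example.

```python
def split_name(full_name, surnames, max_surname_len=2):
    """Split a Chinese name into (surname, given_name) by longest prefix.

    Tries the longest candidate surname first so that compound surnames
    such as 欧阳 win over the single-character prefix 欧. Returns None
    when no known surname leaves at least one character for the given name.
    """
    longest = min(max_surname_len, len(full_name) - 1)
    for length in range(longest, 0, -1):
        prefix = full_name[:length]
        if prefix in surnames:
            return prefix, full_name[length:]
    return None


# Tiny illustrative sample; in practice the set would be loaded from
# the family-name dataset.
sample_surnames = {"欧阳", "欧", "王"}
print(split_name("欧阳修", sample_surnames))
print(split_name("王小明", sample_surnames))
```

Longest-prefix matching is only a heuristic: a name whose single-character surname happens to start a compound surname would be split incorrectly, so results are best treated as candidates rather than ground truth.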

Maintenance & Community

The project was last updated on March 27, 2024. The primary contributor is "@萌名NameMoe". According to the README, the project is maintained as a way to learn NLP, knowledge graph (KG), and AI technologies.

Licensing & Compatibility

The repository does not explicitly state a license. Users who repost the data on Chinese domestic platforms are asked to set the download cost to 0 points (0积分) and to retain the GitHub link.

Limitations & Caveats

Although the data has been cleaned, the README notes that several datasets, including the Chinese relationship terms and the translated English names, still contain "badcase" (erroneous) entries. The project is primarily a data resource and offers little explicit tooling beyond the name generator.

Health Check

Last Commit: 8 months ago
Responsiveness: Inactive

Star History

16 stars in the last 30 days