chinese_text_normalization by speechio

Chinese text normalization for speech processing

Created 6 years ago

718 stars

Top 47.9% on SourcePulse

Project Summary

How It Works

The project implements several normalization types: Non-Standard Word (NSW) normalization for categories such as dates, numbers, and money; punctuation removal; and English word case conversion. NSW normalization is driven by regular expressions, while punctuation removal relies on predefined lists for both Chinese and English. Input can be plain text, a Kaldi archive, or TSV, so the tool fits common ASR data pipelines.
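The three steps described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not the project's code: the names `DIGITS`, `PUNCTUATION`, `read_digits`, and `normalize` are hypothetical, the punctuation set is a small stand-in for the curated lists, and numbers are read digit by digit, which is a simplification.

```python
import re

# Hypothetical digit table; the project's own tables may differ.
DIGITS = "零一二三四五六七八九"

# Small illustrative punctuation set, not the project's curated lists.
PUNCTUATION = set(",。!?、;:,.!?;:")


def read_digits(s: str) -> str:
    """Read an ASCII digit string one digit at a time, e.g. '10086' -> '一零零八六'."""
    return "".join(DIGITS[int(ch)] for ch in s)


def normalize(text: str, to_upper: bool = True) -> str:
    # 1) NSW normalization with regular expressions.
    #    Percentages: '5%' -> '百分之五' (digit-by-digit reading, a simplification).
    text = re.sub(r"([0-9]+)%", lambda m: "百分之" + read_digits(m.group(1)), text)
    #    Any remaining digit runs are also read digit by digit.
    text = re.sub(r"[0-9]+", lambda m: read_digits(m.group(0)), text)
    # 2) Punctuation removal from a predefined set.
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    # 3) English case conversion to match the lexicon convention.
    return text.upper() if to_upper else text.lower()


if __name__ == "__main__":
    print(normalize("增长5%,ok。"))
    print(normalize("电话10086", to_upper=False))
```

The order matters in this sketch: category-specific patterns such as percentages run before the generic digit rule, so the more specific reading wins.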

Quick Start & Requirements

Installation: Requires Python 3. The run.sh script in the TN directory provides usage examples.
Dependencies: Inverse Text Normalization (ITN) requires thrax to be installed, with its binaries on the system PATH. The README also mentions a Makefile dependency for the thrax grammar.
Input Format: All input text must be UTF-8 encoded.

Highlighted Details

Supports normalization of cardinal numbers, dates, digits, fractions, money, percentages, and telephone numbers.
Includes punctuation removal for both Chinese and English, using curated lists of stop and non-stop punctuation.
Handles English word case conversion, adapting to ASR/TTS lexicon conventions.
Accepts input in plain text (one sentence per line), Kaldi archive (.ark), and TSV formats.
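As a rough sketch of how the three input formats could be handled uniformly, the helper below yields `(key, text)` pairs from each. The function name `iter_records` and the format labels are hypothetical, and the column layout is an assumption: the Kaldi-style line is taken to be `<key> <text>` separated by whitespace, and the TSV line is taken to hold the key in the first column and the text in the second. The project's actual readers may differ.

```python
from typing import Iterable, Iterator, Optional, Tuple


def iter_records(lines: Iterable[str], fmt: str = "text") -> Iterator[Tuple[Optional[str], str]]:
    """Yield (key, text) pairs; key is None for plain text input."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        if fmt == "text":
            # Plain text: one sentence per line, no key.
            yield None, line
        elif fmt == "ark":
            # Assumed Kaldi text layout: '<key> <text>'.
            parts = line.split(maxsplit=1)
            yield parts[0], parts[1].strip() if len(parts) > 1 else ""
        elif fmt == "tsv":
            # Assumed TSV layout: key in column 1, text in column 2.
            parts = line.split("\t", 1)
            yield parts[0], parts[1] if len(parts) > 1 else ""
        else:
            raise ValueError(f"unknown format: {fmt}")


if __name__ == "__main__":
    # Input is expected to be UTF-8, so files would be opened with encoding="utf-8".
    for key, text in iter_records(["utt1 今天 是 3月\n"], fmt="ark"):
        print(key, text)
```

Keeping the key separate from the text lets a normalizer rewrite only the sentence and write the key back unchanged.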

Maintenance & Community

The repository's author indicates that updates may be infrequent, as the current state is sufficient for their purposes. Suggested future improvements include refining the NSW regular expressions and extending the ITN grammars. The project credits Zhiyang Zhou for the NSW normalization code and points to research by Richard Sproat and Kyle Gorman for model-based approaches.

Licensing & Compatibility

The repository's license is not explicitly stated in the provided README.

Limitations & Caveats

The NSW normalizers are based on regular expressions, so they can produce unintended matches and may need refinement. The author notes that a mixed rule-based and model-based system represents the state of the art in text normalization, which points to possible future development beyond the current rule-based approach.
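To make the "unintended match" risk concrete, the sketch below compares a loose date pattern with a stricter one. Both patterns are hypothetical examples written for this illustration and are not taken from the project.

```python
import re

# Loose pattern: any three digit groups joined by hyphens looks like a date.
NAIVE_DATE = re.compile(r"[0-9]+-[0-9]+-[0-9]+")

# Stricter pattern: 4-digit year, month 1-12, day 1-31, not embedded in a longer number.
STRICT_DATE = re.compile(
    r"(?<![0-9])([0-9]{4})-(0?[1-9]|1[0-2])-(0?[1-9]|[12][0-9]|3[01])(?![0-9])"
)

phone = "请拨打010-8888-6666"
date = "会议在2024-01-15举行"

# The loose pattern wrongly treats the phone number as a date.
print(NAIVE_DATE.search(phone) is not None)   # True
# The stricter pattern rejects the phone number but still finds the real date.
print(STRICT_DATE.search(phone) is None)      # True
print(STRICT_DATE.search(date).group(0))      # 2024-01-15
```

Tightening a pattern this way reduces false matches but cannot remove them entirely, which is one reason rule-based systems are often paired with a model.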

Health Check

Last Commit: 2 years ago
Responsiveness: Inactive
Star History: 4 stars in the last 30 days