chinese_text_normalization by speechio

Chinese text normalization for speech processing

Created 6 years ago
709 stars

Top 48.2% on SourcePulse

Project Summary

The chinese_text_normalization repository provides a ready-to-use module for normalizing Chinese text, tailored for Automatic Speech Recognition (ASR) post-processing pipelines. It addresses the scarcity of accessible, language-specific text normalization tools by covering the common normalization tasks, which benefits developers and researchers in Chinese speech processing.

How It Works

The project implements several normalization types, including Non-Standard Word (NSW) normalization for categories like dates, numbers, and money, as well as punctuation removal and English word case conversion. NSW normalization leverages regular expressions, while punctuation removal utilizes predefined lists for both Chinese and English. The system is designed for flexibility, supporting plain text, Kaldi archive, and TSV formats, ensuring compatibility with various ASR data pipelines.
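The three steps described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the repository's code: the names `DIGITS`, `PUNCTUATION`, `cardinal`, and `normalize` are hypothetical, the punctuation set is a small stand-in for the project's own lists, and percentages are handled only for values from 0 to 99.

```python
import re

# Hypothetical digit table and punctuation set; the project ships its own lists.
DIGITS = "零一二三四五六七八九"
PUNCTUATION = ",.!?;:，。！？；：、"


def cardinal(n: int) -> str:
    """Spell an integer from 0 to 99 as a Chinese cardinal (sketch only)."""
    if n < 10:
        return DIGITS[n]
    tens, ones = divmod(n, 10)
    head = "十" if tens == 1 else DIGITS[tens] + "十"
    return head + (DIGITS[ones] if ones else "")


def normalize(text: str) -> str:
    # 1. NSW step: rewrite one- or two-digit percentages, e.g. "5%" -> "百分之五".
    text = re.sub(
        r"(?<!\d)(\d{1,2})%",
        lambda m: "百分之" + cardinal(int(m.group(1))),
        text,
    )
    # 2. NSW step: read any remaining digits one at a time, e.g. "110" -> "一一零".
    text = re.sub(r"\d", lambda m: DIGITS[int(m.group(0))], text)
    # 3. Remove punctuation found in the predefined set.
    text = text.translate(str.maketrans("", "", PUNCTUATION))
    # 4. Convert English words to a single case (upper case is an assumption here).
    return text.upper()


print(normalize("增长了5%, ok!"))
print(normalize("电话110"))
```

Ordering matters in a pipeline like this: the percentage rule has to run before the digit-by-digit rule, otherwise the digits are consumed before the more specific pattern can see them.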

Quick Start & Requirements

  • Installation: Requires Python 3. The run.sh script in the TN directory can be used for examples.
  • Dependencies: For Inverse Text Normalization (ITN), thrax must be installed, and its binaries must be in the system's PATH. A Makefile dependency for thrax grammar is also mentioned.
  • Input Format: All input text must be UTF-8 encoded.

Highlighted Details

  • Supports normalization of cardinal numbers, dates, digits, fractions, money, percentages, and telephone numbers.
  • Includes punctuation removal for both Chinese and English, using curated lists of stop and non-stop punctuation.
  • Handles English word case conversion, adapting to ASR/TTS lexicon conventions.
  • Accepts input in plain text (one sentence per line), Kaldi archive (.ark), and TSV formats.
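A reader for the three input formats might look like the sketch below. It assumes that a Kaldi-style text archive line is an utterance key followed by whitespace and the text, and that a TSV row holds the key in the first column and the text in the second; both layouts, and the name `read_items`, are assumptions made for illustration rather than the repository's actual interface.

```python
import csv
from typing import Iterable, Iterator, Optional, Tuple

Item = Tuple[Optional[str], str]  # (key or None, text)


def read_items(lines: Iterable[str], fmt: str = "text") -> Iterator[Item]:
    """Yield (key, text) pairs from plain text, Kaldi-style, or TSV lines."""
    if fmt == "tsv":
        # Assumed layout: key <TAB> text. Quotes are treated as literal characters.
        reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) >= 2:
                yield row[0], row[1]
        return
    if fmt not in ("text", "ark"):
        raise ValueError(f"unknown format: {fmt}")
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        if fmt == "text":
            # One sentence per line, no key.
            yield None, line
        else:
            # Assumed layout: key, whitespace, then the rest of the line as text.
            parts = line.split(maxsplit=1)
            text = parts[1] if len(parts) > 1 else ""
            yield parts[0], text


print(list(read_items(["utt1 今天是2023年", "utt2 你好"], fmt="ark")))
```

When reading from disk, opening files with `open(path, encoding="utf-8")` keeps the sketch in line with the UTF-8 input requirement.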

Maintenance & Community

The repository's author indicates that updates may be infrequent, as the current state is sufficient for their purposes. Suggested future improvements are refining the NSW regular expressions and extending the ITN grammars. The project credits Zhiyang Zhou for the NSW normalization code and points to research by Richard Sproat and Kyle Gorman for model-based approaches.

Licensing & Compatibility

The repository's license is not explicitly stated in the provided README.

Limitations & Caveats

The NSW normalizers are based on regular expressions, which may lead to unintended matches requiring refinement. The author notes that a mixed rule-based and model-based system represents the state-of-the-art in text normalization, suggesting potential areas for future development beyond the current rule-based approach.
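The kind of unintended match mentioned above is easy to reproduce. The sketch below uses two hypothetical date patterns (neither is taken from the repository) to show how an unguarded regular expression fires inside a longer serial number, and how lookaround guards narrow it.

```python
import re

# Naive pattern: any YYYY-M-D shaped run, even inside a longer token.
naive_date = re.compile(r"\d{4}-\d{1,2}-\d{1,2}")

# Guarded pattern: refuse matches that touch another digit or hyphen on either side.
guarded_date = re.compile(r"(?<![\d-])\d{4}-\d{1,2}-\d{1,2}(?![\d-])")

serial = "型号2023-10-0199"    # a product serial number, not a date
release = "2023-10-01发布"     # a genuine date

print(naive_date.findall(serial))     # the naive pattern reports a date here
print(guarded_date.findall(serial))   # the guarded pattern reports nothing
print(guarded_date.findall(release))  # the genuine date is still found
```

Guards like these reduce false positives but cannot resolve cases that depend on meaning rather than shape, which is where the model-based approaches the author mentions come in.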

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 30 days
