Chinese text normalization for speech processing
Top 48.2% on SourcePulse
The chinese_text_normalization repository provides a ready-to-use module for normalizing Chinese text, specifically tailored for Automatic Speech Recognition (ASR) post-processing pipelines. It addresses the scarcity of accessible, language-specific text normalization tools by offering a robust solution for common normalization tasks, benefiting developers and researchers in Chinese speech processing.
How It Works
The project implements several normalization types, including Non-Standard Word (NSW) normalization for categories like dates, numbers, and money, as well as punctuation removal and English word case conversion. NSW normalization leverages regular expressions, while punctuation removal utilizes predefined lists for both Chinese and English. The system is designed for flexibility, supporting plain text, Kaldi archive, and TSV formats, ensuring compatibility with various ASR data pipelines.
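The three steps described above (regex-based NSW rewriting, list-based punctuation removal, and English case conversion) can be illustrated with a minimal sketch. This is not the repository's code: the names `DIGITS`, `PUNCTUATION`, `DATE_RE`, `small_number`, and `normalize` are hypothetical, and only a single date pattern is handled, whereas the project covers more NSW categories.

```python
import re

# Hypothetical digit table for illustration; the repository's tables may differ.
DIGITS = "零一二三四五六七八九"

# Hypothetical punctuation list covering a few Chinese and English marks.
PUNCTUATION = "，。！？；：、,.!?;:"

# Assumed date pattern: four-digit year, then month and day, e.g. 2023年12月25日.
DATE_RE = re.compile(r"([0-9]{4})年([0-9]{1,2})月([0-9]{1,2})日")


def small_number(n: int) -> str:
    """Read an integer in the range 0..99 as a Chinese number."""
    if n < 10:
        return DIGITS[n]
    tens, ones = divmod(n, 10)
    head = "" if tens == 1 else DIGITS[tens]  # 12 is read 十二, not 一十二
    tail = DIGITS[ones] if ones else ""
    return head + "十" + tail


def _date_repl(match: re.Match) -> str:
    """Read the year digit by digit and the month and day as numbers."""
    year = "".join(DIGITS[int(ch)] for ch in match.group(1))
    month = small_number(int(match.group(2)))
    day = small_number(int(match.group(3)))
    return f"{year}年{month}月{day}日"


def normalize(text: str, to_upper: bool = True) -> str:
    """Apply NSW date rewriting, punctuation removal, then case conversion."""
    text = DATE_RE.sub(_date_repl, text)
    text = text.translate(str.maketrans("", "", PUNCTUATION))
    return text.upper() if to_upper else text.lower()


result = normalize("今天是2023年12月25日，Hello!")
print(result)
```

Running the NSW step before punctuation removal matters in a design like this, since some patterns (decimals or times, for example) rely on punctuation that the later step would strip.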
Quick Start & Requirements
The run.sh script in the TN directory can be used for examples.
thrax must be installed, and its binaries must be in the system's PATH. A Makefile dependency for the thrax grammar is also mentioned.
Highlighted Details
Supports plain text, Kaldi archive (.ark), and TSV formats.
Maintenance & Community
The repository's author indicates that updates may be infrequent, as the current state is sufficient for their purposes. Suggested future improvements include refining the NSW regular expressions and extending the ITN grammars. The project acknowledges Zhiyang Zhou's work on the NSW normalization code and points to research by Richard Sproat and Kyle Gorman for model-based approaches.
Licensing & Compatibility
The repository's license is not explicitly stated in the provided README.
Limitations & Caveats
The NSW normalizers are based on regular expressions, which may lead to unintended matches requiring refinement. The author notes that a mixed rule-based and model-based system represents the state-of-the-art in text normalization, suggesting potential areas for future development beyond the current rule-based approach.
2 years ago
Inactive