Chinese dialog cleaning framework
This framework provides a multi-threaded solution for cleaning Chinese dialogue data from platforms like Zhihu, Weibo, and Tieba. It targets researchers and developers working with large-scale conversational datasets, offering a configurable pipeline to remove noise, PII, and repetitive content, thereby improving data quality for downstream NLP tasks.
How It Works
The system employs a modular, rule-based approach executed across multiple threads. It loads data via custom inputters, applies a series of configurable cleaning rules (including blacklist filtering, PII removal, URL stripping, deduplication, and ad/generic response filtering), and utilizes a thread pool for parallel processing. Rules can either modify sentences in-place or discard entire dialogues/utterances if the noise cannot be removed, with options to segment multi-turn conversations.
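The README does not document the rule interface, so the following Python sketch is only an illustration of the architecture described above; every name in it (Rule, clean_dialogue, the toy blacklist) is hypothetical rather than the project's actual API:

```python
import re
from multiprocessing import Pool
from typing import Callable, List, Optional

# Hypothetical rule signature: a rule either returns the (possibly
# modified) sentence, or None when the noise cannot be removed.
Rule = Callable[[str], Optional[str]]

BLACKLIST = {"加微信", "代购"}  # toy blacklist for illustration only

def strip_urls(sentence: str) -> Optional[str]:
    # In-place style rule: rewrite the sentence and keep it.
    return re.sub(r"https?://\S+", "", sentence).strip()

def drop_blacklisted(sentence: str) -> Optional[str]:
    # Discarding rule: reject utterances containing blacklisted phrases.
    return None if any(word in sentence for word in BLACKLIST) else sentence

RULES: List[Rule] = [strip_urls, drop_blacklisted]  # application order matters

def clean_dialogue(dialogue: List[str]) -> Optional[List[str]]:
    """Apply every rule to every utterance; drop the whole dialogue
    if any utterance contains noise that cannot be removed."""
    cleaned = []
    for utterance in dialogue:
        for rule in RULES:
            result = rule(utterance)
            if result is None:
                return None  # unrecoverable noise: discard the dialogue
            utterance = result
        cleaned.append(utterance)
    return cleaned

if __name__ == "__main__":
    dialogues = [["你好 https://t.cn/abc", "想了解一下"], ["代购了解一下"]]
    with Pool(processes=4) as pool:  # pool of parallel workers
        results = pool.map(clean_dialogue, dialogues)
    print([d for d in results if d is not None])  # [['你好', '想了解一下']]
```

The two return conventions mirror the distinction described above: rules that can repair a sentence rewrite it in place, while rules that detect unremovable noise reject the utterance and take the whole dialogue with it.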
Quick Start & Requirements
Run the full pipeline and keep a log of the run:

```bash
bash ./scripts/run.sh 2>&1 | tee -a cleaning.log
```

Dependencies for some rules (e.g., bert_clean, cleantext_clean) are not explicitly detailed but may require additional libraries. Blacklists can be sourced from external repositories like fighting41love/funNLP. Key configuration parameters are n_p (processes), batch_size, tool_dir (for blacklists), out_dir, raw_dir, and dirty_dir.
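The README names these parameters but not the configuration format; as a rough sketch under that assumption, they might be collected as follows (the CleaningConfig dataclass and all default values are invented, only the field names come from the summary above):

```python
from dataclasses import dataclass

@dataclass
class CleaningConfig:  # hypothetical container for the documented parameters
    n_p: int = 8                     # number of parallel processes
    batch_size: int = 10_000         # dialogues processed per batch
    tool_dir: str = "./tools"        # location of blacklists and other resources
    raw_dir: str = "./data/raw"      # input dialogues to clean
    out_dir: str = "./data/clean"    # cleaned dialogues are written here
    dirty_dir: str = "./data/dirty"  # discarded noisy dialogues, kept for inspection

config = CleaningConfig(n_p=16, batch_size=5_000)
```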
Highlighted Details
Maintenance & Community
The project is described as "still quite rudimentary" and welcomes bug reports and optimizations. The author plans to add more comments and citations. No community channels or contributor information are provided in the README; the repository was last updated roughly four years ago and is marked inactive.
Licensing & Compatibility
The license is not specified in the README. Compatibility for commercial use or closed-source linking is therefore undetermined.
Limitations & Caveats
The framework is explicitly stated to be in early development ("still quite rudimentary," "code is still being improved"). The effectiveness of some deduplication rules (e.g., phrase deduplication) is noted as needing optimization. The order of rule application is critical and may impact results.
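To make the phrase-deduplication caveat concrete, a naive rule of that kind might collapse back-to-back repetitions as below. This is a sketch with invented thresholds, not the project's implementation, and it also illustrates why application order matters: running it before or after other rewrites changes what counts as a repeat.

```python
import re

def dedup_repeated_phrases(sentence: str, max_repeats: int = 2) -> str:
    """Collapse any phrase of up to 6 characters that repeats
    back-to-back more than max_repeats times."""
    pattern = re.compile(r"(.{1,6}?)\1{%d,}" % max_repeats)
    return pattern.sub(lambda m: m.group(1) * max_repeats, sentence)

# '哈' and '好的' are each collapsed to two repetitions:
print(dedup_repeated_phrases("哈哈哈哈哈哈好的好的好的好的"))  # 哈哈好的好的
```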