clean-dialog  by lemon234071

Chinese dialog cleaning framework

created 4 years ago
274 stars

Top 94.3% on SourcePulse

Project Summary

This framework provides a multi-threaded solution for cleaning Chinese dialogue data from platforms like Zhihu, Weibo, and Tieba. It targets researchers and developers working with large-scale conversational datasets, offering a configurable pipeline to remove noise, PII, and repetitive content, thereby improving data quality for downstream NLP tasks.

How It Works

The system employs a modular, rule-based approach executed across multiple threads. It loads data via custom inputters, applies a series of configurable cleaning rules (including blacklist filtering, PII removal, URL stripping, deduplication, and ad/generic response filtering), and utilizes a thread pool for parallel processing. Rules can either modify sentences in-place or discard entire dialogues/utterances if the noise cannot be removed, with options to segment multi-turn conversations.
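As a rough illustration of the flow described above, the sketch below chains rules that either rewrite an utterance or reject it, then runs dialogues through a thread pool. It is not the project's code: the names Rule, strip_urls, reject_blacklisted, clean_dialog, clean_corpus, and n_workers, as well as the tiny blacklist, are all hypothetical.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional

# A rule takes one utterance and returns the cleaned text, or None when the
# utterance cannot be salvaged. (Hypothetical interface, not the project's.)
Rule = Callable[[str], Optional[str]]

# ASCII-only URL pattern so adjacent Chinese characters are not swallowed.
URL_RE = re.compile(r"https?://[A-Za-z0-9./?=&%_#-]+")
BLACKLIST = {"spamword"}  # placeholder; real blacklists would be loaded from disk


def strip_urls(utterance: str) -> Optional[str]:
    """Modify in place: remove URLs and trim surrounding whitespace."""
    return URL_RE.sub("", utterance).strip()


def reject_blacklisted(utterance: str) -> Optional[str]:
    """Discard: return None if any blacklisted term occurs."""
    return None if any(word in utterance for word in BLACKLIST) else utterance


def clean_dialog(dialog: List[str], rules: List[Rule]) -> Optional[List[str]]:
    """Apply the rules in order; drop the whole dialogue if any utterance fails."""
    cleaned = []
    for utterance in dialog:
        for rule in rules:
            utterance = rule(utterance)
            if not utterance:  # None (rejected) or empty after cleaning
                return None
        cleaned.append(utterance)
    return cleaned


def clean_corpus(dialogs: List[List[str]], rules: List[Rule],
                 n_workers: int = 4) -> List[List[str]]:
    """Clean dialogues in parallel with a thread pool, keeping the survivors."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(lambda d: clean_dialog(d, rules), dialogs)
        return [r for r in results if r is not None]
```

This sketch discards a whole dialogue when one utterance fails; the framework also offers segmenting the conversation instead, which would replace the early return in clean_dialog.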

Quick Start & Requirements

  • Install/Run: Execute the following, which also appends stdout and stderr to cleaning.log: bash ./scripts/run.sh 2>&1 | tee -a cleaning.log
  • Prerequisites: Python, multi-threading support. The dependencies of certain cleaning functions (e.g., bert_clean, cleantext_clean) are not listed, so these functions may require additional libraries. Blacklists can be sourced from external repositories like fighting41love/funNLP.
  • Configuration: Key parameters include n_p (processes), batch_size, tool_dir (for blacklists), out_dir, raw_dir, and dirty_dir.
  • Documentation: Test cases and expected outputs are planned for each function.
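The configuration parameters listed above could be exposed roughly as follows. This is only a sketch: the parameter names come from the summary, but the "--" flag spelling, the default values, the help texts (including the guess that dirty_dir holds discarded data), and the batches helper are assumptions rather than the project's actual interface.

```python
import argparse
from typing import Iterator, List


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the listed options (defaults are illustrative)."""
    parser = argparse.ArgumentParser(description="Dialogue cleaning options (sketch)")
    parser.add_argument("--n_p", type=int, default=4,
                        help="number of parallel workers")
    parser.add_argument("--batch_size", type=int, default=10000,
                        help="number of dialogues handed to a worker at once")
    parser.add_argument("--tool_dir", default="./tool_data",
                        help="directory holding blacklists")
    parser.add_argument("--raw_dir", default="./data/raw",
                        help="directory with the uncleaned input")
    parser.add_argument("--out_dir", default="./data/clean",
                        help="directory for cleaned output")
    parser.add_argument("--dirty_dir", default="./data/dirty",
                        help="assumed here to hold discarded data for inspection")
    return parser


def batches(items: List, batch_size: int) -> Iterator[List]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--n_p", "8", "--batch_size", "2"])
```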

Highlighted Details

  • Supports a wide array of cleaning rules, including PII masking (names to NAME1, NAME2), URL removal, emoji handling, and various forms of deduplication (phrase, context, dialogue).
  • Offers options to remove advertisements, generic responses, and short/long utterances.
  • Includes specific rules for cleaning Weibo reposts and mentions.
  • Provides functionality to segment multi-turn dialogues based on problematic utterances.
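Two of the rules above, name masking and segmentation, can be sketched as below. The functions mask_names and split_on_bad, the known_names list, and the min_turns threshold are hypothetical; the summary does not say how the framework detects names or decides where to cut, so a fixed name list and a caller-supplied predicate stand in for both.

```python
import re
from typing import Callable, Dict, List


def mask_names(dialog: List[str], known_names: List[str]) -> List[str]:
    """Replace each distinct known name with NAME1, NAME2, ... per dialogue."""
    if not known_names:
        return list(dialog)
    # Try longer names first so a long name wins over a shorter prefix.
    ordered = sorted(known_names, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in ordered))
    placeholders: Dict[str, str] = {}

    def replace(match: "re.Match[str]") -> str:
        name = match.group(0)
        if name not in placeholders:
            placeholders[name] = f"NAME{len(placeholders) + 1}"
        return placeholders[name]

    return [pattern.sub(replace, utterance) for utterance in dialog]


def split_on_bad(dialog: List[str], is_bad: Callable[[str], bool],
                 min_turns: int = 2) -> List[List[str]]:
    """Cut a dialogue at problematic utterances, keeping long-enough segments."""
    segments: List[List[str]] = []
    current: List[str] = []
    for utterance in dialog:
        if is_bad(utterance):
            if len(current) >= min_turns:
                segments.append(current)
            current = []  # the bad utterance itself is dropped
        else:
            current.append(utterance)
    if len(current) >= min_turns:
        segments.append(current)
    return segments
```

Numbering placeholders by first appearance within a dialogue keeps references to the same person consistent across turns while removing the name itself.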

Maintenance & Community

The project is described as "still quite rudimentary" and welcomes bug reports and optimizations. The author plans to add more comments and citations. No specific community channels or contributor information are provided in the README.

Licensing & Compatibility

The license is not specified in the README. Compatibility for commercial use or closed-source linking is therefore undetermined.

Limitations & Caveats

The framework is explicitly stated to be in early development ("still quite rudimentary," "code is still being improved"). The effectiveness of some deduplication rules (e.g., phrase deduplication) is noted as needing optimization. The order of rule application is critical and may impact results.
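A small, hypothetical example shows why rule order matters: a minimum-length filter gives different results depending on whether it runs before or after URL stripping. The rules strip_urls and drop_short and the helper apply_rules are illustrative and not taken from the project.

```python
import re
from typing import Callable, List, Optional

# ASCII-only URL pattern so adjacent Chinese characters are left alone.
URL_RE = re.compile(r"https?://[A-Za-z0-9./?=&%_#-]+")


def strip_urls(utterance: str) -> Optional[str]:
    """Remove URLs and trim surrounding whitespace."""
    return URL_RE.sub("", utterance).strip()


def drop_short(utterance: str, min_len: int = 2) -> Optional[str]:
    """Reject utterances shorter than min_len characters."""
    return utterance if len(utterance) >= min_len else None


def apply_rules(utterance: Optional[str],
                rules: List[Callable[[str], Optional[str]]]) -> Optional[str]:
    """Run rules in the given order, stopping once an utterance is rejected."""
    for rule in rules:
        if utterance is None:
            break
        utterance = rule(utterance)
    return utterance


text = "好 http://t.cn/abc"
strip_then_filter = apply_rules(text, [strip_urls, drop_short])
filter_then_strip = apply_rules(text, [drop_short, strip_urls])
```

With URL stripping first, only one character remains and the length filter rejects the utterance (strip_then_filter is None). With the filter first, the URL inflates the length, the utterance passes, and a one-character utterance survives (filter_then_strip is "好").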

Health Check

  • Last commit: 4 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 30 days
