AIGC_text_detector by YuchuanTian

Research paper code for AIGC text detection using positive-unlabeled learning

created 2 years ago
291 stars

Top 91.6% on sourcepulse

Project Summary

This repository provides the official code for "Multiscale Positive-Unlabeled Detection of AI-Generated Texts," an ICLR'24 Spotlight paper. It addresses the challenge of detecting AI-generated text with a novel Positive-Unlabeled (PU) learning approach, offering a high-accuracy detector for researchers and developers working on AI content authenticity and security.

How It Works

The project implements a multiscale Positive-Unlabeled (PU) detection strategy. PU learning targets scenarios where only positive (AI-generated) and unlabeled data carry reliable labels; here, short machine-generated texts are treated as partially unlabeled during training, since brief AI outputs are often indistinguishable from human writing. The approach fine-tunes models such as RoBERTa and BERT, combining dedicated training strategies with data preprocessing (e.g., redundant-space removal) to improve detection across text granularities (full documents vs. single sentences).
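As a rough illustration of the PU idea (not the repository's exact objective), the sketch below implements a non-negative PU risk estimator in PyTorch, with AI-generated text as the positive class and short texts supplying the unlabeled pool; the function name, loss form, and class-prior value are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def pu_risk(logits_pos, logits_unl, prior=0.3):
    """Non-negative PU risk estimator (nnPU-style) as a stand-in for the
    paper's multiscale PU objective.

    logits_pos: classifier scores for texts labeled AI-generated (positive)
    logits_unl: classifier scores for unlabeled texts (e.g., short samples)
    prior:      assumed fraction of positives among the unlabeled pool
    """
    # Sigmoid surrogate losses: softplus(-z) penalizes predicting "negative",
    # softplus(z) penalizes predicting "positive".
    risk_pos_as_pos = F.softplus(-logits_pos).mean()
    risk_pos_as_neg = F.softplus(logits_pos).mean()
    risk_unl_as_neg = F.softplus(logits_unl).mean()

    # Estimate the negative-class risk from the unlabeled data, then clamp at
    # zero so the estimate cannot go negative (the "non-negative" trick).
    negative_risk = risk_unl_as_neg - prior * risk_pos_as_neg
    return prior * risk_pos_as_pos + torch.clamp(negative_risk, min=0.0)
```

In the paper's multiscale formulation, how a sample is treated depends on its length rather than on a single fixed prior; this sketch keeps only the basic PU mechanics.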

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt
  • Place the datasets under ./data and download NLTK's punkt tokenizer.
  • Training requires NVIDIA GPU with CUDA.
  • Official HuggingFace demos and APIs are available (see the loading sketch after this list).
  • Paper: https://arxiv.org/pdf/2305.18149.pdf
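
A minimal Python sketch of the setup steps above, assuming a transformers-compatible checkpoint; the model identifier is a placeholder to be replaced with the checkpoint named in the repository's README.

```python
import nltk
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

nltk.download("punkt")  # sentence tokenizer used during preprocessing

# Placeholder id: substitute the detector checkpoint listed in the README.
model_id = "<detector-checkpoint-from-README>"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "Paragraph to score for AI authorship."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print(probs)  # label order (human vs. AI) depends on the checkpoint config
```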

Highlighted Details

  • Achieves 98.40% accuracy on HC3-Full-En and 85.31% on HC3-Sent-En with RoBERTa (seed average).
  • Offers both English (env2) and Chinese (zhv2) detectors, with the latter matching SOTA closed-source performance.
  • Includes data cleaning scripts and pre-trained models on HuggingFace.
  • Utilizes a dual_softmax_dyn_dtrun PU learning strategy with data augmentation (see the sketch after this list).
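
The sketch below illustrates one plausible form of that augmentation, in the spirit of the paper's multiscaling of training texts by random sentence deletion; the function name, deletion probability, and exact procedure are assumptions rather than the repository's implementation.

```python
import random
from nltk.tokenize import sent_tokenize  # requires NLTK's punkt package

def multiscale_augment(text, delete_prob=0.25, seed=None):
    """Randomly drop sentences to produce a shorter-scale training sample."""
    rng = random.Random(seed)
    sentences = sent_tokenize(text)
    if not sentences:
        return text
    kept = [s for s in sentences if rng.random() > delete_prob]
    if not kept:  # keep at least one sentence so the sample stays non-empty
        kept = [rng.choice(sentences)]
    return " ".join(kept)
```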

Maintenance & Community

The project is associated with the ICLR'24 conference and has released models on HuggingFace. Future updates are planned to align with the latest LLMs.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. It acknowledges referencing OpenAI's gpt-2-output-dataset repository, but that does not establish a license for this project's own code and models; commercial use would require explicit confirmation from the authors.

Limitations & Caveats

The README notes that experiments from the paper should use the original HC3 dataset; the cleaned version is provided separately. The license for the code and models is not clearly defined, which may impact commercial adoption.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 52 stars in the last 90 days
