Research paper code for AIGC text detection using positive-unlabeled learning
Top 91.6% on sourcepulse
This repository provides official code for "Multiscale Positive-Unlabeled Detection of AI-Generated Texts," an ICLR'24 Spotlight paper. It addresses the challenge of detecting AI-generated text using a novel Positive-Unlabeled (PU) learning approach, offering high accuracy for researchers and developers working on AI content authenticity and security.
How It Works
The project implements a multiscale Positive-Unlabeled detection strategy. It leverages PU learning, a technique suitable for scenarios where only positive (AI-generated) and unlabeled data are available, by treating human-generated text as unlabeled. The approach uses models like RoBERTa and BERT, with specific training strategies and data preprocessing (like redundant space removal) to enhance detection performance across different text granularities (full documents vs. sentences).
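To make the PU idea concrete, the following is a minimal sketch of a non-negative PU risk estimator in the style of Kiryo et al. (2017), where only AI-generated texts carry positive labels and the human-written corpus is treated as unlabeled. This is a generic illustration, not the repository's multiscale implementation, and the names `nn_pu_risk`, `logistic_loss`, and `prior` are hypothetical:

```python
import math

def logistic_loss(z):
    # log(1 + e^{-z}), written in a numerically stable form
    return math.log1p(math.exp(-abs(z))) + max(-z, 0.0)

def nn_pu_risk(scores_pos, scores_unl, prior):
    """Non-negative PU risk for a binary scorer (sketch).

    scores_pos: classifier scores for known AI-generated (positive) texts.
    scores_unl: scores for unlabeled texts (e.g. a human-written corpus
                that may still contain some AI-generated positives).
    prior: assumed fraction of positives hidden in the unlabeled set.
    """
    r_p_pos = sum(logistic_loss(s) for s in scores_pos) / len(scores_pos)
    r_p_neg = sum(logistic_loss(-s) for s in scores_pos) / len(scores_pos)
    r_u_neg = sum(logistic_loss(-s) for s in scores_unl) / len(scores_unl)
    # Clamp the estimated negative risk at zero: without the max(0, ...)
    # the term can go negative on finite samples and the model overfits.
    return prior * r_p_pos + max(0.0, r_u_neg - prior * r_p_neg)
```

The clamp is what distinguishes the non-negative estimator from naive PU risk; with `prior=0` the expression degenerates to treating all unlabeled data as negative.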
Quick Start & Requirements
Install the dependencies with pip install -r requirements.txt. The datasets are expected under ./data, and NLTK's punkt package must be downloaded.

Highlighted Details
The highlighted training configuration, dual_softmax_dyn_dtrun, applies the PU learning strategy with data augmentation.

Maintenance & Community
The project is associated with the ICLR'24 conference and has released models on HuggingFace. Future updates are planned to align with the latest LLMs.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. However, it acknowledges referencing OpenAI's gpt-2-output-dataset
repository, which is typically under a permissive license. Compatibility for commercial use would require explicit license confirmation.
Limitations & Caveats
The README mentions that for experiments in the paper, the original HC3 dataset should be used, while a cleaned version is provided separately. The specific license for the code and models is not clearly defined, which may impact commercial adoption.