AIGC_text_detector by YuchuanTian

Research paper code for AIGC text detection using positive-unlabeled learning

created 2 years ago
291 stars

Top 91.6% on sourcepulse

Project Summary

This repository provides the official code for "Multiscale Positive-Unlabeled Detection of AI-Generated Texts," an ICLR'24 Spotlight paper. It addresses the challenge of detecting AI-generated text with a novel Positive-Unlabeled (PU) learning approach, offering a high-accuracy detector for researchers and developers working on AI content authenticity and security.

How It Works

The project implements a multiscale Positive-Unlabeled (PU) detection strategy. PU learning targets scenarios where only positive (AI-generated) and unlabeled data carry reliable labels; here, short machine-generated texts are treated as partially unlabeled during training, since brief AI outputs are often indistinguishable from human writing. The approach fine-tunes models such as RoBERTa and BERT, combining dedicated training strategies with data preprocessing (e.g., redundant-space removal) to improve detection across text granularities (full documents vs. single sentences).
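As a rough illustration of the PU idea (not the repository's exact objective), the sketch below implements a non-negative PU risk estimator in PyTorch, with AI-generated text as the positive class and short texts supplying the unlabeled pool; the function name, loss form, and class-prior value are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def pu_risk(logits_pos, logits_unl, prior=0.3):
    """Non-negative PU risk estimator (nnPU-style) as a stand-in for the
    paper's multiscale PU objective.

    logits_pos: classifier scores for texts labeled AI-generated (positive)
    logits_unl: classifier scores for unlabeled texts (e.g., short samples)
    prior:      assumed fraction of positives among the unlabeled pool
    """
    # Sigmoid surrogate losses: softplus(-z) penalizes predicting "negative",
    # softplus(z) penalizes predicting "positive".
    risk_pos_as_pos = F.softplus(-logits_pos).mean()
    risk_pos_as_neg = F.softplus(logits_pos).mean()
    risk_unl_as_neg = F.softplus(logits_unl).mean()

    # Estimate the negative-class risk from the unlabeled data, then clamp at
    # zero so the estimate cannot go negative (the "non-negative" trick).
    negative_risk = risk_unl_as_neg - prior * risk_pos_as_neg
    return prior * risk_pos_as_pos + torch.clamp(negative_risk, min=0.0)
```

In the paper's multiscale formulation, how a sample is treated depends on its length rather than on a single fixed prior; this sketch keeps only the basic PU mechanics.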

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt
  • Place the datasets under ./data and download NLTK's punkt tokenizer.
  • Training requires NVIDIA GPU with CUDA.
  • Official HuggingFace demos and APIs are available (see the loading sketch after this list).
  • Paper: https://arxiv.org/pdf/2305.18149.pdf
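
A minimal Python sketch of the setup steps above, assuming a transformers-compatible checkpoint; the model identifier is a placeholder to be replaced with the checkpoint named in the repository's README.

```python
import nltk
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

nltk.download("punkt")  # sentence tokenizer used during preprocessing

# Placeholder id: substitute the detector checkpoint listed in the README.
model_id = "<detector-checkpoint-from-README>"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "Paragraph to score for AI authorship."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print(probs)  # label order (human vs. AI) depends on the checkpoint config
```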

Highlighted Details

  • Achieves 98.40% accuracy on HC3-Full-En and 85.31% on HC3-Sent-En with RoBERTa (seed average).
  • Offers both English (env2) and Chinese (zhv2) detectors, with the latter matching SOTA closed-source performance.
  • Includes data cleaning scripts and pre-trained models on HuggingFace.
  • Utilizes a dual_softmax_dyn_dtrun PU learning strategy with data augmentation (see the sketch after this list).
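
The sketch below illustrates one plausible form of that augmentation, in the spirit of the paper's multiscaling of training texts by random sentence deletion; the function name, deletion probability, and exact procedure are assumptions rather than the repository's implementation.

```python
import random
from nltk.tokenize import sent_tokenize  # requires NLTK's punkt package

def multiscale_augment(text, delete_prob=0.25, seed=None):
    """Randomly drop sentences to produce a shorter-scale training sample."""
    rng = random.Random(seed)
    sentences = sent_tokenize(text)
    if not sentences:
        return text
    kept = [s for s in sentences if rng.random() > delete_prob]
    if not kept:  # keep at least one sentence so the sample stays non-empty
        kept = [rng.choice(sentences)]
    return " ".join(kept)
```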

Maintenance & Community

The project is associated with the ICLR'24 conference and has released models on HuggingFace. Future updates are planned to align with the latest LLMs.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. It acknowledges referencing OpenAI's gpt-2-output-dataset repository, but that does not establish a license for this project's own code and models; commercial use would require explicit confirmation from the authors.

Limitations & Caveats

The README notes that experiments from the paper should use the original HC3 dataset; the cleaned version is provided separately. The license for the code and models is not clearly defined, which may impact commercial adoption.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 52 stars in the last 90 days
