chatgpt-comparison-detection by Hello-SimpleAI

ChatGPT detection research paper and corpus

Created 3 years ago

1,342 stars

Top 29.5% on SourcePulse

3 Experts Love This Project

Kaichao You

Core Maintainer of vLLM

Omar Sanseviero

DevRel at Google DeepMind

Junyang Lin

Core Maintainer at Alibaba Qwen

Project Summary

This project provides tools and datasets for detecting ChatGPT-generated content. It is aimed at researchers and developers working on AI safety and content authenticity. It offers bilingual (English/Chinese) detectors and the Human ChatGPT Comparison Corpus (HC3), which can be used to evaluate detectors and to compare AI-generated text with responses written by human experts.

How It Works

The project builds its primary detectors on pre-trained language models (PLMs) such as RoBERTa. It offers two PLM-based classifiers: one takes a question-answer pair as input, and the other takes a single text. A third detector relies on linguistic features. The QA classifier can use the relationship between a question and its answer, while the single-text classifier covers cases where only standalone text is available.
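
The difference between the two PLM-based classifiers comes down to what the model receives as input. The following is a minimal sketch, assuming the Hugging Face `transformers` text-classification pipeline: a QA-style detector gets the question and answer as a sentence pair, while a single-text detector gets the answer alone. The helper names `build_input` and `detect` are hypothetical, and the checkpoint name is left to the caller because this page does not list model identifiers.

```python
from typing import Optional, Union

# A pipeline input is either a bare string or a sentence-pair dict.
PipelineInput = Union[str, dict]


def build_input(answer: str, question: Optional[str] = None) -> PipelineInput:
    """Shape one example for a text-classification pipeline.

    With a question, return a sentence pair (QA-style detector).
    Without one, return the bare text (single-text detector).
    """
    if question is not None:
        return {"text": question, "text_pair": answer}
    return answer


def detect(model_id: str, answer: str, question: Optional[str] = None):
    """Run a detector checkpoint on one example.

    `model_id` is an assumption supplied by the caller; look up the
    actual checkpoint names on the Hugging Face Hub.
    """
    # Imported lazily so build_input can be used without transformers installed.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_id)
    return classifier(build_input(answer, question))
```

Passing a question selects the pair format, so the same `detect` call can serve either detector type as long as the checkpoint matches the input shape it was trained on.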

Quick Start & Requirements

Models are available on Hugging Face Hub.
English models are based on roberta-base.
Chinese models are based on hfl/chinese-roberta-wwm-ext.
The HC3 dataset is available on Hugging Face Datasets and ModelScope.
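
A comparison corpus like HC3 pairs each question with human and ChatGPT answers, so a common first step is to flatten it into labeled examples for a classifier. The sketch below assumes each record has `question`, `human_answers`, and `chatgpt_answers` fields, and that the dataset is published on Hugging Face Datasets as `Hello-SimpleAI/HC3` with an `all` configuration. These names are assumptions not stated on this page, so check the dataset card first. Whether the `load_dataset` call works unchanged also depends on the installed `datasets` version.

```python
from typing import Dict, List


def flatten_record(record: Dict) -> List[Dict]:
    """Turn one question with multiple answers into labeled rows.

    Label 0 marks a human answer, label 1 a ChatGPT answer.
    The field names are assumed, not confirmed by this page.
    """
    rows = []
    for label, key in ((0, "human_answers"), (1, "chatgpt_answers")):
        for answer in record.get(key, []):
            rows.append(
                {"question": record["question"], "answer": answer, "label": label}
            )
    return rows


def load_hc3_rows(config: str = "all") -> List[Dict]:
    """Download the corpus and flatten every record.

    The dataset ID and configuration name are assumptions.
    """
    # Imported lazily so flatten_record works without datasets installed.
    from datasets import load_dataset

    dataset = load_dataset("Hello-SimpleAI/HC3", name=config, split="train")
    return [row for record in dataset for row in flatten_record(record)]
```

Keeping the question in every row lets the same output feed either a QA-pair classifier or, by ignoring the `question` field, a single-text classifier.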

Highlighted Details

Provides a Human ChatGPT Comparison Corpus (HC3) in both English and Chinese.
Offers three types of detectors: QA version, Single-text version, and Linguistic version.
All detectors are bilingual, supporting both English and Chinese.
Models are based on RoBERTa architectures.

Maintenance & Community

The project was initiated in December 2022, shortly after ChatGPT's launch, and went on to release datasets, models, and demos. The team comprises PhD students and engineers from multiple institutions. Feedback is encouraged via a dedicated space.

Licensing & Compatibility

Dataset licenses vary by source, with some following CC-BY-SA, CC-BY-NC 4.0, CC-BY 4.0, CC0, MIT, or BSD, while others require direct inquiry. This mixed licensing may impact commercial use or integration into closed-source projects.

Limitations & Caveats

The license for some dataset components is listed as "Unknown" or requires contacting the source directly, which may be an obstacle for commercial adoption. The project is research-oriented, and it does not explicitly state performance benchmarks or production-readiness details.

Health Check

Last Commit: 2 years ago

Responsiveness: 1 week

Star History

8 stars in the last 30 days