ChatGPT detection research paper and corpus
This project provides tools and datasets for detecting ChatGPT-generated content. It is aimed at researchers and developers interested in AI safety and content authenticity. It offers bilingual (English and Chinese) detectors and the Human ChatGPT Comparison Corpus (HC3), a comparison corpus for evaluating how well AI-generated text can be told apart from human expert responses.
How It Works
The primary detectors build on pre-trained language models (PLMs) such as RoBERTa. There are two PLM-based classifiers: one takes a question-answer pair as input, the other a single text. A third detector relies on linguistic features instead. Together these cover two cases: detection that uses the context of the question an answer responds to, and detection from the characteristics of a standalone text. The stated aim is higher accuracy than generic methods.
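To make the idea of a linguistic-feature detector more concrete, here is a minimal sketch of the kind of surface statistics such a detector might compute before passing them to a classifier. The function name `linguistic_features` and the three features chosen (average sentence length, type-token ratio, average word length) are illustrative assumptions, not the features the project actually uses.

```python
import re


def linguistic_features(text: str) -> dict[str, float]:
    """Compute a few surface-level statistics for one text.

    Hypothetical feature set for illustration only; the project's
    real linguistic detector may use entirely different features.
    """
    # Split on runs of sentence-ending punctuation and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lowercased word tokens (ASCII letters and apostrophes only).
    words = re.findall(r"[A-Za-z']+", text.lower())

    if not words:
        return {
            "avg_sentence_len": 0.0,
            "type_token_ratio": 0.0,
            "avg_word_len": 0.0,
        }

    return {
        # Mean number of words per sentence.
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        # Vocabulary diversity: distinct words divided by total words.
        "type_token_ratio": len(set(words)) / len(words),
        # Mean number of characters per word.
        "avg_word_len": sum(len(w) for w in words) / len(words),
    }


if __name__ == "__main__":
    print(linguistic_features("The cat sat. The cat ran!"))
```

The resulting dictionary could be turned into a feature vector for any standard classifier. The tokenization here only handles English-style text; Chinese input would need a proper word segmenter.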
Quick Start & Requirements
Base models named: roberta-base (English) and hfl/chinese-roberta-wwm-ext (Chinese).
Highlighted Details
Maintenance & Community
The project began in December 2022, shortly after ChatGPT's launch, and has since released datasets, models, and demos. The team comprises PhD students and engineers from multiple institutions. Feedback is encouraged via a dedicated space.
Licensing & Compatibility
Dataset licenses vary by source: some components follow CC-BY-SA, CC-BY-NC 4.0, CC-BY 4.0, CC0, MIT, or BSD, while others require direct inquiry. This mixed licensing may restrict commercial use or integration into closed-source projects.
Limitations & Caveats
The licensing for some dataset components is listed as "Unknown" or requires direct contact, potentially posing adoption challenges for commercial applications. The project is research-oriented, and specific performance benchmarks or production-readiness details are not explicitly stated.