chatgpt-comparison-detection  by Hello-SimpleAI

ChatGPT detection research paper and corpus

created 2 years ago
1,321 stars

Top 31.0% on sourcepulse

Project Summary

This project provides tools and datasets for detecting ChatGPT-generated text, aimed at researchers and developers working on AI safety and content authenticity. It offers bilingual (English and Chinese) detectors and the Human ChatGPT Comparison Corpus (HC3), which is used to evaluate detectors and to distinguish AI-generated text from human expert responses.

How It Works

The primary detectors are built on pre-trained language models (PLMs) such as RoBERTa. There are two PLM-based classifiers: one takes a question-answer pair as input, the other a single text. A third detector relies on linguistic features instead. The QA classifier can use the relationship between a question and its answer, while the single-text classifier judges a passage on its own characteristics.
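The difference between the QA and single-text classifiers can be sketched with the Hugging Face `transformers` text-classification pipeline. This is a minimal sketch, not the project's own code: the model IDs are assumptions based on the project's organization name and should be verified on the Hub, the helper names `build_input`, `choose_model`, and `detect` are hypothetical, and passing the question as the first segment is also an assumption.

```python
from typing import Optional

# Assumed Hub model IDs; verify on the Hugging Face Hub before relying on them.
SINGLE_TEXT_MODEL = "Hello-SimpleAI/chatgpt-detector-roberta"
QA_MODEL = "Hello-SimpleAI/chatgpt-qa-detector-roberta"


def build_input(answer: str, question: Optional[str] = None):
    """Return a plain string for the single-text detector, or a
    text/text_pair dict so the QA detector sees question and answer together."""
    if question is None:
        return answer
    # Assumption: the question is the first segment, the answer the second.
    return {"text": question, "text_pair": answer}


def choose_model(question: Optional[str]) -> str:
    """Pick the QA model when a question is available, else the single-text one."""
    return SINGLE_TEXT_MODEL if question is None else QA_MODEL


def detect(answer: str, question: Optional[str] = None):
    """Hypothetical wrapper; needs `pip install transformers torch` and network access."""
    from transformers import pipeline  # lazy import; weights download on first use

    clf = pipeline("text-classification", model=choose_model(question))
    # truncation=True keeps long inputs within the model's maximum length.
    return clf(build_input(answer, question), truncation=True)
```

Calling `detect("some answer")` would use the single-text model, while `detect("some answer", question="some question")` would route to the QA model. The label names in the result depend on each model's configuration, so they are not assumed here.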

Quick Start & Requirements

  • Models are available on Hugging Face Hub.
  • English models are based on roberta-base.
  • Chinese models are based on hfl/chinese-roberta-wwm-ext.
  • The HC3 dataset is available on Hugging Face Datasets and ModelScope.
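Working with HC3 could look roughly like the sketch below. It assumes the dataset ID `Hello-SimpleAI/HC3`, a config named `all`, and records with `question`, `human_answers`, and `chatgpt_answers` fields; none of these are confirmed by this page, so check the dataset card first. The helper names and the 0/1 label convention are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Assumed dataset ID and config name; not confirmed by this page.
HC3_DATASET_ID = "Hello-SimpleAI/HC3"
HC3_CONFIG = "all"

HUMAN, CHATGPT = 0, 1  # label convention chosen for this sketch only


def flatten_record(record: Dict) -> List[Tuple[str, str, int]]:
    """Turn one HC3-style record into (question, answer, label) triples."""
    question = record["question"]
    rows = [(question, a, HUMAN) for a in record.get("human_answers", [])]
    rows += [(question, a, CHATGPT) for a in record.get("chatgpt_answers", [])]
    return rows


def load_hc3(split: str = "train"):
    """Hypothetical loader; needs `pip install datasets` and network access."""
    from datasets import load_dataset  # lazy import keeps flatten_record usable offline

    return load_dataset(HC3_DATASET_ID, HC3_CONFIG, split=split)
```

Flattening each record this way yields one labeled row per answer, which suits either detector type: the single-text version can ignore the question, while the QA version can use both fields.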

Highlighted Details

  • Provides a Human ChatGPT Comparison Corpus (HC3) in both English and Chinese.
  • Offers three types of detectors: QA version, Single-text version, and Linguistic version.
  • All detectors are bilingual, supporting both English and Chinese.
  • Models are based on RoBERTa architectures.

Maintenance & Community

The project was initiated in December 2022, shortly after ChatGPT's launch, and saw active development through releases of datasets, models, and demos. The team comprises PhD students and engineers from multiple institutions. Feedback is encouraged via a dedicated space.

Licensing & Compatibility

Dataset licenses vary by source: some components are released under CC-BY-SA, CC-BY-NC 4.0, CC-BY 4.0, CC0, MIT, or BSD, while others require direct inquiry. This mixed licensing may restrict commercial use or integration into closed-source projects.

Limitations & Caveats

Some dataset components have their license listed as "Unknown" or require contacting the source directly, which may hinder adoption in commercial applications. The project is research-oriented, and specific performance benchmarks or production-readiness details are not explicitly stated.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 23 stars in the last 90 days
