chatgpt-comparison-detection  by Hello-SimpleAI

ChatGPT detection research paper and corpus

Created 2 years ago
1,324 stars

Top 30.3% on SourcePulse

Project Summary

This project provides tools and datasets for detecting ChatGPT-generated content, targeting researchers and developers interested in AI safety and content authenticity. It offers bilingual (English/Chinese) detectors and a comprehensive comparison corpus (HC3) to evaluate and differentiate AI-generated text from human expert responses.

How It Works

The project builds its primary detectors on pre-trained language models (PLMs) such as RoBERTa. It offers two PLM-based classifiers: one scores a question-answer pair, and the other scores a single text on its own. A third detector relies on linguistic features instead. The QA version can take the relationship between question and answer into account, while the single-text version works when no question is available, with the aim of improving accuracy over generic methods.
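
The practical difference between the two PLM-based classifiers is the input each one receives. The following is a minimal sketch, assuming the detectors are driven through the Hugging Face `transformers` text-classification pipeline, which accepts either a plain string or a dict with `text` and `text_pair` keys. The helper names `single_text_input`, `qa_input`, and `detect` are hypothetical and not part of the project.

```python
from typing import Callable, Dict, Union

DetectorInput = Union[str, Dict[str, str]]


def single_text_input(answer: str) -> DetectorInput:
    """Single-text detector: only the text under test is scored."""
    return answer.strip()


def qa_input(question: str, answer: str) -> DetectorInput:
    """QA detector: question and answer are passed together as a sentence pair."""
    return {"text": question.strip(), "text_pair": answer.strip()}


def detect(classifier: Callable[[DetectorInput], object], item: DetectorInput):
    """Run any callable classifier (for example a transformers pipeline) on one input."""
    return classifier(item)
```

Keeping the classifier as a plain callable means the same input helpers work with a real pipeline or with a stub during testing.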

Quick Start & Requirements

  • Models are available on Hugging Face Hub.
  • English models are based on roberta-base.
  • Chinese models are based on hfl/chinese-roberta-wwm-ext.
  • The HC3 dataset is available on Hugging Face Datasets and ModelScope.
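
A hedged sketch of picking a detector by language and loading it on demand. The Hub IDs below are assumptions based on the organization name and should be verified on the Hugging Face Hub before use; `DETECTOR_IDS`, `detector_id`, and `load_detector` are hypothetical names. The `transformers` import is deferred so the lookup itself works without the library installed.

```python
# Assumed Hub IDs for the single-text detectors; verify on the Hugging Face Hub.
DETECTOR_IDS = {
    "en": "Hello-SimpleAI/chatgpt-detector-roberta",
    "zh": "Hello-SimpleAI/chatgpt-detector-roberta-chinese",
}


def detector_id(lang: str) -> str:
    """Return the assumed Hub ID for a language code ('en' or 'zh')."""
    try:
        return DETECTOR_IDS[lang]
    except KeyError:
        raise ValueError(f"unsupported language: {lang!r}") from None


def load_detector(lang: str = "en"):
    """Download the model and build a text-classification pipeline (needs network access)."""
    from transformers import pipeline  # deferred: heavy optional dependency

    return pipeline("text-classification", model=detector_id(lang))
```

Calling the object returned by `load_detector` on a string should yield a list of label/score dicts; the label names depend on the model's own configuration.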

Highlighted Details

  • Provides a Human ChatGPT Comparison Corpus (HC3) in both English and Chinese.
  • Offers three types of detectors: QA version, Single-text version, and Linguistic version.
  • All detectors are bilingual, supporting both English and Chinese.
  • The QA and single-text models are based on RoBERTa architectures; the linguistic version relies on linguistic features.
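
To train or evaluate a binary detector on the corpus, each question with its human and ChatGPT answers can be flattened into labeled examples. This is a minimal sketch, assuming each record carries `question`, `human_answers`, and `chatgpt_answers` fields (an assumption about the published schema; check the dataset card). The labels 0 for human and 1 for ChatGPT are an arbitrary convention chosen for this example.

```python
from typing import Any, Dict, Iterator, List, Tuple

HUMAN, CHATGPT = 0, 1  # arbitrary label convention for this sketch


def flatten_record(record: Dict[str, Any]) -> Iterator[Tuple[str, str, int]]:
    """Yield (question, answer, label) triples from one HC3-style record."""
    question = str(record["question"])
    for answer in record.get("human_answers", []):
        yield question, answer, HUMAN
    for answer in record.get("chatgpt_answers", []):
        yield question, answer, CHATGPT


def accuracy(predictions: List[int], labels: List[int]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(predictions) != len(labels):
        raise ValueError("length mismatch")
    if not labels:
        raise ValueError("no examples to score")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)
```

The triples can feed either detector type: the QA version uses both question and answer, while the single-text version ignores the question.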

Maintenance & Community

The project was started in December 2022, shortly after ChatGPT's launch, and has since released datasets, models, and demos. The team comprises PhD students and engineers from multiple institutions. Feedback is encouraged via a dedicated space.

Licensing & Compatibility

Dataset licenses vary by source, with some following CC-BY-SA, CC-BY-NC 4.0, CC-BY 4.0, CC0, MIT, or BSD, while others require direct inquiry. This mixed licensing may impact commercial use or integration into closed-source projects.

Limitations & Caveats

The licensing for some dataset components is listed as "Unknown" or requires direct contact, potentially posing adoption challenges for commercial applications. The project is research-oriented, and specific performance benchmarks or production-readiness details are not explicitly stated.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 5 stars in the last 30 days
