chatgpt-comparison-detection - Comparison and Detection of Human vs. AI Generated Content in HC3 Corpus

ChatGPT-Comparison-Detection Project 🔬

The ChatGPT-Comparison-Detection project accompanies the paper "How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection". It provides datasets and tools for comparing human-written content with content generated by ChatGPT, and for detecting the latter.

Human ChatGPT Comparison Corpus (HC3)

At the heart of the project is the Human ChatGPT Comparison Corpus, known as HC3. The project presents it as the first corpus to pair human responses with ChatGPT responses, and it is available in both English and Chinese. The initial versions of the datasets can be accessed on Hugging Face and ModelScope. They serve both as a research resource and as training data for detection models.
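A comparison corpus of this kind is usually flattened into labeled rows before a detector is trained on it. The sketch below illustrates that step on a hand-written record. The field names `question`, `human_answers`, and `chatgpt_answers` are assumptions about the record layout, not something stated here, so check them against the published dataset card before relying on them.

```python
from typing import Dict, Iterator

def flatten_record(record: Dict) -> Iterator[Dict]:
    """Yield one labeled row per answer in a comparison record.

    Assumes (hypothetically) that each record holds one question plus
    two lists of answers, one written by humans and one by ChatGPT.
    """
    question = record["question"]
    for answer in record.get("human_answers", []):
        yield {"question": question, "answer": answer, "label": "human"}
    for answer in record.get("chatgpt_answers", []):
        yield {"question": question, "answer": answer, "label": "chatgpt"}

# Hand-written record for illustration only; it is not taken from HC3.
example = {
    "question": "What is a corpus?",
    "human_answers": ["A collection of texts.", "A body of written material."],
    "chatgpt_answers": ["A corpus is a large, structured set of texts."],
}

rows = list(flatten_record(example))
for row in rows:
    print(row["label"], "|", row["answer"])
```

Keeping the question on every row lets the same rows feed either a question-plus-answer classifier or an answer-only classifier.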

Dataset Copyright

HC3 follows the licenses of the original datasets from which it draws its data. If a source dataset's license is stricter than Creative Commons Attribution-ShareAlike (CC-BY-SA), that portion of HC3 follows the stricter license. Otherwise, it is released under CC-BY-SA, balancing open access with respect for the original data sources.

ChatGPT Detectors

The project detects ChatGPT-generated content with dedicated models called ChatGPT detectors. It offers three types of detectors, all of which support English and Chinese:

QA Version - Detects whether an answer to a specific question was generated by ChatGPT, using classifiers based on pre-trained language models (PLMs).
Single-text Version - Evaluates standalone text for ChatGPT generation using PLM-based models.
Linguistic Version - Utilizes linguistic features to determine the origin of standalone text.

These detectors are hosted on Hugging Face Spaces and are also available on ModelScope for users in the Chinese community. The models are based on the RoBERTa architecture, with separate models for English and Chinese text.
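The three detector versions differ mainly in what they take as input. The following is a minimal sketch of that difference, not the project's implementation: the function names are hypothetical, and the three features shown (average sentence length, type-token ratio, average word length) are generic illustrations chosen for this example rather than the features the linguistic detector actually uses. The word pattern only handles English; Chinese text would need a word segmenter first.

```python
import re
from typing import Dict, Tuple

def qa_input(question: str, answer: str) -> Tuple[str, str]:
    """QA version: the classifier sees the question and the answer as a pair."""
    return (question.strip(), answer.strip())

def single_text_input(text: str) -> str:
    """Single-text version: the classifier sees only the standalone text."""
    return text.strip()

def linguistic_features(text: str) -> Dict[str, float]:
    """Linguistic version: map standalone text to a few numeric features.

    Illustrative features only, assumed for this sketch.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words:
        return {"avg_sentence_len": 0.0, "type_token_ratio": 0.0, "avg_word_len": 0.0}
    return {
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "type_token_ratio": len(set(words)) / len(words),
        "avg_word_len": sum(len(w) for w in words) / len(words),
    }

pair = qa_input(" What is HC3? ", " A comparison corpus. ")
features = linguistic_features("The cat sat. The cat ran!")
print(pair)
print(features)
```

A feature dictionary like this would typically be passed to a lightweight classifier, whereas the paired and single-text inputs would go through a tokenizer into a PLM-based model.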

Important Dates

The project started on December 9, 2022, shortly after the launch of ChatGPT. Key milestones include:

Launch of the comparison data collection.
Release of the ChatGPT Detector's demo in January 2023.
Open sourcing of models and the comparison corpus.
Publication of the accompanying research paper.

Our Story

The ChatGPT-Comparison-Detection project grew out of a desire to contribute to AI literacy and evaluation research. The team, made up of PhD students and engineers from several institutions, builds open-source tools for detecting AI-generated content and compiles datasets to support academic study of what AI-generated content can and cannot do.

About Us

The team consists of researchers and engineers who share a commitment to research that benefits the wider community. Its members come from a range of academic and professional backgrounds, which broadens the project's scope.

To learn more or to give feedback, visit the project's repository and join the ongoing discussion on improving these tools and resources.