LLM failure archive for ChatGPT and similar models
This repository serves as an archive of failure cases encountered with large language models (LLMs) such as ChatGPT and Bing AI. It aims to provide a curated collection of examples that researchers, developers, and users can study, compare, and potentially reuse as seed material for synthetic data when testing and training LLMs.
How It Works
The archive is organized by the type of failure observed, including arithmetic errors, biases, hallucinations, logical inconsistencies, and failures in common sense reasoning. Each entry typically includes a description of the failure, a transcript of the interaction, the expected correct output, and links to the original source (often social media or forums) where the failure was reported.
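To make the entry structure concrete, below is a minimal sketch of how a single archive entry could be represented programmatically. The field names and the example failure are illustrative assumptions, not the repository's actual data format.

```python
from dataclasses import dataclass

# Hypothetical sketch of one archive entry; field names are illustrative
# and do not reflect the repository's real layout.
@dataclass
class FailureEntry:
    category: str          # e.g. "arithmetic", "bias", "hallucination"
    description: str       # short summary of what went wrong
    transcript: str        # the reported prompt/response exchange
    expected_output: str   # what a correct response should have contained
    source_url: str        # link to the original report (social media, forum)
    model: str = "ChatGPT" # model the failure was observed on


# Made-up arithmetic failure, purely for illustration:
entry = FailureEntry(
    category="arithmetic",
    description="Model miscomputes a two-digit multiplication.",
    transcript="User: What is 17 * 23?\nModel: 17 * 23 = 401",
    expected_output="17 * 23 = 391",
    source_url="https://example.com/report",
)
print(entry.category, "-", entry.description)
```

A schema along these lines would also make it straightforward to filter entries by failure type or to batch them into regression prompts when re-testing newer model versions.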
Maintenance and Community
This is a community-driven project, with contributions primarily from individual researchers and users sharing their findings. There is no explicit mention of active maintenance or a dedicated community forum like Discord or Slack within the README.
Licensing and Compatibility
The repository does not specify a license. Content is presented for informational and research purposes.
Limitations and Caveats
The archive is a collection of reported incidents and may not represent a statistically comprehensive analysis of LLM failures. Reproducibility of specific failures can vary depending on model updates and the exact prompts used.