efaqa-corpus-zh by chatopera

Chinese psychological counseling Q&A corpus for AI

Created 6 years ago

764 stars

Top 44.9% on SourcePulse

Project Summary

This repository provides the Emotional First Aid Dataset (efaqa-corpus-zh), an annotated Chinese corpus of psychological counseling Q&A with 20,000 multi-turn dialogue entries. It targets AI-driven mental health support and LLM fine-tuning, and is intended primarily for researchers and developers working in AI and psychology.

How It Works

The dataset comprises 20,000 manually annotated multi-turn dialogue entries from psychological counseling sessions. Each entry is labeled along three dimensions: distress type (s1), psychological disorder (s2), and emergency level (s3). A larger raw dataset is also available for unsupervised LLM training. Annotation averaged over one minute per entry, which reflects the effort spent capturing conversational context and assigning classifications.
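As a rough illustration of the three label dimensions, the sketch below tallies records by one dimension, here the emergency level (s3). The record layout (a `label` dict keyed by `s1`, `s2`, `s3`), the label values, and the sample records are all assumptions made for illustration; check them against the actual data before relying on them.

```python
from collections import Counter

# Hypothetical records: the "label" layout and the label values are assumed,
# not taken from the corpus documentation.
sample_records = [
    {"title": "example A", "label": {"s1": "1.1", "s2": "2.1", "s3": "3.1"}},
    {"title": "example B", "label": {"s1": "1.2", "s2": "2.1", "s3": "3.2"}},
    {"title": "example C", "label": {"s1": "1.1", "s2": "2.3", "s3": "3.1"}},
]


def count_by_dimension(records, dimension):
    """Count how many records carry each value of one label dimension."""
    return Counter(
        record["label"][dimension]
        for record in records
        if dimension in record.get("label", {})  # skip records missing the label
    )


emergency_counts = count_by_dimension(sample_records, "s3")
print(emergency_counts.most_common())
```

The same helper works for `s1` or `s2`, which makes it a quick way to inspect class balance before fine-tuning.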

Quick Start & Requirements

Primary install: pip install -U efaqa-corpus-zh
Prerequisites: Python 2.x or 3.x, Pip.
Data Acquisition: Requires purchasing a license/certificate from "证书商店" (Certificate Store) and setting the EFAQA_DL_LICENSE environment variable with your certificate identifier.
Usage: Load data via import efaqa_corpus_zh and records = list(efaqa_corpus_zh.load()).
Links: Home page (首页), Media coverage (媒体报道), The Road Ahead (未来之路).
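The quick-start steps above can be sketched as a small script. The package name, the `EFAQA_DL_LICENSE` variable, and the `load()` call come from the notes above; the helper names `configure_license` and `load_records` are hypothetical, and it is an assumption that the variable needs to be set before the package is imported.

```python
import os


def configure_license(license_id):
    """Expose the purchased certificate identifier to the loader."""
    # EFAQA_DL_LICENSE is the variable named in the quick-start notes.
    os.environ["EFAQA_DL_LICENSE"] = license_id


def load_records():
    """Return all records as a list, or None if the package is missing."""
    try:
        # Installed via: pip install -U efaqa-corpus-zh
        import efaqa_corpus_zh
    except ImportError:
        return None
    # load() is wrapped in list(), as in the usage line above.
    return list(efaqa_corpus_zh.load())
```

A typical session would call `configure_license("YOUR_CERTIFICATE_ID")` with the identifier obtained from the Certificate Store, then `load_records()`. The placeholder string is not a real identifier.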

Highlighted Details

Contains 20,000 annotated multi-turn dialogue entries, representing the largest publicly available Chinese psychological counseling dialogue corpus as of April 2022.
Features a comprehensive multi-dimensional labeling system (s1, s2, s3) for classifying distress types, potential psychological disorders, and emergency levels.
Detailed data format includes sender, message type, timestamp, content, and specific labels for conversation turns (question, knowledge, negative).
Developed in collaboration with academic institutions (Stanford, UCLA, Fu Jen Catholic University) and industry professionals.
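To make the per-turn format more concrete, here is a minimal sketch that pulls out the turns flagged with a given label. The field names (`chats`, `sender`, `type`, `time`, `value`, `label`), the boolean flag encoding, and the sample record are assumptions for illustration and may differ from the real corpus.

```python
# Hypothetical record: field names and all values are assumed for
# illustration only and are not taken from the corpus documentation.
sample_record = {
    "title": "example title",
    "chats": [
        {
            "sender": "owner",
            "type": "textMessage",
            "time": "10:00:00",
            "value": "example question",
            "label": {"question": True, "knowledge": False, "negative": True},
        },
        {
            "sender": "audience",
            "type": "textMessage",
            "time": "10:01:00",
            "value": "example reply",
            "label": {"question": False, "knowledge": True, "negative": False},
        },
    ],
}


def turns_with_flag(record, flag):
    """Return the content of every turn whose label marks `flag` as true."""
    return [
        turn["value"]
        for turn in record.get("chats", [])
        if turn.get("label", {}).get(flag, False)  # missing labels count as false
    ]


knowledge_turns = turns_with_flag(sample_record, "knowledge")
print(knowledge_turns)
```

Filtering on `question` or `negative` works the same way, which could help when building question/answer pairs from the dialogues.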

Maintenance & Community

The project is a collaboration involving academic institutions and Chatopera Inc. Support and issue reporting are handled via GitHub issues: https://github.com/chatopera/docs/issues. Volunteer contributors from multiple countries participated in data annotation.

Licensing & Compatibility

The dataset is distributed under the "春松许可证，v1.0" (ChunSong License, v1.0). The data may be used for research purposes only; commercial use is prohibited, and violations will be pursued legally.

Limitations & Caveats

The annotations are subjective and are not guaranteed to be fully accurate; the team disclaims liability for consequences arising from the data content. Extremely complex psychological disorders are not covered because they are difficult to annotate. Adoption is limited by two constraints: a license must be purchased to download and use the data, and the non-commercial restriction rules out commercial applications.

Explore Similar Projects

finetuned-qlora-falcon7b-medical by iamarunbrahma

EMPaper by Sahandfer

CharacterGLM-6B by thu-coai

MentalLLaMA by SteveKGYang

z-bench by zhenbench

smile by qiuhuachuan

MindChat by X-D-Lab

EmpatheticDialogues by facebookresearch

SoulChat by scutcyr

BianQue by scutcyr

rasa_chatbot_cn by GaoQ1

awesome-chatgpt-zh by EmbraceAGI