efaqa-corpus-zh  by chatopera

Chinese psychological counseling Q&A corpus for AI

Created 6 years ago
747 stars

Top 46.3% on SourcePulse

Project Summary

This repository provides the Emotional First Aid Dataset (efaqa-corpus-zh), a large-scale, annotated Chinese corpus for psychological counseling Q&A. It addresses the need for AI-driven mental health support and LLM fine-tuning by offering rich, multi-turn dialogue data. The dataset is primarily for researchers and developers in the AI and psychology domains.

How It Works

The dataset comprises 20,000 manually annotated multi-turn dialogue entries from psychological counseling sessions. Each entry is labeled along three dimensions: distress type (s1), psychological disorder (s2), and emergency level (s3). A larger raw dataset is also available for unsupervised LLM training. Annotation averaged over one minute per entry, reflecting the effort spent on capturing conversational context and assigning classifications.
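
As a rough illustration of how the three label dimensions might be inspected, the sketch below tallies label codes across a handful of records. The record layout (a "label" dictionary keyed by s1, s2, s3) and the code strings are assumptions made for this example; they are placeholders, not values taken from the corpus.

```python
from collections import Counter

# Hypothetical records. The "label" key and the code strings below are
# placeholders for illustration, not real corpus data.
sample_records = [
    {"title": "example 1", "label": {"s1": "code_a", "s2": "code_x", "s3": "code_p"}},
    {"title": "example 2", "label": {"s1": "code_b", "s2": "code_x", "s3": "code_q"}},
    {"title": "example 3", "label": {"s1": "code_a", "s2": "code_y", "s3": "code_p"}},
]


def tally_labels(records, dimension):
    """Count how often each code appears for one dimension ("s1", "s2" or "s3")."""
    return Counter(
        record["label"][dimension]
        for record in records
        if dimension in record.get("label", {})
    )


# Distribution of the assumed emergency-level codes in the sample.
s3_counts = tally_labels(sample_records, "s3")
```

A tally like this is a quick way to check class balance before using the labels for supervised training.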

Quick Start & Requirements

  • Primary install: pip install -U efaqa-corpus-zh
  • Prerequisites: Python 2.x or 3.x, pip.
  • Data Acquisition: Requires purchasing a license/certificate from "证书商店" (Certificate Store) and setting the EFAQA_DL_LICENSE environment variable with your certificate identifier.
  • Usage: Load data via import efaqa_corpus_zh and records = list(efaqa_corpus_zh.load()).
  • Links: Homepage, Media Coverage, The Road Ahead.
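
The steps above can be combined into a small helper. This is a minimal sketch: the helper name `load_records` is hypothetical, and it assumes the `EFAQA_DL_LICENSE` variable has to be set before the package is imported, which is why the import is deferred. Only `import efaqa_corpus_zh` and `efaqa_corpus_zh.load()` come from the project's own usage line.

```python
import os


def load_records(license_id=None):
    """Load the corpus into a list, checking for a license identifier first.

    `load_records` is a hypothetical wrapper written for this example.
    """
    if license_id:
        # Assumption: the package reads this variable when it fetches the data.
        os.environ["EFAQA_DL_LICENSE"] = license_id
    if not os.environ.get("EFAQA_DL_LICENSE"):
        raise RuntimeError(
            "EFAQA_DL_LICENSE is not set; a purchased certificate identifier is required."
        )
    # Deferred import so that the environment variable is in place beforehand.
    import efaqa_corpus_zh

    return list(efaqa_corpus_zh.load())
```

With a valid certificate, a call such as `records = load_records("YOUR_LICENSE")` should return the entries as a list; without one, the helper fails early with a clear message instead of attempting a download.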

Highlighted Details

  • Contains 20,000 annotated multi-turn dialogue entries, which the project describes as the largest publicly available Chinese psychological counseling dialogue corpus as of April 2022.
  • Features a comprehensive multi-dimensional labeling system (s1, s2, s3) for classifying distress types, potential psychological disorders, and emergency levels.
  • Detailed data format includes sender, message type, timestamp, content, and specific labels for conversation turns (question, knowledge, negative).
  • Developed in collaboration with academic institutions (Stanford, UCLA, Fu Jen Catholic University) and industry professionals.
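
To make the per-turn labels more concrete, here is a sketch that pulls out the turns flagged as questions from one dialogue. The key names ("chats", "sender", "type", "time", "content", "label") and the sample values are assumptions chosen to mirror the fields listed above; the actual keys in the corpus may differ.

```python
# Hypothetical dialogue entry. Key names and values are illustrative only.
sample_entry = {
    "chats": [
        {
            "sender": "owner",
            "type": "textMessage",
            "time": "10:00:00",
            "content": "placeholder question text",
            "label": {"question": True, "knowledge": False, "negative": False},
        },
        {
            "sender": "audience",
            "type": "textMessage",
            "time": "10:01:00",
            "content": "placeholder reply text",
            "label": {"question": False, "knowledge": True, "negative": False},
        },
    ]
}


def turns_with_flag(entry, flag):
    """Return the content of every turn whose label sets `flag` to True."""
    return [
        turn["content"]
        for turn in entry.get("chats", [])
        if turn.get("label", {}).get(flag)
    ]


questions = turns_with_flag(sample_entry, "question")
```

The same helper works for the other two flags, for example `turns_with_flag(sample_entry, "knowledge")`.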

Maintenance & Community

The project is a collaboration involving academic institutions and Chatopera Inc. Support and issue reporting are handled via GitHub issues: https://github.com/chatopera/docs/issues. Volunteer contributors from multiple countries participated in data annotation.

Licensing & Compatibility

The dataset is distributed under the "春松许可证,v1.0" (ChunSong License, v1.0). The data is strictly for research use only: commercial use is prohibited, and violations will be pursued legally.

Limitations & Caveats

The corpus is subjectively annotated and cannot be guaranteed 100% accurate; the team disclaims liability for consequences arising from data content. Extremely complex psychological disorders are not covered due to annotation difficulty. A significant adoption barrier is the requirement to purchase a license to download and use the data, and its strict non-commercial use restriction makes it incompatible with commercial applications.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 30 days
