z-bench by zhenbench

Chinese LLM prompt dataset for non-technical users

Created 3 years ago

503 stars

Top 62.1% on SourcePulse

3 Experts Love This Project

Kaichao You

Core Maintainer of vLLM

Shizhe Diao

Author of LMFlow; Research Scientist at NVIDIA

Ji Yichao

Cofounder of Manus

Project Summary

Z-Bench is a curated dataset of 300 Chinese prompts designed for non-technical users to qualitatively evaluate the conversational abilities of large language models (LLMs). Developed by Zhenfund, it aims to provide a practical, user-friendly alternative to complex academic benchmarks, focusing on real-world conversational AI performance.

How It Works

Z-Bench groups its prompts into three ability levels: "Basic," "Advanced," and "Specialized." The prompts are drawn from existing NLP benchmarks, user-collected examples, and observed emergent LLM capabilities. This approach prioritizes broad coverage of natural language processing tasks relevant to conversational AI, trading the rigor of automated academic test suites for an evaluation method that is easier to use.

Quick Start & Requirements

Dataset Access: Prompts are available in CSV format via Tencent Docs: https://docs.qq.com/sheet/DTEFsdkNERVVtR3BX
Requirements: None are stated. In practice you need only a way to read CSV files and access to the LLMs you want to evaluate.
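As a rough illustration of the "process CSV files" step, the sketch below groups prompts by ability level using only the Python standard library. The column names `category` and `prompt` and the four sample rows are hypothetical placeholders, not taken from the actual sheet; adjust them to match the header of the exported file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-in for the exported sheet. The real column names
# and prompts may differ; these rows are invented for illustration.
SAMPLE_CSV = """category,prompt
Basic,用一句话介绍你自己。
Basic,把“你好，世界”翻译成英文。
Advanced,解释为什么天空是蓝色的。
Specialized,写一个计算斐波那契数列的 Python 函数。
"""


def load_prompts(csv_file, category_col="category", prompt_col="prompt"):
    """Read an open CSV file object and group prompts by category."""
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        grouped[row[category_col].strip()].append(row[prompt_col].strip())
    return dict(grouped)


prompts = load_prompts(io.StringIO(SAMPLE_CSV))
for category, items in prompts.items():
    print(f"{category}: {len(items)} prompt(s)")
```

To use a real export, replace `io.StringIO(SAMPLE_CSV)` with something like `open("z-bench.csv", encoding="utf-8", newline="")`, where the file name is again an assumption.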

Highlighted Details

300 Chinese prompts covering basic, advanced, and specialized LLM capabilities.
Designed for qualitative, non-technical evaluation of conversational AI products.
Combines academic benchmarks, practical examples, and emergent LLM abilities.
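Because the evaluation is qualitative, a reviewer reads each model response and records a judgment by hand. One possible way to summarize those judgments per model and ability level is sketched below. The record layout, the model names, and the "pass"/"fail" verdict labels are all assumptions made for this example; the project does not prescribe a scoring scheme.

```python
from collections import Counter, defaultdict


def tally_verdicts(records):
    """Count reviewer verdicts for each (model, category) pair."""
    table = defaultdict(Counter)
    for model, category, verdict in records:
        table[(model, category)][verdict] += 1
    return table


def pass_rate(counts):
    """Share of 'pass' verdicts; 0.0 when nothing has been reviewed."""
    total = sum(counts.values())
    return counts["pass"] / total if total else 0.0


# Hypothetical reviewer notes: (model, category, verdict).
records = [
    ("model-a", "Basic", "pass"),
    ("model-a", "Basic", "fail"),
    ("model-a", "Advanced", "pass"),
    ("model-b", "Basic", "pass"),
]

summary = tally_verdicts(records)
for (model, category), counts in sorted(summary.items()):
    reviewed = sum(counts.values())
    print(f"{model} / {category}: {pass_rate(counts):.0%} passed of {reviewed}")
```

A simple tally like this keeps the human judgment at the center and only aggregates it, which fits a dataset meant for non-technical reviewers.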

Maintenance & Community

Developed by Zhenfund with contributions from several individuals.
The project aims for continuous improvement based on user feedback.

Licensing & Compatibility

The dataset is © 2023 Zhenfund. The README does not state specific licensing terms, though the dataset appears to be intended for evaluation use.

Limitations & Caveats

The dataset is intended for qualitative assessment and may not be suitable for rigorous academic benchmarking. The creators acknowledge that it may have omissions and that some content may look amateurish from a professional NLP perspective, and they plan to update it based on feedback.

Health Check

Last Commit

2 years ago

Responsiveness

1 day

Star History

2 stars in the last 30 days