Better-Ruozhiba by FunnySaltyFish

Enhanced Chinese QA dataset for LLM training

Created 1 year ago
250 stars

Top 100.0% on SourcePulse

Project Summary

This repository offers a meticulously curated Chinese Question-Answering (QA) dataset, "Better Ruozhiba," designed to enhance large language models. It addresses the need for high-quality, human-verified conversational data by refining an existing dataset derived from "Ruozhiba" (弱智吧, a Baidu Tieba forum known for absurd, logic-twisting questions). The project's primary benefit is providing a cleaner, more reliable Chinese corpus, improving LLM performance in understanding and generating natural language.

How It Works

The dataset is built upon an existing collection, with a significant emphasis on human intervention. Each question-answer pair undergoes manual review to correct formatting errors and improve the quality and accuracy of the responses. This human-centric approach aims to create a more robust and reliable Chinese language corpus compared to automatically generated or less-vetted datasets, focusing on natural language understanding and generation.

Quick Start & Requirements

No installation or specific requirements are detailed in the provided text. The project appears to be a dataset release.
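Since the release is a dataset rather than a library, usage would typically amount to parsing the data files directly. As a minimal sketch, assuming the pairs ship as a JSON array with `instruction`/`output` fields (a common convention for such releases, but not confirmed by the repository description), loading and filtering well-formed pairs might look like:

```python
import json

def load_qa_pairs(text):
    """Parse a JSON array of QA records, keeping only well-formed pairs.

    The field names "instruction" and "output" are assumptions about the
    release format, not confirmed by the repository description.
    """
    records = json.loads(text)
    pairs = []
    for rec in records:
        question = rec.get("instruction", "").strip()
        answer = rec.get("output", "").strip()
        if question and answer:  # drop pairs missing a question or an answer
            pairs.append({"instruction": question, "output": answer})
    return pairs

# Hypothetical sample in the assumed format; the second record is malformed.
sample = json.dumps([
    {"instruction": "只剩一个心脏了还能活吗？",
     "output": "能，人本来就只有一个心脏。"},
    {"instruction": "", "output": "缺少问题"},
], ensure_ascii=False)

print(len(load_qa_pairs(sample)))  # → 1
```

The empty-field filter mirrors the kind of formatting cleanup the project describes performing by hand; adjust the field names to whatever the actual release uses.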

Highlighted Details

  • Features human-vetted and modified Q&A pairs, addressing limitations of the original GPT-4 generated answers.
  • Aims to provide a high-quality Chinese language corpus for LLM training, enhancing conversational capabilities.
  • Derived from the "Ruozhiba" dataset, with the original project available at https://huggingface.co/datasets/LooksJuicy/ruozhiba.
  • Includes a bibtex citation for academic or research use.

Maintenance & Community

Contributions are welcomed via a linked GitHub issue. A contributor list is maintained.

Licensing & Compatibility

The project is licensed under Apache-2.0, which generally permits commercial use and modification, provided attribution and license terms are followed.

Limitations & Caveats

While human-curated, the dataset originates from "Ruozhiba" (弱智吧), a Baidu Tieba forum known for informal, subjective, and deliberately nonsensical content. Users should be aware that despite the modifications, some inherent biases or informalities may persist. No specific technical limitations or alpha status are mentioned.

Health Check

  • Last Commit: 4 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 9 stars in the last 30 days

Explore Similar Projects

Starred by Kaichao You (Core Maintainer of vLLM), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 1 more.

  • z-bench by zhenbench — 502 stars — Chinese LLM prompt dataset for non-technical users. Created 2 years ago; updated 2 years ago.