persona-hub by tencent-ailab

Synthetic data creation research paper using 1B personas

created 1 year ago
1,270 stars

Top 32.0% on sourcepulse

Project Summary

This repository provides PERSONA HUB, a methodology and accompanying dataset for generating diverse synthetic data at scale with LLMs. It targets researchers and developers who need large, varied datasets for training and evaluating LLMs on tasks such as reasoning, instruction following, and knowledge generation. The core benefit is flexible, scalable, and easy-to-use synthetic data creation.

How It Works

The project uses a "persona-driven data synthesis" methodology. It curates a massive dataset of 1 billion personas from web data, each representing a distinct perspective and knowledge base. These personas act as distributed carriers of world knowledge: prompting an LLM from a persona's perspective steers generation toward that persona's knowledge and style, so sampling across many personas yields high-quality, varied data that simulates a wide range of real-world user inputs.
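The idea can be illustrated with a minimal sketch of persona-driven prompt construction. The function and the example persona below are illustrative assumptions, not PersonaHub's actual templates:

```python
def build_synthesis_prompt(persona: str, task: str = "math problem") -> str:
    """Embed a persona description into a data-synthesis prompt so the
    LLM generates content from that persona's perspective."""
    return (
        f"Create a challenging {task} with the following persona:\n"
        f"{persona}"
    )

# Hypothetical persona in the style of the released dataset.
persona = "a machine learning researcher focused on model compression"
prompt = build_synthesis_prompt(persona)
print(prompt)
```

Sending such prompts to an LLM, once per persona, is what lets a single model produce a billion distinct viewpoints instead of repeating its default voice.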

Quick Start & Requirements

  • OpenAI Demo: bash demo_openai_synthesize.sh (requires the datasets and openai packages and a configured openai_api_key).
  • Open-Source Demo: bash demo_vllm_synthesize.sh (requires the datasets, transformers, and vllm packages).
  • Data Preview: Available on Hugging Face.

Highlighted Details

  • Released 370 million "elite" personas (personas described as having top-1% or top-0.1% skills).
  • Supports data synthesis using GPT-4o or open-source models via vLLM.
  • Offers pre-generated synthetic data samples: 50k math, 50k logic, 50k instructions, 10k knowledge texts, 10k NPCs, 5k tools.
  • Customizable prompt templates are available.
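A customizable template setup might look like the following sketch, keyed by the task types listed above. The template strings here are illustrative assumptions, not the repository's actual prompts:

```python
# Hypothetical task templates; swap in or edit entries to customize synthesis.
TEMPLATES = {
    "math": "Create a math problem that the following persona might pose:\n{persona}",
    "logic": "Create a logical reasoning puzzle suited to this persona:\n{persona}",
    "instruction": "Guess an instruction this persona would give to an AI assistant:\n{persona}",
}

def render(task: str, persona: str) -> str:
    """Fill the chosen task template with a persona description."""
    return TEMPLATES[task].format(persona=persona)

result = render("logic", "a chess coach who trains junior players")
print(result)
```

Keeping templates as data rather than hard-coded strings makes it easy to add new task types without touching the synthesis loop.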

Maintenance & Community

Contact: getao@global.tencent.com or dyu@global.tencent.com.

Licensing & Compatibility

The released data is generated with publicly available models (GPT-4, Llama-3, and Qwen) and is intended for research purposes only. Users must comply with the respective license agreements and usage policies of these models.

Limitations & Caveats

The data may contain inaccuracies, unsafe content, or biases. Because persona-driven queries can elicit a target LLM's knowledge at scale, using this data to query other LLMs carries a risk of knowledge extraction ("dumping") and model replication. Ethical and responsible use is essential to avoid privacy violations.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 139 stars in the last 90 days
