persona-hub by tencent-ailab

Synthetic data creation research paper using 1B personas

created 1 year ago
1,270 stars

Top 32.0% on sourcepulse

Project Summary

This repository provides PERSONA HUB, a methodology and accompanying dataset for generating diverse synthetic data at scale with LLMs. It targets researchers and developers who need large, varied datasets for training and evaluating LLMs on tasks such as reasoning, instruction following, and knowledge generation. The core benefit is flexible, scalable, and easy-to-use synthetic data creation.

How It Works

The project uses a "persona-driven data synthesis" methodology. It curates a massive dataset of 1 billion personas from web data, each representing a distinct perspective and knowledge base. These personas act as distributed carriers of world knowledge: prompting an LLM from a persona's perspective steers generation toward that persona's knowledge and style, so sampling across many personas yields high-quality, varied data that simulates a wide range of real-world user inputs.
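The idea can be illustrated with a minimal sketch of persona-driven prompt construction. The function and the example persona below are illustrative assumptions, not PersonaHub's actual templates:

```python
def build_synthesis_prompt(persona: str, task: str = "math problem") -> str:
    """Embed a persona description into a data-synthesis prompt so the
    LLM generates content from that persona's perspective."""
    return (
        f"Create a challenging {task} with the following persona:\n"
        f"{persona}"
    )

# Hypothetical persona in the style of the released dataset.
persona = "a machine learning researcher focused on model compression"
prompt = build_synthesis_prompt(persona)
print(prompt)
```

Sending such prompts to an LLM, once per persona, is what lets a single model produce a billion distinct viewpoints instead of repeating its default voice.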

Quick Start & Requirements

  • OpenAI Demo: bash demo_openai_synthesize.sh (requires the datasets and openai packages and a configured openai_api_key).
  • Open-Source Demo: bash demo_vllm_synthesize.sh (requires the datasets, transformers, and vllm packages).
  • Data Preview: Available on Hugging Face.

Highlighted Details

  • Released 370 million "elite" personas (personas described as having top-1% or top-0.1% skills).
  • Supports data synthesis using GPT-4o or open-source models via vLLM.
  • Offers pre-generated synthetic data samples: 50k math, 50k logic, 50k instructions, 10k knowledge texts, 10k NPCs, 5k tools.
  • Customizable prompt templates are available.
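A customizable template setup might look like the following sketch, keyed by the task types listed above. The template strings here are illustrative assumptions, not the repository's actual prompts:

```python
# Hypothetical task templates; swap in or edit entries to customize synthesis.
TEMPLATES = {
    "math": "Create a math problem that the following persona might pose:\n{persona}",
    "logic": "Create a logical reasoning puzzle suited to this persona:\n{persona}",
    "instruction": "Guess an instruction this persona would give to an AI assistant:\n{persona}",
}

def render(task: str, persona: str) -> str:
    """Fill the chosen task template with a persona description."""
    return TEMPLATES[task].format(persona=persona)

result = render("logic", "a chess coach who trains junior players")
print(result)
```

Keeping templates as data rather than hard-coded strings makes it easy to add new task types without touching the synthesis loop.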

Maintenance & Community

Contact: getao@global.tencent.com or dyu@global.tencent.com.

Licensing & Compatibility

The released data is generated with publicly available models (GPT-4, Llama-3, and Qwen) and is intended for research purposes only. Users must comply with the respective license agreements and usage policies of these models.

Limitations & Caveats

The data may contain inaccuracies, unsafe content, or biases. Because persona-driven queries can elicit a target LLM's knowledge at scale, using this data to query other LLMs carries a risk of knowledge extraction ("dumping") and model replication. Ethical and responsible use is essential to avoid privacy violations.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 139 stars in the last 90 days
