Synthetic data creation research paper using 1B personas
This repository provides PERSONA HUB, a methodology and dataset for generating diverse synthetic data at scale using LLMs. It targets researchers and developers who need large, varied datasets for training and evaluating LLMs on tasks such as reasoning, instruction following, and knowledge generation. Its main benefit is synthetic data creation that is flexible, scalable, and easy to use.
How It Works
The project leverages a "persona-driven data synthesis" methodology. It curates a massive dataset of 1 billion personas from web data, each representing a unique perspective and knowledge base. These personas act as distributed carriers of world knowledge, allowing LLMs to tap into diverse viewpoints for generating synthetic data. This approach facilitates the creation of high-quality, varied datasets by simulating a wide range of real-world user inputs.
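The persona-driven idea above can be illustrated with a minimal sketch: each persona is inserted into a task prompt, and the resulting prompt is sent to a model. Everything named here is an assumption made for illustration: the template wording, the `build_prompt` and `synthesize` helpers, and the `generate` callable are hypothetical and are not taken from the repository's scripts.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical prompt templates; the wording is illustrative only.
TEMPLATES: Dict[str, str] = {
    "math": "Create a math problem that the following persona might ask:\n{persona}",
    "instruction": "Guess a prompt that the following persona might give to an assistant:\n{persona}",
}


def build_prompt(persona: str, task: str) -> str:
    """Combine one persona description with a task template."""
    if task not in TEMPLATES:
        raise ValueError(f"unknown task: {task!r}")
    return TEMPLATES[task].format(persona=persona.strip())


def synthesize(
    personas: Iterable[str],
    task: str,
    generate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Produce one synthetic record per persona.

    `generate` is any function mapping a prompt string to model output
    (an API client, a local model, or a stub for testing).
    """
    records = []
    for persona in personas:
        prompt = build_prompt(persona, task)
        records.append(
            {
                "persona": persona,
                "prompt": prompt,
                "synthesized_text": generate(prompt),
            }
        )
    return records


if __name__ == "__main__":
    # Stub generator so the sketch runs without any model or API key.
    def echo_model(prompt: str) -> str:
        return f"[model output for a prompt of {len(prompt)} characters]"

    for record in synthesize(["a chemist studying reaction rates"], "math", echo_model):
        print(record["prompt"])
        print(record["synthesized_text"])
```

Because the model call is passed in as a plain callable, the same loop could in principle wrap either a hosted API or a locally served model; varying only the persona text is what yields different outputs for the same task.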
Quick Start & Requirements
bash demo_openai_synthesize.sh (requires datasets, openai, and a configured openai_api_key)
bash demo_vllm_synthesize.sh (requires datasets, transformers, and vllm)
Highlighted Details
Maintenance & Community
Contact: getao@global.tencent.com or dyu@global.tencent.com.
Licensing & Compatibility
The released data is generated by public models (GPT-4, Llama-3, Qwen) and is intended for research purposes only. Users must comply with the respective license agreements and usage policies of these models.
Limitations & Caveats
The data may contain inaccuracies, unsafe content, or biases. Using this data at scale to query a target LLM carries a high risk of dumping that LLM's knowledge and enabling its replication. Ethical and responsible use is essential, including care to prevent privacy violations.