loong by camel-ai

Synthetic data generation project using LLM agents

Created 11 months ago

487 stars

Top 63.3% on SourcePulse

1 Expert Loves This Project

Wing Lian

Founder of Axolotl AI

Project Summary

Loong is an open-source project focused on enabling reasoning-capable AI models to bootstrap themselves by generating and verifying synthetic data. It targets AI researchers and developers looking to scale self-improvement for LLM agents, particularly in domains requiring complex reasoning and verifiable outputs. The project provides a framework and datasets to facilitate this process, aiming to discover scaling laws for agent intelligence.

How It Works

Loong employs an agent-environment loop where a Generator creates synthetic questions and answers from seed datasets. A Verifier then assesses the correctness of these generated responses, often by executing associated rationale code. A Trainable Agent learns iteratively from these verified question-answer pairs, enabling scalable self-improvement through reinforcement learning and advanced strategies. This approach allows models to learn from their own generated data, potentially reducing reliance on expensive human-labeled datasets.
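The generate-then-verify loop described above can be sketched roughly as follows. This is a minimal illustration and not Loong's actual API: `QAPair`, `generate`, `verify`, and `bootstrap` are hypothetical names, the generator is a deterministic stand-in for an LLM, and the verifier simply recomputes an arithmetic answer. The training step is left out; the sketch stops at producing the pool of verified pairs an agent would learn from.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QAPair:
    question: str  # e.g. "3 + 4"
    answer: str    # proposed answer, possibly wrong


def generate(seed: QAPair, n: int) -> List[QAPair]:
    """Stand-in generator (hypothetical): builds n addition problems from a seed.

    Every third candidate gets a deliberately wrong answer to imitate
    the mistakes an LLM generator would make.
    """
    a, b = (int(x) for x in seed.question.split(" + "))
    candidates = []
    for k in range(1, n + 1):
        total = (a + k) + (b + k)
        if k % 3 == 0:
            total += 1  # inject an error
        candidates.append(QAPair(f"{a + k} + {b + k}", str(total)))
    return candidates


def verify(pair: QAPair) -> bool:
    """Stand-in verifier (hypothetical): recomputes the sum independently."""
    a, b = (int(x) for x in pair.question.split(" + "))
    return str(a + b) == pair.answer


def bootstrap(seeds: List[QAPair], per_seed: int) -> List[QAPair]:
    """One pass of the loop: generate candidates, keep only verified ones."""
    verified = []
    for seed in seeds:
        for candidate in generate(seed, per_seed):
            if verify(candidate):
                verified.append(candidate)
    return verified


# Six candidates are generated; the two with injected errors are filtered out.
training_pool = bootstrap([QAPair("3 + 4", "7")], per_seed=6)
```

Keeping the verifier independent of the generator is the point of the design: only pairs that survive an outside check reach the training pool, so generator mistakes are not fed back into the agent.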

Quick Start & Requirements

Install: Primarily through Python packages and potentially Docker. Specific commands are detailed in the project's cookbooks.
Prerequisites: Python environment, with specific dependencies varying by domain (e.g., numpy, pandas, torch). Some datasets may require specific libraries for rationale execution.
Resources: Setup time and resource requirements depend on the scale of data generation and training.
Links: Cookbooks, Datasets, Loong Blog.

Highlighted Details

Includes 8,729 questions across 12 domains, such as Advanced Math, Physics, Chemistry, Finance, and Programming.
Datasets are structured with questions, answers, rationales (often code), and metadata, allowing for automatic verification.
Provides modular "Cookbooks" for synthetic data generation, verification, and RL training loops.
Actively seeks community contributions for seed datasets, verifiers, and cookbook improvements.
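To make the dataset structure and automatic verification concrete, here is a small sketch under stated assumptions. The field names (`question`, `final_answer`, `rationale`, `metadata`) and the helpers `run_rationale` and `is_verified` are hypothetical and chosen for illustration; the actual schema is defined by the project's datasets. The sketch assumes the rationale is Python source that prints its result.

```python
import contextlib
import io
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Record:
    question: str
    final_answer: str
    rationale: str  # assumed: Python source that prints the answer
    metadata: Dict[str, str] = field(default_factory=dict)


def run_rationale(code: str) -> str:
    """Execute rationale code and return what it printed, stripped.

    exec() is used only to keep the sketch short; untrusted code should
    run in a sandboxed subprocess or container instead.
    """
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {"__name__": "__rationale__"})
    return buffer.getvalue().strip()


def is_verified(record: Record) -> bool:
    """A record passes when its rationale reproduces the stated answer."""
    try:
        return run_rationale(record.rationale) == record.final_answer.strip()
    except Exception:
        return False  # a crashing rationale counts as unverified


good = Record(
    question="How many ways can 3 items be chosen from 10?",
    final_answer="120",
    rationale="import math\nprint(math.comb(10, 3))",
    metadata={"domain": "math"},
)
bad = Record(
    question="What is 2 ** 10?",
    final_answer="1000",
    rationale="print(2 ** 10)",
)
```

Calling `is_verified(good)` accepts the first record, while `is_verified(bad)` rejects the second because the executed rationale prints 1024 rather than the stated 1000.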

Maintenance & Community

Led by the CAMEL team, with contributions from the open-source AI research community.
Community channels include Discord and WeChat.
Actively seeking participants for an "Initiative Program."

Licensing & Compatibility

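Code License: See the LICENSE file in the repository. The license type is not confirmed here; MIT or Apache is likely, as is common for CAMEL projects, so check the file before relying on it.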
Data License: Per-dataset license specified in metadata.json.
Compatibility: Suitable for research and potentially commercial use, depending on individual data licenses.

Limitations & Caveats

The effectiveness of the self-bootstrapping process is contingent on the quality and coverage of the initial seed datasets and the accuracy of the verifiers. Some domains may require specific computational environments for rationale execution.

Health Check

Last Commit

5 days ago

Responsiveness

1 day

Star History

6 stars in the last 30 days