loong by camel-ai

Synthetic data generation project using LLM agents

created 4 months ago
309 stars

Top 88.0% on sourcepulse

Project Summary

Loong is an open-source project focused on enabling reasoning-capable AI models to bootstrap themselves by generating and verifying synthetic data. It targets AI researchers and developers looking to scale self-improvement for LLM agents, particularly in domains requiring complex reasoning and verifiable outputs. The project provides a framework and datasets to facilitate this process, aiming to discover scaling laws for agent intelligence.

How It Works

Loong employs an agent-environment loop where a Generator creates synthetic questions and answers from seed datasets. A Verifier then assesses the correctness of these generated responses, often by executing associated rationale code. A Trainable Agent learns iteratively from these verified question-answer pairs, enabling scalable self-improvement through reinforcement learning and advanced strategies. This approach allows models to learn from their own generated data, potentially reducing reliance on expensive human-labeled datasets.
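The generate-then-verify step described above can be sketched in a few lines. This is a minimal illustration, not Loong's actual API: the `Sample` class, `verify`, `bootstrap_round`, and `toy_generator` are hypothetical names, the generator is a hard-coded stand-in for an LLM call, and the convention that rationale code assigns a variable named `result` is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Sample:
    question: str
    rationale: str  # Python source assumed to assign a variable named `result`
    answer: str


def verify(sample: Sample) -> bool:
    """Run the rationale code and compare its `result` with the claimed answer."""
    namespace: dict = {}
    try:
        # Illustration only: untrusted code belongs in a sandbox, not a bare exec().
        exec(sample.rationale, namespace)
    except Exception:
        return False
    return str(namespace.get("result")) == sample.answer


def bootstrap_round(
    seeds: Iterable[Sample],
    generate: Callable[[Sample], List[Sample]],
) -> List[Sample]:
    """Generate candidates from each seed and keep only those that verify."""
    verified = []
    for seed in seeds:
        for candidate in generate(seed):
            if verify(candidate):
                verified.append(candidate)
    return verified


def toy_generator(seed: Sample) -> List[Sample]:
    # Stand-in for an LLM call: one consistent candidate and one inconsistent one.
    return [
        Sample("What is 6 * 7?", "result = 6 * 7", "42"),
        Sample("What is 6 * 8?", "result = 6 * 8", "42"),
    ]


seed = Sample("What is 2 + 3?", "result = 2 + 3", "5")
kept = bootstrap_round([seed], toy_generator)
print([s.question for s in kept])
```

Only the candidate whose executed rationale matches its stated answer survives the round. In a full loop, the surviving pairs would become training data for the agent.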

Quick Start & Requirements

  • Install: Primarily through Python packages and potentially Docker. Specific commands are detailed in the project's cookbooks.
  • Prerequisites: Python environment, with specific dependencies varying by domain (e.g., numpy, pandas, torch). Some datasets may require specific libraries for rationale execution.
  • Resources: Setup time and resource requirements depend on the scale of data generation and training.
  • Links: Cookbooks, Datasets, Loong Blog.

Highlighted Details

  • Includes 8,729 questions across 12 diverse domains, including Advanced Math, Physics, Chemistry, Finance, and Programming.
  • Datasets are structured with questions, answers, rationales (often code), and metadata, allowing for automatic verification.
  • Provides modular "Cookbooks" for synthetic data generation, verification, and RL training loops.
  • Actively seeks community contributions for seed datasets, verifiers, and cookbook improvements.
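To make the record structure above concrete, here is a small sketch that parses records and tallies them by domain. The field names (`question`, `answer`, `rationale`, `metadata`) follow the wording of this summary and are assumptions rather than a confirmed schema. The `domain` key and the sample records are invented for illustration.

```python
import json
from collections import Counter
from typing import Dict, List

# Field names are assumptions taken from the summary's wording, not a confirmed schema.
REQUIRED_FIELDS = ("question", "answer", "rationale", "metadata")

# Invented sample data; the third record deliberately lacks a rationale.
RAW = """
[
  {"question": "What is 12 squared?", "answer": "144",
   "rationale": "result = 12 ** 2", "metadata": {"domain": "advanced_math"}},
  {"question": "Simple interest on 1000 at 5% for 2 years?", "answer": "100.0",
   "rationale": "result = 1000 * 0.05 * 2", "metadata": {"domain": "finance"}},
  {"question": "Record with no rationale", "answer": "n/a",
   "metadata": {"domain": "finance"}}
]
"""


def load_complete_records(text: str) -> List[Dict]:
    """Parse a JSON array and keep only records that carry every required field."""
    return [r for r in json.loads(text) if all(f in r for f in REQUIRED_FIELDS)]


def count_by_domain(records: List[Dict]) -> Counter:
    """Tally records by the (hypothetical) `domain` key inside `metadata`."""
    return Counter(r["metadata"].get("domain", "unknown") for r in records)


records = load_complete_records(RAW)
domain_counts = count_by_domain(records)
print(domain_counts)
```

Records without a rationale are dropped here because, per the summary, the rationale is what makes automatic verification possible.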

Maintenance & Community

  • Led by the CAMEL team, with contributions from the open-source AI research community.
  • Community channels include Discord and WeChat.
  • Actively seeking participants for an "Initiative Program."

Licensing & Compatibility

  • Code License: See the repository's LICENSE file. The specific license is not confirmed here (likely MIT or Apache, as is common for CAMEL projects).
  • Data License: Per-dataset license specified in metadata.json.
  • Compatibility: Suitable for research and potentially commercial use, depending on individual data licenses.
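Because licensing is set per dataset, it can help to collect every license before reusing the data. The sketch below assumes a hypothetical layout of one directory per dataset, each holding a `metadata.json` with a `license` key. Neither the layout nor the key name is confirmed by this summary, and the example builds its own temporary files rather than reading the real repository.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def collect_licenses(root: Path) -> Dict[str, str]:
    """Map each dataset directory name to the license named in its metadata.json."""
    licenses = {}
    for meta_path in sorted(root.glob("*/metadata.json")):
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        # The `license` key is an assumption; missing values are flagged, not guessed.
        licenses[meta_path.parent.name] = meta.get("license", "UNKNOWN")
    return licenses


# Build a throwaway directory tree so the example runs without the real datasets.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name, meta in {"math": {"license": "MIT"}, "finance": {}}.items():
        (root / name).mkdir()
        (root / name / "metadata.json").write_text(json.dumps(meta), encoding="utf-8")
    licenses = collect_licenses(root)

print(licenses)
```

Datasets reported as `UNKNOWN` would need a manual check before any commercial use.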

Limitations & Caveats

The effectiveness of the self-bootstrapping process is contingent on the quality and coverage of the initial seed datasets and the accuracy of the verifiers. Some domains may require specific computational environments for rationale execution.

Health Check

  • Last commit: 4 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 56 stars in the last 90 days
