PostTrainBench by aisa-group

Automating LLM post-training research and development

Created 4 months ago
259 stars

Top 97.7% on SourcePulse

Project Summary

PostTrainBench provides a benchmark for evaluating the capabilities of CLI agents in automating the post-training of large language models (LLMs). It targets researchers and engineers interested in AI R&D, offering a standardized method to measure how effectively agents can improve LLM performance within strict compute constraints (10 hours on a single H100 GPU). The project aims to assess an agent's ability to conduct AI research and development autonomously.

How It Works

The benchmark employs CLI agents, such as Claude Code, Codex CLI, Gemini CLI, and OpenCode, to post-train base LLMs like Qwen3, SmolLM3, and Gemma-3. Agents are given access to an evaluation script and a limited compute budget. Their task is to enhance the base LLM's performance on a given benchmark, with success measured by the post-trained model's score. This setup simulates an AI R&D process, evaluating the agent's strategic approach to model improvement.
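The scoring setup above can be sketched in shell. Everything in this sketch is an illustrative stand-in (model name, scores, variable names), not the benchmark's actual harness:

```shell
# Illustrative sketch of one benchmark run; the model name, scores, and
# budget variables are stand-ins, not the real harness interface.
BASE_MODEL="Qwen3"      # base LLM handed to the agent
TIME_BUDGET_HOURS=10    # compute budget: 10 h on a single H100
baseline_score=40       # stand-in: eval script score before post-training
final_score=55          # stand-in: eval script score after the agent's run

# Success is measured by the post-trained model's score:
if [ "$final_score" -gt "$baseline_score" ]; then
  echo "improved: $baseline_score -> $final_score (model: $BASE_MODEL, budget: ${TIME_BUDGET_HOURS}h)"
fi
```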

Quick Start & Requirements

  • Primary install/run commands: bash containers/build_container.sh standard builds the container, bash containers/download_hf_cache/download_hf_cache.sh downloads the Hugging Face cache, and bash src/commit_utils/commit.sh runs jobs.
  • Non-default prerequisites: apptainer, fuse-overlayfs, and API keys for agents (e.g., OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY). Subscription-based agents require specific local credential files (auth.json, oauth_token).
  • Resource footprint: 10 hours on a single H100 GPU per agent run.
  • Links: Official leaderboard available at posttrainbench.com.
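The steps above can be collected into one setup sequence. The three script invocations come from the project; which API keys you export depends on the agents you run (the values shown are placeholders):

```shell
# Build the apptainer container (requires apptainer and fuse-overlayfs):
bash containers/build_container.sh standard

# Pre-download the Hugging Face model cache:
bash containers/download_hf_cache/download_hf_cache.sh

# Export whichever agent API keys apply to your runs (placeholder values):
export OPENAI_API_KEY=...       # Codex CLI
export ANTHROPIC_API_KEY=...    # Claude Code
export GEMINI_API_KEY=...       # Gemini CLI

# Run jobs:
bash src/commit_utils/commit.sh
```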

Highlighted Details

  • Diverse Benchmarks: Includes 7 evaluation tasks: AIME 2025 (Math), Arena Hard (Writing), BFCL (Function Calling), GPQA (Graduate-level Science), GSM8K (Grade School Math), HealthBench (Medical), and HumanEval (Code Generation).
  • Agent Scaffolds: Supports multiple CLI agent frameworks including Claude Code, Codex CLI, Gemini CLI, and OpenCode, with numerous agent configurations tested.
  • Reward Hacking Mitigation: Detects and penalizes "reward hacking" (e.g., evaluation tampering, model substitution); runs where such behavior is detected are discarded.
  • Non-API Agent Support: Provides methods for integrating agents that rely on subscription services (e.g., ChatGPT Pro, Claude Max) via local authentication tokens and credentials.
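One simple form an evaluation-tampering check could take is fingerprinting the evaluation script before the agent runs and verifying it afterwards. This is a sketch under assumed mechanics, not the project's actual detection code; the file, its contents, and the messages are invented:

```shell
# Hypothetical tamper check: hash the evaluation script before the agent
# runs and compare afterwards; a mismatch would flag the run for discard.
EVAL_SCRIPT=$(mktemp)                       # stand-in for the real eval script
printf 'print("score")\n' > "$EVAL_SCRIPT"  # stand-in contents

BEFORE=$(sha256sum "$EVAL_SCRIPT" | cut -d' ' -f1)
# ... the agent's 10-hour run would happen here ...
AFTER=$(sha256sum "$EVAL_SCRIPT" | cut -d' ' -f1)

if [ "$BEFORE" = "$AFTER" ]; then
  echo "OK: evaluation script unchanged"
else
  echo "DISCARD: evaluation script was modified"  # reward hacking suspected
fi
rm -f "$EVAL_SCRIPT"
```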

Maintenance & Community

Contributions are welcomed via pull requests, issues, or email. Key contacts include Ben Rank, Hardik Bhatnagar, and Maksym Andriushchenko. The roadmap indicates plans for enhanced data decontamination and improved reward hacking detection methods.

Licensing & Compatibility

The provided README does not specify a software license, so compatibility with commercial use or linking against closed-source projects is undetermined and requires clarification from the maintainers.

Limitations & Caveats

The project currently targets the internal HTCondor job scheduler, with planned support for Harbor to facilitate easier deployment on rented cloud hardware. Specific authentication setups are necessary for non-API based agents, and the absence of a stated license presents a potential adoption blocker for commercial applications.

Health Check

  • Last Commit: 4 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 6
  • Star History: 62 stars in the last 30 days
