PostTrainBench by aisa-group

Automating LLM post-training research and development

Created 4 months ago
259 stars

Top 97.7% on SourcePulse

Project Summary

PostTrainBench provides a benchmark for evaluating the capabilities of CLI agents in automating the post-training of large language models (LLMs). It targets researchers and engineers interested in AI R&D, offering a standardized method to measure how effectively agents can improve LLM performance within strict compute constraints (10 hours on a single H100 GPU). The project aims to assess an agent's ability to conduct AI research and development autonomously.

How It Works

The benchmark employs CLI agents, such as Claude Code, Codex CLI, Gemini CLI, and OpenCode, to post-train base LLMs like Qwen3, SmolLM3, and Gemma-3. Agents are given access to an evaluation script and a limited compute budget. Their task is to enhance the base LLM's performance on a given benchmark, with success measured by the post-trained model's score. This setup simulates an AI R&D process, evaluating the agent's strategic approach to model improvement.
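The scoring setup above can be sketched in shell. Everything in this sketch is an illustrative stand-in (model name, scores, variable names), not the benchmark's actual harness:

```shell
# Illustrative sketch of one benchmark run; the model name, scores, and
# budget variables are stand-ins, not the real harness interface.
BASE_MODEL="Qwen3"      # base LLM handed to the agent
TIME_BUDGET_HOURS=10    # compute budget: 10 h on a single H100
baseline_score=40       # stand-in: eval script score before post-training
final_score=55          # stand-in: eval script score after the agent's run

# Success is measured by the post-trained model's score:
if [ "$final_score" -gt "$baseline_score" ]; then
  echo "improved: $baseline_score -> $final_score (model: $BASE_MODEL, budget: ${TIME_BUDGET_HOURS}h)"
fi
```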

Quick Start & Requirements

  • Primary install/run commands: bash containers/build_container.sh standard builds the container, bash containers/download_hf_cache/download_hf_cache.sh downloads the Hugging Face cache, and bash src/commit_utils/commit.sh runs jobs.
  • Non-default prerequisites: apptainer, fuse-overlayfs, and API keys for agents (e.g., OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY). Subscription-based agents require specific local credential files (auth.json, oauth_token).
  • Resource footprint: 10 hours on a single H100 GPU per agent run.
  • Links: Official leaderboard available at posttrainbench.com.
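The steps above can be collected into one setup sequence. The three script invocations come from the project; which API keys you export depends on the agents you run (the values shown are placeholders):

```shell
# Build the apptainer container (requires apptainer and fuse-overlayfs):
bash containers/build_container.sh standard

# Pre-download the Hugging Face model cache:
bash containers/download_hf_cache/download_hf_cache.sh

# Export whichever agent API keys apply to your runs (placeholder values):
export OPENAI_API_KEY=...       # Codex CLI
export ANTHROPIC_API_KEY=...    # Claude Code
export GEMINI_API_KEY=...       # Gemini CLI

# Run jobs:
bash src/commit_utils/commit.sh
```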

Highlighted Details

  • Diverse Benchmarks: Includes 7 evaluation tasks: AIME 2025 (Math), Arena Hard (Writing), BFCL (Function Calling), GPQA (Graduate-level Science), GSM8K (Grade School Math), HealthBench (Medical), and HumanEval (Code Generation).
  • Agent Scaffolds: Supports multiple CLI agent frameworks including Claude Code, Codex CLI, Gemini CLI, and OpenCode, with numerous agent configurations tested.
  • Reward Hacking Mitigation: Detects and penalizes "reward hacking" (e.g., evaluation tampering, model substitution); runs where such behavior is detected are discarded.
  • Non-API Agent Support: Provides methods for integrating agents that rely on subscription services (e.g., ChatGPT Pro, Claude Max) via local authentication tokens and credentials.
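One simple form an evaluation-tampering check could take is fingerprinting the evaluation script before the agent runs and verifying it afterwards. This is a sketch under assumed mechanics, not the project's actual detection code; the file, its contents, and the messages are invented:

```shell
# Hypothetical tamper check: hash the evaluation script before the agent
# runs and compare afterwards; a mismatch would flag the run for discard.
EVAL_SCRIPT=$(mktemp)                       # stand-in for the real eval script
printf 'print("score")\n' > "$EVAL_SCRIPT"  # stand-in contents

BEFORE=$(sha256sum "$EVAL_SCRIPT" | cut -d' ' -f1)
# ... the agent's 10-hour run would happen here ...
AFTER=$(sha256sum "$EVAL_SCRIPT" | cut -d' ' -f1)

if [ "$BEFORE" = "$AFTER" ]; then
  echo "OK: evaluation script unchanged"
else
  echo "DISCARD: evaluation script was modified"  # reward hacking suspected
fi
rm -f "$EVAL_SCRIPT"
```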

Maintenance & Community

Contributions are welcomed via pull requests, issues, or email. Key contacts include Ben Rank, Hardik Bhatnagar, and Maksym Andriushchenko. The roadmap indicates plans for enhanced data decontamination and improved reward hacking detection methods.

Licensing & Compatibility

The provided README does not specify a software license, so compatibility with commercial use or linking against closed-source projects is undetermined and requires clarification from the maintainers.

Limitations & Caveats

The project currently targets the internal HTCondor job scheduler, with planned support for Harbor to facilitate easier deployment on rented cloud hardware. Specific authentication setups are necessary for non-API based agents, and the absence of a stated license presents a potential adoption blocker for commercial applications.

Health Check

  • Last Commit: 4 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 6
  • Star History: 62 stars in the last 30 days
