aisa-group: Automating LLM post-training research and development
Top 97.7% on SourcePulse
PostTrainBench provides a benchmark for evaluating the capabilities of CLI agents in automating the post-training of large language models (LLMs). It targets researchers and engineers interested in AI R&D, offering a standardized method to measure how effectively agents can improve LLM performance within strict compute constraints (10 hours on a single H100 GPU). The project aims to assess an agent's ability to conduct AI research and development autonomously.
How It Works
The benchmark employs CLI agents, such as Claude Code, Codex CLI, Gemini CLI, and OpenCode, to post-train base LLMs like Qwen3, SmolLM3, and Gemma-3. Agents are given access to an evaluation script and a limited compute budget. Their task is to enhance the base LLM's performance on a given benchmark, with success measured by the post-trained model's score. This setup simulates an AI R&D process, evaluating the agent's strategic approach to model improvement.
Quick Start & Requirements
Install and run: bash containers/build_container.sh standard builds the container, bash containers/download_hf_cache/download_hf_cache.sh downloads the cache, and bash src/commit_utils/commit.sh runs jobs.
Prerequisites: apptainer, fuse-overlayfs, and API keys for agents (e.g., OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY). Subscription-based agents require specific local credential files (auth.json, oauth_token).
Website: posttrainbench.com
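The prerequisites above can be checked before building anything. The following is a minimal sketch, not part of the repository: the function names missing_tools, missing_keys, and preflight are hypothetical, and it assumes you pass only the API key variables for the agents you actually plan to run.

```shell
#!/usr/bin/env sh
# Preflight sketch (illustrative only; these helpers are NOT part of PostTrainBench).

# Print each required tool that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Print each environment variable name that is unset or empty.
# eval is used for a portable indirect lookup; pass trusted names only.
missing_keys() {
  for key in "$@"; do
    eval "value=\${$key:-}"
    [ -n "$value" ] || echo "$key"
  done
}

# Succeed only if the tools from the prerequisites list and the
# given API key variables are all present.
preflight() {
  tools=$(missing_tools apptainer fuse-overlayfs)
  keys=$(missing_keys "$@")
  # Unquoted expansion flattens the newline-separated names onto one line.
  [ -z "$tools" ] || echo "missing tools:" $tools >&2
  [ -z "$keys" ] || echo "missing API keys:" $keys >&2
  [ -z "$tools" ] && [ -z "$keys" ]
}

# Usage, run from the repository root (commands as listed above):
#   preflight OPENAI_API_KEY && {
#     bash containers/build_container.sh standard
#     bash containers/download_hf_cache/download_hf_cache.sh
#     bash src/commit_utils/commit.sh
#   }
```

Subscription-based agents authenticate through local credential files rather than environment variables, so this check does not cover them.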
Maintenance & Community
Contributions are welcomed via pull requests, issues, or email. Key contacts include Ben Rank, Hardik Bhatnagar, and Maksym Andriushchenko. The roadmap indicates plans for enhanced data decontamination and improved reward hacking detection methods.
Licensing & Compatibility
The provided README does not specify a software license. Consequently, compatibility for commercial use or linking with closed-source projects is undetermined and requires further clarification.
Limitations & Caveats
The project currently targets an internal HTCondor job scheduler, with planned support for Harbor to facilitate easier deployment on rented cloud hardware. Specific authentication setups are necessary for non-API-based agents, and the absence of a stated license is a potential adoption blocker for commercial applications.