benchflow-ai
Framework for benchmarking AI agents in complex, sandboxed environments
Top 95.4% on SourcePulse
Summary
BenchFlow is a framework for building high-fidelity, complex RL environments and evaluation tasks. It benchmarks AI agents (single-agent, multi-agent, and multi-round) in sandboxed environments through a unified scene-based lifecycle, giving researchers and developers a common platform for agent evaluation.
How It Works
BenchFlow uses a scene-based lifecycle for single-agent, multi-agent (e.g., coder + reviewer), and multi-round scenarios, and integrates with commercial and custom agents through a Python BaseUser callback. Sandboxing options include Docker (local), Daytona (parallel cloud), and Modal (serverless/GPU). The verifier is hardened against reward hacking by default; individual tasks can opt out.
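To make the idea of a scene-based lifecycle concrete, here is a minimal, self-contained sketch of a coder + reviewer scene run over several rounds. It does not import BenchFlow, and every name in it (Scene, run_scene, AgentFn, coder, reviewer) is hypothetical; the real BaseUser callback interface is not described here and may look quite different.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical types for illustration only; these are NOT BenchFlow's API.
# An agent is modeled as a function: (role, transcript so far) -> message.
AgentFn = Callable[[str, list[str]], str]


@dataclass
class Scene:
    """One scene: an ordered list of roles that take turns for some rounds."""
    roles: list[str]
    rounds: int = 1
    transcript: list[str] = field(default_factory=list)


def run_scene(scene: Scene, agents: dict[str, AgentFn]) -> list[str]:
    """Call every role once per round and record each message in order."""
    for round_index in range(scene.rounds):
        for role in scene.roles:
            message = agents[role](role, scene.transcript)
            scene.transcript.append(f"[round {round_index + 1}] {role}: {message}")
    return scene.transcript


def coder(role: str, transcript: list[str]) -> str:
    # Toy agent: emits a new patch version each round (two entries per round).
    return "patch v%d" % (len(transcript) // 2 + 1)


def reviewer(role: str, transcript: list[str]) -> str:
    # Toy agent: reviews whatever the previous speaker just produced.
    return "reviewed " + transcript[-1].split(": ", 1)[1]


if __name__ == "__main__":
    scene = Scene(roles=["coder", "reviewer"], rounds=2)
    for line in run_scene(scene, {"coder": coder, "reviewer": reviewer}):
        print(line)
```

A single-agent scene is the same loop with one role, and a multi-round scene only changes the rounds value, which is one plausible reading of what a "unified" lifecycle means here.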
Quick Start & Requirements
Install via pip: pip install --upgrade benchflow. To install the CLI with uv, use uv tool install --prerelease allow --upgrade 'benchflow==0.5.2' for the public release, or uv tool install --prerelease allow --upgrade benchflow for the internal preview. Requires Python 3.12+ and uv. Authentication for cloud backends and agents is handled through environment variables or login commands. The documentation begins with "Getting started" and "Concepts."
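Before installing, it can help to confirm the two stated prerequisites (Python 3.12+ and uv). The following is a small sketch using only the standard library; the preflight function is a hypothetical helper and is not part of BenchFlow.

```python
import shutil
import sys


def preflight(min_python: tuple[int, int] = (3, 12)) -> list[str]:
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    # Compare only (major, minor) against the required minimum.
    if sys.version_info[:2] < min_python:
        found = "%d.%d" % sys.version_info[:2]
        needed = "%d.%d" % min_python
        problems.append(f"Python {needed}+ required, found {found}")
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which("uv") is None:
        problems.append("uv not found on PATH")
    return problems


if __name__ == "__main__":
    issues = preflight()
    print("ready to install" if not issues else "\n".join(issues))
```

Returning a list instead of raising on the first failure lets both problems be reported in one run.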
Highlighted Details
Maintenance & Community
Contributions are welcomed via PRs to main, with CI enforcing linting and tests. Development follows a release channel model: main merges publish internal previews, while tagged releases create public PyPI distributions. No specific community channels are mentioned.
Licensing & Compatibility
Distributed under the Apache-2.0 license, generally permitting commercial use and integration into closed-source projects.
Limitations & Caveats
The current public release is 0.5.2. The public CLI install requires --prerelease allow due to a pinned LiteLLM release-candidate dependency. The framework is actively developed, with internal preview releases available.