agent-safety-eval-lab by YutoTerashima

LLM agent safety and tool-use evaluation

Created 2 months ago

345 stars

Top 80.1% on SourcePulse

Project Summary

This project provides a reproducible lab for evaluating Large Language Model (LLM) agents as systems, focusing on their complete execution traces, tool usage, policy adherence, and safety outcomes. It targets engineers and researchers who need to assess agent reliability beyond single-message interactions, and it offers a systematic way to identify and mitigate risks in complex agent workflows.

How It Works

The lab runs in mock mode by default, so evaluations can be executed in isolation and replayed without live API calls. Adapters for OpenAI, Hugging Face, and LiteLLM normalize agent traces into a common form. A core pipeline then records each trace, grades tool policy adherence, applies safety rubrics, and produces a risk report. The report supports detailed analysis of agent behavior, including denied tool counts and latency.
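The pipeline described above (record a trace, grade tool policy, apply a safety rubric, emit a risk report) can be illustrated with a minimal sketch. Everything named here is hypothetical and is not the project's API: the `Trace` and `ToolCall` classes, the `ALLOWED_TOOLS` allow-list, the `UNSAFE_MARKERS` rubric, and the `mock_run` function are assumptions chosen only to show the shape of such a pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    latency_ms: float


@dataclass
class Trace:
    agent_id: str
    tool_calls: list = field(default_factory=list)
    final_output: str = ""


# Hypothetical policy: tools an agent may call in this toy scenario.
ALLOWED_TOOLS = {"search", "calculator"}

# Hypothetical rubric: phrases whose presence marks an output as unsafe.
UNSAFE_MARKERS = ("password", "api key")


def grade_tool_policy(trace: Trace) -> int:
    """Count tool calls that fall outside the allow-list."""
    return sum(1 for call in trace.tool_calls if call.name not in ALLOWED_TOOLS)


def apply_safety_rubric(trace: Trace) -> bool:
    """Return True when the final output contains no unsafe marker."""
    text = trace.final_output.lower()
    return not any(marker in text for marker in UNSAFE_MARKERS)


def risk_report(trace: Trace) -> dict:
    """Combine policy and rubric results into a small report."""
    latencies = [call.latency_ms for call in trace.tool_calls]
    return {
        "agent_id": trace.agent_id,
        "denied_tool_count": grade_tool_policy(trace),
        "mean_latency_ms": sum(latencies) / len(latencies) if latencies else 0.0,
        "safe_output": apply_safety_rubric(trace),
    }


def mock_run() -> Trace:
    """Stand-in for a mock runner: returns a fixed, replayable trace."""
    return Trace(
        agent_id="mock-agent",
        tool_calls=[ToolCall("search", 12.0), ToolCall("shell", 30.0)],
        final_output="The answer is 42.",
    )


if __name__ == "__main__":
    print(risk_report(mock_run()))
```

Because `mock_run` returns the same trace on every call, the report is deterministic, which is the property that makes a mock mode replayable. A real adapter would presumably replace `mock_run` with code that converts provider responses into the same `Trace` shape.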

Quick Start & Requirements

Primary install / run command: pip install -e ".[dev]" within a Python virtual environment (.venv), followed by python examples/run_mock_eval.py.
Non-default prerequisites and dependencies: Python 3.x and a virtual environment. Advanced experiments may require CUDA and a conda environment named Transformers (e.g., conda run -n Transformers python scripts/run_experiment.py --device cuda).
Links: Architecture, Research Brief, Full Trace Suite.

Highlighted Details

Reproducible Evaluation: Assesses LLM agents by inspecting full execution trajectories, tool calls, and safety outcomes.
Mock Mode & Replayability: Enables testing and analysis without live API calls, using a default mock runner and replayable trace data.
Real-World Data Integration: Includes experiments with a sample from PKU-Alignment/BeaverTails, featuring GPU-backed runs for safety-risk analysis.
Performance Benchmarks: V2 research results show a TF-IDF logistic baseline achieving ~0.774 macro-F1 and ~0.859 AUROC on 50k examples.
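For readers unfamiliar with the two metrics quoted above, the following sketch computes macro-F1 and AUROC from scratch on a handful of made-up labels. It is a standard-library illustration of the metric definitions only: the toy data is invented, it does not reproduce the project's baseline or its reported numbers, and the pairwise AUROC loop is written for clarity rather than for 50k-example datasets.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def auroc(y_true, y_score):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Invented toy data: 1 = unsafe, 0 = safe.
    y_true = [1, 1, 0, 0, 1, 0]
    y_score = [0.9, 0.6, 0.4, 0.7, 0.8, 0.2]
    y_pred = [1 if s >= 0.5 else 0 for s in y_score]
    print(f"macro-F1: {macro_f1(y_true, y_pred):.3f}")
    print(f"AUROC:    {auroc(y_true, y_score):.3f}")
```

Macro-F1 depends on the decision threshold (0.5 in this toy run), while AUROC depends only on how the scores rank positives against negatives, which is why the two numbers are usually reported together.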

Maintenance & Community

No community links (e.g., Discord, Slack) or details on maintainers or sponsorships are listed in the README.

Licensing & Compatibility

The README does not specify a license.

Limitations & Caveats

Publicly shared failure examples are redacted to avoid exposing sensitive content, though metadata remains for reproducibility. GPU-backed experiments require specific conda environments and hardware. Some experimental configurations, like the GPU MLP, may require further calibration for production use.
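One way redaction can coexist with reproducibility is sketched below: sensitive text fields are replaced with a placeholder while the remaining metadata is kept unchanged. The field names (`prompt`, `response`, `id`, `category`) and the added SHA-256 digest are assumptions made for illustration; the source states only that examples are redacted and that metadata remains.

```python
import hashlib


def redact_example(example: dict, sensitive_keys=("prompt", "response")) -> dict:
    """Return a copy with sensitive fields replaced by a placeholder.

    A SHA-256 digest of each redacted value is stored alongside it so that
    someone holding the original data can confirm they have the same record.
    """
    redacted = {}
    for key, value in example.items():
        if key in sensitive_keys:
            digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
            redacted[key] = "[REDACTED]"
            redacted[f"{key}_sha256"] = digest
        else:
            # Non-sensitive metadata is copied through unchanged.
            redacted[key] = value
    return redacted


if __name__ == "__main__":
    # Invented record; the keys are hypothetical, not the project's schema.
    record = {
        "id": "case-001",
        "category": "toy-category",
        "prompt": "example sensitive prompt",
        "response": "example sensitive response",
    }
    print(redact_example(record))
```

The function builds a new dictionary rather than mutating its input, so the unredacted record stays available to the private evaluation run while only the redacted copy is published.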
