agent-safety-eval-lab  by YutoTerashima

LLM agent safety and tool-use evaluation

Created 3 weeks ago

360 stars

Top 77.7% on SourcePulse

Project Summary

This project provides a reproducible lab for evaluating Large Language Model (LLM) agents as systems, focusing on their complete execution traces, tool usage, policy adherence, and safety outcomes. It targets engineers and researchers who need to assess agent reliability beyond single-message interactions, and it offers a systematic way to identify and mitigate risks in complex agent workflows.

How It Works

The lab runs in mock mode by default, which allows isolated testing and replayable runs. It integrates with several LLM adapters (OpenAI, Hugging Face, LiteLLM) by normalizing their agent traces into a common form. A core pipeline records traces, grades tool policy adherence, and applies safety rubrics, culminating in a risk report. This supports detailed analysis of agent behavior, including denied tool counts and latency.
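The record, grade, and report stages described above can be pictured with a small Python sketch. Everything in it is hypothetical and chosen for illustration: the `Trace` and `ToolCall` classes, the `ALLOWED_TOOLS` policy, the `BLOCKED_PHRASES` rubric, and the report fields are assumptions, not the project's actual API or schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToolCall:
    """One tool invocation in a normalized trace (hypothetical shape)."""
    name: str
    latency_ms: float


@dataclass
class Trace:
    """A recorded agent run: its tool calls plus the final answer."""
    trace_id: str
    tool_calls: List[ToolCall] = field(default_factory=list)
    final_output: str = ""


# Hypothetical policy and rubric, invented for this sketch.
ALLOWED_TOOLS = {"search", "calculator"}
BLOCKED_PHRASES = ("rm -rf",)


def grade_tool_policy(trace: Trace) -> List[str]:
    """Return the names of tool calls that the policy would deny."""
    return [c.name for c in trace.tool_calls if c.name not in ALLOWED_TOOLS]


def apply_safety_rubric(trace: Trace) -> bool:
    """Pass when the final output contains none of the blocked phrases."""
    return not any(p in trace.final_output for p in BLOCKED_PHRASES)


def build_risk_report(trace: Trace) -> Dict[str, object]:
    """Combine the policy grade, rubric result, and latency into one report."""
    denied = grade_tool_policy(trace)
    return {
        "trace_id": trace.trace_id,
        "denied_tool_count": len(denied),
        "denied_tools": denied,
        "total_latency_ms": sum(c.latency_ms for c in trace.tool_calls),
        "rubric_passed": apply_safety_rubric(trace),
    }


# A mock trace stands in for a live model call, so the run is replayable.
mock_trace = Trace(
    trace_id="mock-001",
    tool_calls=[ToolCall("search", 12.0), ToolCall("shell", 30.5)],
    final_output="done",
)
report = build_risk_report(mock_trace)
print(report)
```

Because the trace is plain data, the same graders can be re-run on a stored trace without calling any model, which is the property the mock mode is described as providing.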

Quick Start & Requirements

  • Primary install / run command: pip install -e ".[dev]" within a Python virtual environment (.venv), followed by python examples/run_mock_eval.py.
  • Non-default prerequisites and dependencies: Python 3.x, virtual environment. Advanced experiments may require CUDA (e.g., conda run -n Transformers python scripts/run_experiment.py --device cuda) and a specific Transformers conda environment.
  • Links: Architecture, Research Brief, Full Trace Suite.

Highlighted Details

  • Reproducible Evaluation: Assesses LLM agents by inspecting full execution trajectories, tool calls, and safety outcomes.
  • Mock Mode & Replayability: Enables testing and analysis without live API calls, using a default mock runner and replayable trace data.
  • Real-World Data Integration: Includes experiments with a sample from PKU-Alignment/BeaverTails, featuring GPU-backed runs for safety-risk analysis.
  • Performance Benchmarks: V2 research results show a TF-IDF logistic baseline achieving ~0.774 macro-F1 and ~0.859 AUROC on 50k examples.
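For readers unfamiliar with the two metrics quoted in the benchmark bullet, the following plain-Python sketch shows how macro-F1 and AUROC are defined for a binary task. The labels and scores are made-up toy values and have no relation to the reported ~0.774 and ~0.859 figures; the function names are illustrative and this is not the project's evaluation code.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Assumes labels are 0/1 and both classes occur.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold at 0.5

f1 = macro_f1(y_true, y_pred)
auc = auroc(y_true, scores)
print(f"macro-F1={f1:.3f} AUROC={auc:.3f}")
```

Macro-F1 depends on the chosen decision threshold, while AUROC depends only on how the scores rank the examples, which is why the two numbers are usually reported together.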

Maintenance & Community

The README lists no community links (e.g., Discord, Slack) and gives no details on maintainers or sponsorships.

Licensing & Compatibility

The README does not specify a license.

Limitations & Caveats

Publicly shared failure examples are redacted to avoid exposing sensitive content, though metadata remains for reproducibility. GPU-backed experiments require specific conda environments and hardware. Some experimental configurations, like the GPU MLP, may require further calibration for production use.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 361 stars in the last 26 days