AgentDoG  by AI45Lab

AI agent safety and security diagnostic guardrail framework

Created 1 month ago
355 stars

Top 78.9% on SourcePulse

GitHubView on GitHub
Project Summary

Summary

AgentDoG is a diagnostic guardrail framework for AI agent safety and security, focusing on trajectory-level risk assessment. It addresses the limitations of single-step moderation by analyzing multi-step agent executions to detect mid-trajectory safety issues. This enables developers and researchers to build more reliable and secure tool-using AI agents.

How It Works

The framework analyzes full agent execution traces (observations, reasoning, actions) for multi-step interactions. It employs taxonomy-guided diagnosis, assigning fine-grained risk labels (source, failure mode, harm) to explain why unsafe behavior occurs and trace it to specific planning or tool selection steps. This trajectory-level monitoring provides deeper insights into agent failures than traditional output filtering.

Quick Start & Requirements

Deployment utilizes SGLang (>=0.4.6) or vLLM (>=0.10.0) for OpenAI-compatible API endpoints. Example commands:

  • SGLang: python -m sglang.launch_server --model-path AI45Research/AgentDoG-Qwen3-4B --port 30000 --context-length 16384
  • vLLM: vllm serve AI45Research/AgentDoG-Qwen3-4B --port 8000 --max-model-len 16384 Pre-trained models (e.g., AgentDoG-Qwen3-4B, AgentDoG-Qwen2.5-7B, AgentDoG-Llama3.1-8B) are available on Hugging Face/ModelScope organizations. The ATBench dataset is provided for evaluation. Links to documentation and technical reports are accessible via the organizations.

Highlighted Details

  • Achieves state-of-the-art performance on R-Judge, ASSE-Safety, and ATBench benchmarks.
  • Excels at detecting long-horizon instruction hijacking and tool misuse.
  • Demonstrates strong generalization across agent frameworks and LLM backbones.
  • Introduces Agentic XAI Attribution for explaining agent decision-making drivers.
  • Features ATBench with a significantly larger tool library than prior benchmarks.

Maintenance & Community

The README provides no specific details on community channels, active maintainers, roadmaps, or sponsorships.

Licensing & Compatibility

Released under the Apache 2.0 License, permitting commercial use and integration into closed-source projects.

Limitations & Caveats

While AgentDoG shows high accuracy in overall trajectory safety and risk source classification, its fine-grained diagnosis of failure modes has lower accuracy (32.4% for the best FG model). This suggests that pinpointing the exact failure mode within an agent's logic remains a challenge, despite effective detection of unsafe trajectories and risk origins.

Health Check
Last Commit

2 weeks ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
347 stars in the last 30 days

Explore Similar Projects

Starred by Dan Guido Dan Guido(Cofounder of Trail of Bits), Chip Huyen Chip Huyen(Author of "AI Engineering", "Designing Machine Learning Systems"), and
1 more.

cai by aliasrobotics

0.9%
7k
Cybersecurity AI (CAI) is an open framework for building AI-driven cybersecurity tools
Created 11 months ago
Updated 3 weeks ago
Feedback? Help us improve.