AgentDoG by AI45Lab

AI agent safety and security diagnostic guardrail framework

Created 1 month ago

355 stars

Top 78.9% on SourcePulse

Project Summary

Summary

AgentDoG is a diagnostic guardrail framework for AI agent safety and security, focusing on trajectory-level risk assessment. It addresses the limitations of single-step moderation by analyzing multi-step agent executions to detect mid-trajectory safety issues. This enables developers and researchers to build more reliable and secure tool-using AI agents.

How It Works

The framework analyzes full agent execution traces (observations, reasoning, actions) for multi-step interactions. It employs taxonomy-guided diagnosis, assigning fine-grained risk labels (source, failure mode, harm) to explain why unsafe behavior occurs and trace it to specific planning or tool selection steps. This trajectory-level monitoring provides deeper insights into agent failures than traditional output filtering.

Quick Start & Requirements

Deployment utilizes SGLang (>=0.4.6) or vLLM (>=0.10.0) for OpenAI-compatible API endpoints. Example commands:

SGLang: python -m sglang.launch_server --model-path AI45Research/AgentDoG-Qwen3-4B --port 30000 --context-length 16384
vLLM: vllm serve AI45Research/AgentDoG-Qwen3-4B --port 8000 --max-model-len 16384 Pre-trained models (e.g., AgentDoG-Qwen3-4B, AgentDoG-Qwen2.5-7B, AgentDoG-Llama3.1-8B) are available on Hugging Face/ModelScope organizations. The ATBench dataset is provided for evaluation. Links to documentation and technical reports are accessible via the organizations.

Highlighted Details

Achieves state-of-the-art performance on R-Judge, ASSE-Safety, and ATBench benchmarks.
Excels at detecting long-horizon instruction hijacking and tool misuse.
Demonstrates strong generalization across agent frameworks and LLM backbones.
Introduces Agentic XAI Attribution for explaining agent decision-making drivers.
Features ATBench with a significantly larger tool library than prior benchmarks.

Maintenance & Community

The README provides no specific details on community channels, active maintainers, roadmaps, or sponsorships.

Licensing & Compatibility

Released under the Apache 2.0 License, permitting commercial use and integration into closed-source projects.

Limitations & Caveats

While AgentDoG shows high accuracy in overall trajectory safety and risk source classification, its fine-grained diagnosis of failure modes has lower accuracy (32.4% for the best FG model). This suggests that pinpointing the exact failure mode within an agent's logic remains a challenge, despite effective detection of unsafe trajectories and risk origins.

AgentDoG by AI45Lab

Explore Similar Projects

OS-Agent-Survey by OS-Agent-Survey

open-edison by Edison-Watch

awesome-computer-use by ranpox

MAST by multi-agent-systems-failure-taxonomy

awesome-cybersecurity-agentic-ai by raphabot

OpenDerisk by derisk-ai

agentic-radar by splx-ai

awesome-ai-sdks by e2b-dev

awesome_ai_agents by jim-schwoebel

agentops by AgentOps-AI

cai by aliasrobotics

RagaAI-Catalyst by raga-ai-hub