AI45Lab: AI agent safety and security diagnostic guardrail framework
Top 78.9% on SourcePulse
Summary
AgentDoG is a diagnostic guardrail framework for AI agent safety and security, focusing on trajectory-level risk assessment. It addresses the limitations of single-step moderation by analyzing multi-step agent executions to detect mid-trajectory safety issues. This enables developers and researchers to build more reliable and secure tool-using AI agents.
How It Works
The framework analyzes full agent execution traces (observations, reasoning, actions) for multi-step interactions. It employs taxonomy-guided diagnosis, assigning fine-grained risk labels (source, failure mode, harm) to explain why unsafe behavior occurs and trace it to specific planning or tool selection steps. This trajectory-level monitoring provides deeper insights into agent failures than traditional output filtering.
Quick Start & Requirements
Deployment utilizes SGLang (>=0.4.6) or vLLM (>=0.10.0) for OpenAI-compatible API endpoints. Example commands:
python -m sglang.launch_server --model-path AI45Research/AgentDoG-Qwen3-4B --port 30000 --context-length 16384

vllm serve AI45Research/AgentDoG-Qwen3-4B --port 8000 --max-model-len 16384
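Once a server is running, the model can be queried through the standard OpenAI-compatible chat-completions API. The sketch below only builds the request payload and shows how it would be posted; the system-prompt wording and input format are assumptions for illustration, not AgentDoG's documented interface.

```python
import json
import urllib.request

def build_request(trajectory_text: str,
                  model: str = "AI45Research/AgentDoG-Qwen3-4B") -> dict:
    """Build a chat-completions payload for the locally served guardrail model.

    The system prompt and trajectory text are illustrative placeholders;
    consult the AgentDoG documentation for the actual expected input format.
    """
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "You are a safety diagnostic model for agent trajectories."},
            {"role": "user", "content": trajectory_text},
        ],
        "temperature": 0.0,  # deterministic output for diagnostic use
    }

def post_request(payload: dict,
                 base_url: str = "http://localhost:30000/v1") -> dict:
    """POST to the OpenAI-compatible endpoint (requires a running server)."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_request("[Step 0] Observation: ... Reasoning: ... Action: ...")
```

For the vLLM server from the second command, swap the base URL to `http://localhost:8000/v1`; both servers expose the same API shape.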
Pre-trained models (e.g., AgentDoG-Qwen3-4B, AgentDoG-Qwen2.5-7B, AgentDoG-Llama3.1-8B) are available via the Hugging Face and ModelScope organizations. The ATBench dataset is provided for evaluation. Links to documentation and technical reports are accessible through those organizations.
Maintenance & Community
The README provides no specific details on community channels, active maintainers, roadmaps, or sponsorships.
Licensing & Compatibility
Released under the Apache 2.0 License, permitting commercial use and integration into closed-source projects.
Limitations & Caveats
While AgentDoG shows high accuracy in overall trajectory safety assessment and risk-source classification, its fine-grained diagnosis of failure modes is weaker (32.4% accuracy for the best fine-grained model). Pinpointing the exact failure mode within an agent's logic remains a challenge, even where unsafe trajectories and their risk origins are detected reliably.