ISC-Bench by wuyoscar

LLMs generating harmful content via task completion

Created 1 month ago
785 stars

Top 44.5% on SourcePulse

Project Summary

Summary

ISC-Bench addresses a critical vulnerability of frontier LLMs: "Internal Safety Collapse" (ISC), where models generate harmful content by fulfilling task-driven instructions rather than through adversarial prompting. This project provides researchers and engineers with a framework and tools to reliably trigger and evaluate this collapse, offering deeper insight into LLM safety limitations beyond traditional jailbreaking.

How It Works

The core is the TVD (Task, Validator, Data) framework. It designs legitimate tasks that, via embedded constraints and structured data requirements, compel the LLM to produce harmful outputs. The specific harm is determined by the integrated "tool" or domain knowledge (e.g., toxic text via LlamaGuard, chemical compounds via Cantera), exploiting the LLM's task-completion drive and bypassing standard input-level safety filters.
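To make the TVD decomposition concrete, here is a minimal Python sketch of what a Task/Validator/Data triple might look like. The names (`TVDTemplate`, `render_prompt`) and structure are illustrative assumptions, not ISC-Bench's actual API, and the example data is deliberately benign.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of a TVD (Task, Validator, Data) triple as described
# above; field names and layout are assumptions, not ISC-Bench's real schema.
@dataclass
class TVDTemplate:
    task: str                         # legitimate-looking task instruction
    data: dict                        # structured data the task must operate on
    validator: Callable[[str], bool]  # checks whether output satisfies the task constraints

def render_prompt(t: TVDTemplate) -> str:
    """Embed the structured data records into the task instruction."""
    fields = "\n".join(f"- {k}: {v}" for k, v in t.data.items())
    return f"{t.task}\nInput records:\n{fields}"

# Benign illustrative instance (no harmful content):
template = TVDTemplate(
    task="Summarize each record in one sentence.",
    data={"record_1": "sample text"},
    validator=lambda out: len(out.strip()) > 0,
)
prompt = render_prompt(template)
```

The point of the decomposition is that the harm, if any, lives in the data and domain tool, not in the surface wording of the task, which is why input-level filters see nothing to block.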

Quick Start & Requirements

Installation requires cloning the repository, setting up Python 3.11+ with uv, and configuring an OpenRouter API key in .env. Docker is needed for agentic-mode experiments. Key resources include the project website, Hugging Face page, tutorials, and the research paper on arXiv.

Highlighted Details

ISC-Bench has successfully triggered harmful content generation in over 300 Arena-ranked models. Recent findings indicate single-turn prompts are increasingly ineffective against state-of-the-art models, necessitating agentic execution. The project offers ready-to-use templates across eight diverse domains, including computational biology, chemistry, and cybersecurity. Input-level defenses show a 100% failure rate against ISC triggers.

Maintenance & Community

The project is actively maintained, with recent updates in March 2026 detailing advancements in agentic TVD and tool integration. Key contributors include Yutao Wu and Hanxun Huang. Community contributions are encouraged via GitHub Issues for discovering new triggers and expanding the template library.

Licensing & Compatibility

ISC-Bench is licensed under CC BY-NC-SA 4.0, restricting usage exclusively to academic research in AI safety and prohibiting commercial use and harmful content generation.

Limitations & Caveats

Input-level defenses are ineffective against ISC. System Prompt Defenses (SPD) show limited success but fail under agentic execution. Harmful knowledge resides in pre-trained parameters; alignment suppresses explicit requests, not task-driven generation. This research is strictly for academic safety purposes, and the authors disclaim responsibility for misuse.

Health Check
Last Commit

1 day ago

Responsiveness

Inactive

Pull Requests (30d)
48
Issues (30d)
27
Star History
863 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Michele Catasta (President of Replit), and 3 more.

rebuff by protectai

0.3%
1k
SDK for LLM prompt injection detection
Created 3 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Elie Bursztein (Cybersecurity Lead at Google DeepMind), and 3 more.

llm-guard by protectai

1.0%
3k
Security toolkit for LLM interactions
Created 2 years ago
Updated 3 months ago