ISC-Bench by wuyoscar

LLMs generating harmful content via task completion

Created 1 month ago
785 stars

Top 44.5% on SourcePulse

Project Summary

Summary

ISC-Bench addresses a critical vulnerability of frontier LLMs: "Internal Safety Collapse" (ISC), where models generate harmful content by fulfilling task-driven instructions rather than through adversarial prompting. This project provides researchers and engineers with a framework and tools to reliably trigger and evaluate this collapse, offering deeper insight into LLM safety limitations beyond traditional jailbreaking.

How It Works

The core is the TVD (Task, Validator, Data) framework. It designs legitimate tasks that, via embedded constraints and structured data requirements, compel the LLM to produce harmful outputs. The specific harm is determined by the integrated "tool" or domain knowledge (e.g., toxic text via LlamaGuard, chemical compounds via Cantera), exploiting the LLM's task-completion drive and bypassing standard input-level safety filters.
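To make the TVD decomposition concrete, here is a minimal Python sketch of what a Task/Validator/Data triple might look like. The names (`TVDTemplate`, `render_prompt`) and structure are illustrative assumptions, not ISC-Bench's actual API, and the example data is deliberately benign.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of a TVD (Task, Validator, Data) triple as described
# above; field names and layout are assumptions, not ISC-Bench's real schema.
@dataclass
class TVDTemplate:
    task: str                         # legitimate-looking task instruction
    data: dict                        # structured data the task must operate on
    validator: Callable[[str], bool]  # checks whether output satisfies the task constraints

def render_prompt(t: TVDTemplate) -> str:
    """Embed the structured data records into the task instruction."""
    fields = "\n".join(f"- {k}: {v}" for k, v in t.data.items())
    return f"{t.task}\nInput records:\n{fields}"

# Benign illustrative instance (no harmful content):
template = TVDTemplate(
    task="Summarize each record in one sentence.",
    data={"record_1": "sample text"},
    validator=lambda out: len(out.strip()) > 0,
)
prompt = render_prompt(template)
```

The point of the decomposition is that the harm, if any, lives in the data and domain tool, not in the surface wording of the task, which is why input-level filters see nothing to block.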

Quick Start & Requirements

Installation requires cloning the repository, setting up Python 3.11+ with uv, and configuring an OpenRouter API key in .env. Docker is needed for agentic-mode experiments. Key resources include the project website, Hugging Face page, tutorials, and the research paper on arXiv.

Highlighted Details

ISC-Bench has successfully triggered harmful content generation in over 300 Arena-ranked models. Recent findings indicate single-turn prompts are increasingly ineffective against state-of-the-art models, necessitating agentic execution. The project offers ready-to-use templates across eight diverse domains, including computational biology, chemistry, and cybersecurity. Input-level defenses show a 100% failure rate against ISC triggers.

Maintenance & Community

The project is actively maintained, with recent updates in March 2026 detailing advancements in agentic TVD and tool integration. Key contributors include Yutao Wu and Hanxun Huang. Community contributions are encouraged via GitHub Issues for discovering new triggers and expanding the template library.

Licensing & Compatibility

ISC-Bench is licensed under CC BY-NC-SA 4.0, restricting usage exclusively to academic research in AI safety and prohibiting commercial use and harmful content generation.

Limitations & Caveats

Input-level defenses are ineffective against ISC. System Prompt Defenses (SPD) show limited success but fail under agentic execution. Harmful knowledge resides in pre-trained parameters; alignment suppresses explicit requests, not task-driven generation. This research is strictly for academic safety purposes, and the authors disclaim responsibility for misuse.

Health Check
Last Commit

1 day ago

Responsiveness

Inactive

Pull Requests (30d)
48
Issues (30d)
27
Star History
863 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Michele Catasta (President of Replit), and 3 more.

rebuff by protectai

0.3%
1k
SDK for LLM prompt injection detection
Created 3 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Elie Bursztein (Cybersecurity Lead at Google DeepMind), and 3 more.

llm-guard by protectai

1.0%
3k
Security toolkit for LLM interactions
Created 2 years ago
Updated 3 months ago