Awesome-Large-Model-Safety  by xingjunm

Surveying safety across large models and AI agents

Created 1 year ago
256 stars

Top 98.5% on SourcePulse

Summary

This repository presents a comprehensive survey of safety research concerning large AI models, including LLMs, VLMs, and diffusion models. It offers a structured taxonomy of safety threats, defense strategies, datasets, and benchmarks, aiming to provide researchers and practitioners with a systematic overview of the field. The survey highlights key trends, identifies open challenges, and serves as a foundational resource for understanding and advancing large model safety.

How It Works

The survey systematically reviews safety research across six model categories: Vision Foundation Models (VFMs), Large Language Models (LLMs), Vision-Language Pre-training (VLP) models, Vision-Language Models (VLMs), Diffusion Models (DMs), and large-model-based Agents. Within each category, research is split into attacks and defenses, and the survey covers ten attack types (e.g., adversarial, backdoor, jailbreak, prompt injection). A two-level taxonomy (Category → Subcategory) classifies attacks and defenses by threat model or specific subtask. Papers were collected through keyword-based searches followed by manual filtering, yielding 390 technical papers for analysis.
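
As a rough illustration of the two-level taxonomy described above, the following Python sketch groups papers by model category, then by attack or defense, then by Category → Subcategory. It is only a sketch under stated assumptions: the `Paper` fields, the function names, and the short model labels are hypothetical and invented for illustration; none of them come from the survey or its repository.

```python
from collections import defaultdict
from dataclasses import dataclass

# Short labels for the six model categories named in the summary (labels are assumptions).
MODEL_CATEGORIES = ("VFM", "LLM", "VLP", "VLM", "DM", "Agent")
SIDES = ("attack", "defense")


@dataclass(frozen=True)
class Paper:
    """Hypothetical record for one surveyed paper; all fields are illustrative."""
    title: str
    model: str        # one of MODEL_CATEGORIES
    side: str         # "attack" or "defense"
    category: str     # first taxonomy level, e.g. "jailbreak"
    subcategory: str  # second taxonomy level, e.g. a threat model or subtask


def build_taxonomy(papers):
    """Nest titles as model -> side -> category -> subcategory -> [titles]."""
    tree = defaultdict(
        lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    )
    for p in papers:
        if p.model not in MODEL_CATEGORIES:
            raise ValueError(f"unknown model category: {p.model!r}")
        if p.side not in SIDES:
            raise ValueError(f"side must be 'attack' or 'defense': {p.side!r}")
        tree[p.model][p.side][p.category][p.subcategory].append(p.title)
    return tree


def count_by_category(papers, side):
    """Count papers per first-level category for one side (attack or defense)."""
    counts = defaultdict(int)
    for p in papers:
        if p.side == side:
            counts[p.category] += 1
    return dict(counts)
```

A caller would pass a list of `Paper` records to `build_taxonomy` and index the result, for example `tree["LLM"]["attack"]["jailbreak"]`, to list the subcategories recorded under that category. Nested dictionaries mirror the hierarchy directly; a flat table with one column per level would work equally well.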

Quick Start & Requirements

This repository contains a survey paper and associated research findings, not executable code. Therefore, there are no installation or runtime requirements.

Highlighted Details

  • Large model safety research has surged since 2023, following the release of ChatGPT in late 2022.
  • LLMs and Diffusion Models (DMs) receive the most research attention, accounting for over 60% of surveyed papers.
  • Jailbreak, adversarial, and backdoor attacks are the most extensively studied attack types.
  • Jailbreak defenses are the primary focus of defense research, followed by adversarial defenses.

Maintenance & Community

A major revision was completed in August 2025, representing the final planned update for this version of the survey. Researchers can submit papers for citation via a provided form to help maintain comprehensiveness.

Licensing & Compatibility

Licensing information is not specified in the provided README.

Limitations & Caveats

The current version is the final planned update of the survey, so further substantial revisions may not follow. The survey summarizes key ideas and approaches and omits the deep technical details and experimental analyses of individual papers.

Health Check
Last Commit: 3 weeks ago
Responsiveness: Inactive
Pull Requests (30d): 1
Issues (30d): 0

Star History

27 stars in the last 30 days
