OpenClaw-PwnKit by imbue-bit

LLM agent RCE via adversarial tool-call hijacking

Created 1 month ago
281 stars

Top 92.6% on SourcePulse

Project Summary

This research framework demonstrates a critical class of attack against LLM agents: Remote Code Execution (RCE) achieved through adversarial tool-call hijacking. It is aimed at security researchers and engineers evaluating LLM agent vulnerabilities, and provides a method for bypassing safety alignment by exploiting perturbations in the continuous token-embedding space. The primary benefit is advancing the understanding of these sophisticated attacks to drive the development of more robust defenses.
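To make "tool-call hijacking" concrete, here is a minimal illustration of the failure mode: the agent's planner intends a benign tool call, but an adversarial trigger steers the model into emitting a dangerous one instead. The tool names, message layout, and guardrail below are all hypothetical, not taken from this project:

```python
# Illustration only: what "tool-call hijacking" means in practice.
# Tool names and the call schema are hypothetical stand-ins for whatever
# the agent framework actually defines.
import json

# The user's benign request, as the agent would normally route it:
intended_call = {
    "name": "read_file",
    "arguments": json.dumps({"path": "notes.txt"}),
}

# With an adversarial trigger appended to the input, the model is steered
# into emitting a different, code-executing tool call instead:
hijacked_call = {
    "name": "execute_shell",
    "arguments": json.dumps({"command": "<payload fetched from C2 server>"}),
}

def is_dangerous(call, denylist=("execute_shell", "run_code")):
    """A trivial guardrail sketch: flag tool calls that reach code execution."""
    return call["name"] in denylist
```

A denylist like this is exactly the kind of shallow defense such attacks are designed to probe; the point of the framework is that alignment-level refusals can be bypassed without ever naming a forbidden tool in the prompt.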

How It Works

The project employs a black-box attack strategy using CMA-ES (Covariance Matrix Adaptation Evolution Strategy) to optimize adversarial triggers in the token embedding space. It leverages a surrogate model (Phi-2) to extract token embeddings, applies PCA for dimensionality reduction, and uses FAISS for mapping continuous vectors back to discrete tokens. This derivative-free optimization approach is advantageous as it does not require access to model gradients, making it effective against closed-source frontier LLMs.
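The pipeline described above can be sketched end to end with toy stand-ins: plain NumPy takes the place of Phi-2 embeddings, scikit-learn's PCA, FAISS nearest-neighbor search, and the `cma` package, and a simple (μ, λ) evolution strategy stands in for full CMA-ES. All names, shapes, and the objective function are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy surrogate-model embedding table (the project uses Phi-2's).
vocab_size, embed_dim, pca_dim = 100, 32, 8
embeddings = rng.normal(size=(vocab_size, embed_dim))

# "PCA" for dimensionality reduction, done here via SVD.
centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:pca_dim].T          # (vocab_size, pca_dim)

def project_to_tokens(z):
    """Map continuous vectors back to nearest discrete tokens (FAISS's role)."""
    dists = np.linalg.norm(reduced[None, :, :] - z[:, None, :], axis=-1)
    return dists.argmin(axis=1)              # one token id per trigger position

def loss(token_ids):
    """Black-box objective: in the real attack this queries the target LLM
    and scores whether the decoded trigger hijacks the tool call."""
    return float(np.sum(token_ids))          # placeholder score

# Simple (mu, lambda) evolution strategy standing in for CMA-ES: no gradients,
# only black-box loss evaluations -- the property that matters against
# closed-source models.
trigger_len = 4
mean, sigma = np.zeros((trigger_len, pca_dim)), 1.0
for _ in range(20):
    pop = mean + sigma * rng.normal(size=(16, trigger_len, pca_dim))
    scores = [loss(project_to_tokens(cand)) for cand in pop]
    mean = pop[np.argsort(scores)[:4]].mean(axis=0)   # recombine the elite
    sigma *= 0.95                                     # cool the step size

best_tokens = project_to_tokens(mean)        # discrete adversarial trigger
```

CMA-ES proper additionally adapts a full covariance matrix rather than a scalar step size, which is what makes it effective in the correlated, low-dimensional space PCA produces.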

Quick Start & Requirements

  • Installation: Clone the repository, create and activate a Python virtual environment, and install dependencies via pip install -r requirements.txt.
  • Prerequisites: Python 3.10+, PyTorch, Transformers, FAISS (faiss-cpu), CMA, scikit-learn, FastAPI, OpenAI SDK, Rich, loguru. An OpenAI API key is required. The surrogate model (microsoft/phi-2, ~5 GB) is downloaded automatically.
  • Configuration: Edit config.yaml to set the C2 server URL and OpenAI API key.
  • Usage: Interactive CLI (python pwnkit_cli.py) or programmatic API.
  • Links: GitHub Repository
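The README excerpted here does not show the config file's contents; based on the two settings listed above, config.yaml presumably looks something like the following (key names and values are assumptions, so check the repository's own template):

```yaml
# Hypothetical key names -- verify against the repository's config.yaml.
c2_server_url: "http://your-c2-host:8000"
openai_api_key: "sk-..."
```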

Highlighted Details

  • Demonstrates black-box adversarial attacks on LLM agent tool-calling via CMA-ES in token embedding space.
  • Achieves Remote Code Execution (RCE) by inducing adversarial tool-call hijacking.
  • Effective against closed-source frontier models (e.g., GPT-4, Claude 3) without gradient access.
  • Includes multiple attack vectors: CMA-ES trigger optimization, naive prompt injection, honeypot delivery, and skill poisoning.

Maintenance & Community

No specific details regarding maintainers, sponsorships, or community channels (e.g., Discord, Slack) are provided in the README. The project welcomes Pull Requests.

Licensing & Compatibility

Licensed under the GNU General Public License v3.0 (GPL v3). This is a strong copyleft license, requiring derivative works to also be licensed under GPL v3. Compatibility for commercial use or linking with closed-source software requires careful review of GPL v3 obligations.

Limitations & Caveats

This tool is released strictly for academic research and authorized security testing. A full optimization run can incur significant API costs (estimated $50–200 USD) and requires substantial compute resources (several hours wall time, ~8 GB GPU memory). The accompanying paper is listed as "Coming Soon." The toolkit specifically targets OpenClaw-based agents as a reference implementation.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 51 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Michele Castata (President of Replit), and 3 more.

rebuff by protectai

0.3% · 1k stars
SDK for LLM prompt injection detection
Created 3 years ago · Updated 1 year ago
Starred by Elie Bursztein (Cybersecurity Lead at Google DeepMind), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 6 more.

llm-attacks by llm-attacks

0.1% · 5k stars
Attack framework for aligned LLMs, based on a research paper
Created 2 years ago · Updated 1 year ago
Starred by Dan Guido (Cofounder of Trail of Bits), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 5 more.

PurpleLlama by meta-llama

0.4% · 4k stars
LLM security toolkit for assessing/improving generative AI models
Created 2 years ago · Updated 1 day ago