Binary reverse engineering via LLM
This project is a prototype for whole-program reverse engineering with GPT-3: it summarizes a complex binary by recursively summarizing each function together with the functions it depends on. It is aimed at reverse engineers and security researchers who need to understand programs too large to fit in a large language model's context window. The primary benefit is a natural-language summary of program functionality, even for large or complex software.
How It Works
GPT-WPRE leverages Ghidra for decompilation and call graph extraction. It then employs a recursive summarization strategy: functions are summarized in an order determined by a topological sort of the call graph, with callees' summaries provided as context for their callers. For functions exceeding the LLM's context limit, it attempts to summarize sequential chunks of code, recursively reducing chunk size if necessary.
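The two strategies described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: `summarize_program`, `summarize_large`, the `summarize` and `fits` callables, the prompt layout, the halving of the chunk size, and the joining of partial summaries are all assumptions made for the example. The standard-library `graphlib.TopologicalSorter` (Python 3.9+) stands in for whatever sort the prototype actually uses.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List

# Hypothetical stand-in for an LLM call: takes a prompt, returns a summary.
Summarizer = Callable[[str], str]


def summarize_program(
    call_graph: Dict[str, List[str]],
    decompiled: Dict[str, str],
    summarize: Summarizer,
) -> Dict[str, str]:
    """Summarize every function, callees before callers.

    call_graph maps each function name to the functions it calls.
    """
    summaries: Dict[str, str] = {}
    # TopologicalSorter expects node -> predecessors. Treating callees as
    # predecessors yields each callee before any function that calls it.
    for name in TopologicalSorter(call_graph).static_order():
        context = "\n".join(
            f"{callee}: {summaries[callee]}" for callee in call_graph.get(name, [])
        )
        # Assumed prompt layout: callee summaries first, then the code.
        prompt = f"Callee summaries:\n{context}\n\nCode:\n{decompiled[name]}"
        summaries[name] = summarize(prompt)
    return summaries


def summarize_large(
    lines: List[str],
    summarize: Summarizer,
    fits: Callable[[str], bool],
    chunk_lines: int,
) -> str:
    """Summarize code that may exceed the context limit, chunk by chunk."""
    text = "\n".join(lines)
    if fits(text):
        return summarize(text)
    if chunk_lines < 1:
        raise ValueError("a single line exceeds the context limit")
    parts = []
    for start in range(0, len(lines), chunk_lines):
        chunk = lines[start:start + chunk_lines]
        if fits("\n".join(chunk)):
            parts.append(summarize("\n".join(chunk)))
        else:
            # Chunk still too large: retry it with half the chunk size
            # (halving is an assumption; the source only says "reducing").
            parts.append(summarize_large(chunk, summarize, fits, chunk_lines // 2))
    # Assumed combination step: concatenate the partial summaries.
    return " ".join(parts)
```

In this sketch, `fits` would wrap a token count against the model's limit, and `summarize` would wrap the API call. Passing both in as callables keeps the ordering and chunking logic testable without network access.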
Quick Start & Requirements
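Install dependencies: pip install -r requirements.txt
Prerequisite: Ghidra with ghidra_bridge enabled.
Extract the decompilation and call graph from Ghidra: python extract_ghidra_decomp.py
Summarize a function and its callees: python recursive_summarize.py -f <function_name> <program_directory>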
Highlighted Details
Uses the text-davinci-003 model.
Supports a --dry-run flag to estimate API costs.
Includes a debugging script (extras/debug_summaries.py) for side-by-side comparison of source, decompiled code, and summaries.
Maintenance & Community
The project appears to be a personal prototype by "moyix" with no explicit mention of community channels, ongoing development, or sponsorships.
Licensing & Compatibility
The README does not specify a license.
Limitations & Caveats
The tool is a "toy prototype" tested on only one program (libpng) and may not generalize well. It does not handle mutual recursion or cycles in the call graph, which will cause exceptions. The summarization prompts are basic and could likely be improved with prompt engineering. API costs can be significant for full program analysis.
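Because cycles in the call graph are not handled, it can help to check for them before spending API calls. The sketch below is one way to do that with the standard library's `graphlib` (Python 3.9+). `find_cycle` is a hypothetical helper and is not part of the project, whose own failure mode on cycles is not described beyond "exceptions".

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Optional


def find_cycle(call_graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return the functions forming a cycle, or None if the graph is acyclic.

    call_graph maps each function name to the functions it calls.
    """
    sorter = TopologicalSorter(call_graph)
    try:
        sorter.prepare()  # raises CycleError if the graph has a cycle
    except CycleError as err:
        # args[1] is the list of nodes in the detected cycle; its first
        # and last entries are the same node.
        return list(err.args[1])
    return None
```

A caller could report the returned functions to the user, or break the cycle by summarizing one of them without callee context, before running the full summarization pass.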