LLM for decompiling x86-64 binaries into C source code
LLM4Decompile is a pioneering open-source project that leverages large language models for the reverse engineering task of decompiling binary code into human-readable C source code. It targets reverse engineers and security researchers seeking to understand compiled software, offering models that can directly decompile binaries or refine existing decompiled pseudo-code.
How It Works
The project employs a pipeline where binary code is first disassembled into assembly language. LLMs, trained on vast datasets of assembly and corresponding C code, then translate this assembly into C. The models are evaluated based on "re-executability," measuring if the generated C code functions correctly by passing predefined test cases. Two main approaches are offered: LLM4Decompile-End directly decompiles binaries, while LLM4Decompile-Ref refines pseudo-code generated by tools like Ghidra.
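The re-executability metric described above can be sketched as follows. This is a minimal illustration, not the project's actual evaluation script: the function names compile_and_run and re_executability_rate are hypothetical, the sketch assumes gcc is available on PATH, and it assumes each test harness supplies a main() that exits with a non-zero status when a test case fails.

```python
import os
import subprocess
import tempfile
from typing import Callable, Iterable, Tuple


def compile_and_run(c_source: str, test_harness: str, timeout: float = 10.0) -> bool:
    """Compile decompiled C code together with a test harness and run it.

    Returns True only if compilation succeeds and the binary exits with 0.
    Assumption: `gcc` is on PATH and the harness provides main().
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "candidate.c")
        exe = os.path.join(tmp, "candidate")
        with open(src, "w") as f:
            f.write(c_source + "\n" + test_harness)
        try:
            build = subprocess.run(
                ["gcc", src, "-o", exe, "-lm"],
                capture_output=True,
                timeout=timeout,
            )
            if build.returncode != 0:
                return False  # generated code does not even compile
            run = subprocess.run([exe], capture_output=True, timeout=timeout)
        except (OSError, subprocess.TimeoutExpired):
            # Missing compiler, crash on launch, or a hang all count as failure.
            return False
        return run.returncode == 0


def re_executability_rate(
    samples: Iterable[Tuple[str, str]],
    runner: Callable[[str, str], bool] = compile_and_run,
) -> float:
    """Fraction of (decompiled_code, test_harness) pairs that pass their tests."""
    samples = list(samples)
    if not samples:
        return 0.0
    passed = sum(1 for code, harness in samples if runner(code, harness))
    return passed / len(samples)
```

The runner is injectable so the scoring logic can be exercised without a compiler, for example by passing a stub that returns a fixed verdict per sample. A stricter setup would also sandbox the compiled binary, since model-generated code is untrusted.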
Quick Start & Requirements
Create a conda environment (conda create -n 'llm4decompile' python=3.9), activate it, and install requirements (pip install -r requirements.txt).
Requires the transformers library and PyTorch. GPU is recommended for inference.
Highlighted Details
Maintenance & Community
The project has seen recent updates, including new model releases and training data subsets. Links to Hugging Face models and a Colab notebook are provided.
Licensing & Compatibility
Licensed under the MIT License and the DeepSeek License. Anyone planning commercial use or closed-source linking should review the DeepSeek License terms first.
Limitations & Caveats
Currently supports only Linux x86_64 architecture. The re-executability rates, while improved, indicate that the generated code may still require manual correction or debugging.