LLM4Decompile by albertan017

LLM for decompiling x86-64 binaries into C source code

Created 1 year ago
5,972 stars

Top 8.7% on SourcePulse

Project Summary

LLM4Decompile is an open-source project that uses large language models to decompile binary code into human-readable C source code. It targets reverse engineers and security researchers who need to understand compiled software, and it offers models that either decompile binaries directly or refine existing decompiled pseudo-code.

How It Works

The project uses a pipeline in which binary code is first disassembled into assembly language. LLMs trained on large datasets of assembly paired with the corresponding C code then translate this assembly into C. The models are evaluated on "re-executability": whether the generated C code behaves correctly, as measured by passing predefined test cases. Two approaches are offered: LLM4Decompile-End decompiles binaries directly, while LLM4Decompile-Ref refines pseudo-code generated by tools like Ghidra.
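
The disassembly-to-prompt step can be sketched roughly as follows. This is a minimal illustration, not the project's actual preprocessing: it assumes the input is text in the style produced by `objdump -d`, and the function names (`extract_function_asm`, `build_prompt`) and the prompt wording are hypothetical placeholders chosen for the example.

```python
import re

# Hypothetical prompt wording, assumed for illustration only.
PROMPT_HEADER = "# This is the assembly code:\n"
PROMPT_FOOTER = "\n# What is the source code?\n"

# Small hand-written listing in the style of `objdump -d` output.
SAMPLE_LISTING = (
    "0000000000000000 <func0>:\n"
    "   0:\t55                   \tpush   %rbp\n"
    "   1:\t48 89 e5             \tmov    %rsp,%rbp\n"
    "   4:\tc3                   \tret\n"
    "\n"
    "0000000000000005 <other>:\n"
    "   5:\tc3                   \tret\n"
)


def extract_function_asm(listing: str, func_name: str) -> str:
    """Return one function's instructions, without addresses or raw bytes."""
    header = re.compile(r"^[0-9a-fA-F]+ <" + re.escape(func_name) + r">:$")
    body = []
    inside = False
    for line in listing.splitlines():
        if not inside:
            if header.match(line.strip()):
                inside = True
                body.append(f"<{func_name}>:")
            continue
        if not line.strip():
            break  # a blank line ends the function's listing
        # Instruction lines have three tab-separated fields:
        # address, raw bytes, mnemonic with operands.
        fields = line.split("\t")
        if len(fields) >= 3:
            instr = fields[2].split("#")[0]  # drop trailing annotations
            body.append(" ".join(instr.split()))  # collapse padding
    return "\n".join(body)


def build_prompt(asm: str) -> str:
    """Wrap cleaned assembly in the (assumed) prompt template."""
    return PROMPT_HEADER + asm + PROMPT_FOOTER


if __name__ == "__main__":
    print(build_prompt(extract_function_asm(SAMPLE_LISTING, "func0")))
```

The resulting prompt string would then be tokenized and passed to one of the released models for generation; stripping addresses and raw bytes keeps the input shorter and closer to plain assembly text.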

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n 'llm4decompile' python=3.9), activate it, and install requirements (pip install -r requirements.txt).
  • Prerequisites: Python 3.9, GCC, objdump, Hugging Face transformers library, PyTorch. GPU is recommended for inference.
  • Demo: A Colab notebook is available for demonstrating model usage.

Highlighted Details

  • Offers models ranging from 1.3B to 22B parameters, with the 9B-v2 model achieving a 0.6494 re-executability rate.
  • Supports Linux x86_64 binaries compiled with GCC optimization levels O0 to O3.
  • Includes evaluation benchmarks like HumanEval-Decompile and ExeBench.
  • Provides a training script for a 100k sample subset that runs in ~3.5 hours on an A100 GPU.
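
As a rough illustration of how a re-executability rate such as the one quoted above could be computed, the sketch below assumes that a sample counts as re-executable only if the decompiled C code both recompiles and passes all of its test cases. The `DecompileResult` record and both function names are hypothetical and are not taken from the project's evaluation scripts.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class DecompileResult:
    """Outcome for one decompiled function (hypothetical record layout)."""
    opt_level: str       # GCC optimization level, e.g. "O0" .. "O3"
    compiled: bool       # did the generated C code recompile?
    tests_passed: bool   # did it pass all predefined test cases?


def re_executability_rate(results):
    """Fraction of samples that recompile and pass their tests."""
    if not results:
        return 0.0
    ok = sum(1 for r in results if r.compiled and r.tests_passed)
    return ok / len(results)


def rate_by_opt_level(results):
    """Re-executability rate computed separately per optimization level."""
    groups = defaultdict(list)
    for r in results:
        groups[r.opt_level].append(r)
    return {level: re_executability_rate(rs) for level, rs in groups.items()}


if __name__ == "__main__":
    demo = [
        DecompileResult("O0", True, True),
        DecompileResult("O0", True, False),
        DecompileResult("O3", False, False),
        DecompileResult("O3", True, True),
    ]
    print(re_executability_rate(demo))
    print(rate_by_opt_level(demo))
```

Reporting the rate per optimization level is a design choice made for this example, since higher optimization levels generally produce assembly that is harder to map back to source.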

Maintenance & Community

The project has seen recent updates, including new model releases and training data subsets. Links to Hugging Face models and a Colab notebook are provided.

Licensing & Compatibility

Licensed under the MIT and DeepSeek licenses. For commercial use or closed-source linking, review the DeepSeek License terms.

Limitations & Caveats

Currently supports only Linux x86_64 binaries. The reported re-executability rates, while improved, indicate that generated code may still require manual correction or debugging.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: 1 day
  • Pull requests (30d): 1
  • Issues (30d): 2
  • Star history: 110 stars in the last 30 days
