LLM4Decompile by albertan017

LLM for decompiling x86-64 binaries into C source code

Created 1 year ago

6,240 stars

Top 8.2% on SourcePulse

View on GitHub

2 Experts Love This Project

Chip Huyen

Author of "AI Engineering", "Designing Machine Learning Systems"

Elvis Saravia

Founder of DAIR.AI

Project Summary

LLM4Decompile is a pioneering open-source project that leverages large language models for the reverse engineering task of decompiling binary code into human-readable C source code. It targets reverse engineers and security researchers seeking to understand compiled software, offering models that can directly decompile binaries or refine existing decompiled pseudo-code.

How It Works

The project employs a pipeline where binary code is first disassembled into assembly language. LLMs, trained on vast datasets of assembly and corresponding C code, then translate this assembly into C. The models are evaluated based on "re-executability," measuring if the generated C code functions correctly by passing predefined test cases. Two main approaches are offered: LLM4Decompile-End directly decompiles binaries, while LLM4Decompile-Ref refines pseudo-code generated by tools like Ghidra.

Quick Start & Requirements

Install: Clone the repository, create a conda environment (conda create -n 'llm4decompile' python=3.9), activate it, and install requirements (pip install -r requirements.txt).
Prerequisites: Python 3.9, GCC, objdump, Hugging Face transformers library, PyTorch. GPU is recommended for inference.
Demo: A Colab notebook is available for demonstrating model usage.

Highlighted Details

Offers models ranging from 1.3B to 22B parameters, with the 9B-v2 model achieving a 0.6494 re-executability rate.
Supports Linux x86_64 binaries compiled with GCC optimization levels O0 to O3.
Includes evaluation benchmarks like HumanEval-Decompile and ExeBench.
Provides a training script for a 100k sample subset that runs in ~3.5 hours on an A100 GPU.

Maintenance & Community

The project has seen recent updates, including new model releases and training data subsets. Links to Hugging Face models and a Colab notebook are provided.

Licensing & Compatibility

Licensed under the MIT and DeepSeek License. Compatibility for commercial use or closed-source linking should be reviewed against the DeepSeek License terms.

Limitations & Caveats

Currently supports only Linux x86_64 architecture. The re-executability rates, while improved, indicate that the generated code may still require manual correction or debugging.

Health Check

Last Commit

1 month ago

Responsiveness

1 day

Pull Requests (30d)

Issues (30d)

Star History

70 stars in the last 30 days