LLM4Decompile by albertan017

LLM for decompiling x86-64 binaries into C source code

created 1 year ago
5,817 stars

Top 9.0% on sourcepulse

Project Summary

LLM4Decompile is a pioneering open-source project that leverages large language models for the reverse engineering task of decompiling binary code into human-readable C source code. It targets reverse engineers and security researchers seeking to understand compiled software, offering models that can directly decompile binaries or refine existing decompiled pseudo-code.

How It Works

The project employs a pipeline where binary code is first disassembled into assembly language. LLMs, trained on vast datasets of assembly and corresponding C code, then translate this assembly into C. The models are evaluated based on "re-executability," measuring if the generated C code functions correctly by passing predefined test cases. Two main approaches are offered: LLM4Decompile-End directly decompiles binaries, while LLM4Decompile-Ref refines pseudo-code generated by tools like Ghidra.
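The end-to-end flow can be sketched as a small prompt-building step. The instruction template below follows the wording in the project's README, but treat it as an assumption rather than a stable API; the model name mentioned in the comment is illustrative.

```python
# Sketch of the LLM4Decompile-End flow: disassemble the binary, wrap the
# assembly in an instruction prompt, then hand the prompt to the model.
# The template wording is an assumption based on the project's README.

def build_prompt(assembly: str) -> str:
    """Wrap disassembled output in the instruction template the model expects."""
    return (
        "# This is the assembly code:\n"
        + assembly.strip()
        + "\n# What is the source code?\n"
    )

# Toy fragment of x86-64 assembly, roughly as objdump would print it.
asm = """
0000000000001129 <func0>:
    1129: endbr64
    112d: lea eax,[rdi+rsi*1]
    1130: ret
"""

prompt = build_prompt(asm)
# The prompt would then be tokenized and passed to one of the released
# checkpoints (e.g. a llm4decompile v2 model) via Hugging Face transformers.
```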

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n 'llm4decompile' python=3.9), activate it, and install requirements (pip install -r requirements.txt).
  • Prerequisites: Python 3.9, GCC, objdump, Hugging Face transformers library, PyTorch. GPU is recommended for inference.
  • Demo: A Colab notebook is available for demonstrating model usage.
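A minimal sanity check for the prerequisites listed above can be written in a few lines; the tool names are the standard ones (`gcc`, `objdump`), and nothing here is specific to this project.

```python
import shutil
import sys

def check_prereqs() -> dict:
    """Report which of the required interpreters/tools are on PATH."""
    return {
        "python>=3.9": sys.version_info >= (3, 9),
        "gcc": shutil.which("gcc") is not None,
        "objdump": shutil.which("objdump") is not None,
    }

status = check_prereqs()
for name, ok in status.items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```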

Highlighted Details

  • Offers models ranging from 1.3B to 22B parameters, with the 9B-v2 model achieving a 0.6494 re-executability rate.
  • Supports Linux x86_64 binaries compiled with GCC optimization levels O0 to O3.
  • Includes evaluation benchmarks like HumanEval-Decompile and ExeBench.
  • Provides a training script for a 100k sample subset that runs in ~3.5 hours on an A100 GPU.
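Re-executability, the metric behind figures like the 0.6494 above, is simply the fraction of generated functions that recompile and pass their test cases. The helper below is a hedged sketch of that bookkeeping only; the real benchmark compiles each candidate with GCC and runs the HumanEval-Decompile tests.

```python
def re_executability(results: list[bool]) -> float:
    """Fraction of decompiled functions that recompiled and passed all tests."""
    if not results:
        return 0.0
    return sum(results) / len(results)

# Toy outcome vector: True = generated C compiled and passed its test cases.
outcomes = [True, True, False, True]
rate = re_executability(outcomes)
print(f"re-executability: {rate:.4f}")  # 0.7500 on this toy vector
```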

Maintenance & Community

The project has seen recent updates, including new model releases and training data subsets. Links to Hugging Face models and a Colab notebook are provided.

Licensing & Compatibility

Licensed under both the MIT License and the DeepSeek License. Compatibility for commercial use or closed-source linking should be reviewed against the DeepSeek License terms.

Limitations & Caveats

Currently supports only Linux x86_64 architecture. The re-executability rates, while improved, indicate that the generated code may still require manual correction or debugging.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 2

Star History

321 stars in the last 90 days

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Jiayi Pan (Author of SWE-Gym; AI Researcher at UC Berkeley).

Explore Similar Projects

DeepSeek-Coder-V2 by deepseek-ai

Open-source code language model comparable to GPT4-Turbo
Top 0.4% on sourcepulse · 6k stars · created 1 year ago · updated 10 months ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Nat Friedman (Former CEO of GitHub), and 32 more.

llama.cpp by ggml-org

C/C++ library for local LLM inference
Top 0.4% on sourcepulse · 84k stars · created 2 years ago · updated 16 hours ago