devign  by epicosy

Code vulnerability detection via graph neural networks

Created 5 years ago
253 stars

Project Summary

Devign: Vulnerability Identification via Graph Neural Networks

This project addresses software vulnerability identification by using graph neural networks (GNNs) to learn program semantics from graph representations of code. It is aimed at researchers and engineers building static code analysis and vulnerability detection tools. Rather than treating code as a sequence of tokens, it models code structure explicitly as a graph.

How It Works

Devign represents code as Code Property Graphs (CPGs) generated by the Joern tool. Graph nodes are embedded to capture program semantics; at present only the Abstract Syntax Tree (AST) portion of the graph is embedded. The embeddings are then fed into a GNN model for vulnerability classification. The graph-based representation lets the model learn relationships within the code structure, with the aim of detecting vulnerabilities more effectively than methods that rely solely on sequential or simpler structural analysis.
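
To make the idea of "node embeddings fed into a GNN" concrete, here is a minimal pure-Python sketch of one message-passing round followed by a graph-level readout. It is an illustration only: the mean aggregation, the logistic readout, the two-dimensional feature vectors, and the three-node graph are all assumptions chosen for brevity, not the repository's model, features, or graph schema.

```python
# Toy illustration only: NOT the repository's model, features, or graph schema.
from math import exp


def message_passing_step(features, edges):
    """One round of mean aggregation over an undirected graph."""
    neighbors = {node: [] for node in features}
    for src, dst in edges:
        neighbors[src].append(dst)
        neighbors[dst].append(src)
    updated = {}
    for node, vec in features.items():
        group = [vec] + [features[n] for n in neighbors[node]]
        # Average the node's own vector with its neighbors' vectors.
        updated[node] = [sum(col) / len(group) for col in zip(*group)]
    return updated


def classify_graph(features, edges, weights, bias=0.0, rounds=2):
    """Run message passing, mean-pool all nodes, apply a logistic readout."""
    for _ in range(rounds):
        features = message_passing_step(features, edges)
    pooled = [sum(col) / len(features) for col in zip(*features.values())]
    score = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + exp(-score))


# Hypothetical 3-node AST fragment: a call node with two argument children.
node_features = {"call": [1.0, 0.0], "arg0": [0.0, 1.0], "arg1": [0.0, 1.0]}
ast_edges = [("call", "arg0"), ("call", "arg1")]
probability = classify_graph(node_features, ast_edges, weights=[0.5, -0.5])
```

In a real pipeline the feature vectors would come from a learned embedding and the readout weights would be trained; here they are fixed by hand so the data flow is easy to follow.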

Quick Start & Requirements

  • Primary install/run: Clone the repository and execute tasks using python main.py with flags (e.g., -c for Create, -e for Embed, -p for Process).
  • Non-default prerequisites: Joern command-line tools, Python (>=3.6), Pandas (>=1.0.1), scikit-learn (>=0.22.2), PyTorch (>=1.4.0), PyTorch Geometric (>=1.4.2), Gensim (>=3.8.1), cpgclientlib (>=0.11.111). The PyTorch Geometric dependencies must be matched to the installed PyTorch version.
  • Setup cost: no time estimate is given. Joern processing can be slow and resource-intensive, and may affect system performance while it runs.
  • Links: Joern documentation page.
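
Before running anything, it can help to confirm that the listed minimum versions are met. The sketch below uses the standard-library `importlib.metadata` module (available from Python 3.8, which is newer than the 3.6 minimum listed above). The distribution names in `MINIMUMS` are assumptions about how each package is published, and `parse_version` is a deliberately simple helper written for this example, not a full version parser.

```python
from importlib import metadata

# Minimum versions as listed above; the distribution names are assumptions.
MINIMUMS = {
    "pandas": "1.0.1",
    "scikit-learn": "0.22.2",
    "torch": "1.4.0",
    "torch-geometric": "1.4.2",
    "gensim": "3.8.1",
    "cpgclientlib": "0.11.111",
}


def parse_version(text):
    """Turn '1.4.0+cu101' into (1, 4, 0); stop at the first non-numeric part."""
    parts = []
    for piece in text.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    while len(parts) < 3:
        parts.append(0)  # pad so that '1.4' compares equal to '1.4.0'
    return tuple(parts)


def check_requirements(minimums, lookup=metadata.version):
    """Return {package: problem} for every missing or outdated package."""
    problems = {}
    for name, minimum in minimums.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if parse_version(installed) < parse_version(minimum):
            problems[name] = f"found {installed}, need >= {minimum}"
    return problems
```

Calling `check_requirements(MINIMUMS)` returns an empty dictionary when every package is present and recent enough. It does not check the Joern tools, and it does not verify that the PyTorch Geometric build matches the installed PyTorch.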

Highlighted Details

  • Generates Code Property Graphs (CPGs) using Joern, though current model training primarily uses Abstract Syntax Tree (AST) embeddings.
  • Employs a three-stage workflow: CPG creation, tokenization and embedding generation (Word2Vec), and model training/evaluation.
  • Provides a main.py script to orchestrate the Create, Embed, and Process tasks.
  • Reports performance metrics such as accuracy, precision, recall, and AUC on a sample FFmpeg dataset.
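
The three stages can be chained from a small driver script. The sketch below only builds and runs the commands named in this summary (`python main.py` with `-c`, `-e`, and `-p`). Running one flag per invocation, and running from the root of the cloned repository, are assumptions; `build_commands` and `run_pipeline` are hypothetical helper names, not part of the project.

```python
import subprocess
import sys

# Flags taken from the summary; one flag per invocation is an assumption.
STAGE_FLAGS = {"create": "-c", "embed": "-e", "process": "-p"}


def build_commands(stages=("create", "embed", "process"), script="main.py"):
    """Return one command line per stage, in the order given."""
    return [[sys.executable, script, STAGE_FLAGS[stage]] for stage in stages]


def run_pipeline(stages=("create", "embed", "process"), repo_dir="."):
    """Run each stage in turn; check=True stops at the first failing stage."""
    for command in build_commands(stages):
        subprocess.run(command, cwd=repo_dir, check=True)
```

Because the Create stage depends on Joern and can be slow, it may be convenient to run it once and then call `run_pipeline(("embed", "process"))` on later iterations.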

Maintenance & Community

The project describes itself as "under development," and its roadmap is tracked in the open issues. The listed authors are Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. The contact person is Eduard Pinconschi.

Licensing & Compatibility

Distributed under the MIT License, which permits commercial use and integration into closed-source projects, provided the copyright and license notice are retained.

Limitations & Caveats

The project is still labeled as under development. The GNN model takes only AST embeddings as input, so the rest of the CPG semantics, including the CFG and PDG, goes unused. Joern's CPG generation can be slow and resource-intensive. The project is not yet available as a pip-installable package.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 0
  • Star history: 5 stars in the last 30 days
