malware-detection  by dchad

Malware detection via ML experiments

created 9 years ago
340 stars

Top 82.2% on sourcepulse

Project Summary

This repository explores malware detection and classification using machine learning, targeting security researchers and developers. It provides a framework for feature engineering, selection, and model training on large malware datasets, aiming to improve detection accuracy.

How It Works

The project takes a multi-stage approach to malware analysis. It begins by extracting features from disassembled binaries (ASM), file metadata, and call graphs. Feature selection then uses techniques such as chi-squared tests to keep features with high variance and predictive power. Several classifiers, including ExtraTreesClassifier, XGBoost, and LightGBM, are evaluated, and ensemble methods and stacked models are explored to improve performance.
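The chi-squared selection step can be illustrated with a minimal sketch. This is not the repository's code: it is a dependency-free Python 3 approximation of the statistic commonly used for non-negative count features (for example opcode counts), and the names `chi2_scores` and `select_top_k` are hypothetical.

```python
from collections import defaultdict


def chi2_scores(X, y):
    """Return one chi-squared score per feature column.

    Assumes every feature value is a non-negative count.
    observed = sum of the feature over the samples of one class
    expected = (share of samples in that class) * (total of the feature)
    """
    n_samples = len(X)
    n_features = len(X[0])
    class_counts = defaultdict(int)
    observed = defaultdict(lambda: [0.0] * n_features)
    for row, label in zip(X, y):
        class_counts[label] += 1
        for j, value in enumerate(row):
            observed[label][j] += value

    totals = [sum(row[j] for row in X) for j in range(n_features)]
    scores = []
    for j in range(n_features):
        score = 0.0
        for label, count in class_counts.items():
            expected = totals[j] * count / n_samples
            if expected > 0:
                score += (observed[label][j] - expected) ** 2 / expected
        scores.append(score)
    return scores


def select_top_k(X, y, k):
    """Return the column indices of the k highest-scoring features."""
    scores = chi2_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy data (made up): three count features, two classes.
X = [[3, 1, 0], [4, 1, 0], [0, 1, 5], [0, 1, 6]]
y = [0, 0, 1, 1]
selected = select_top_k(X, y, k=2)
```

In this toy data the second column is identical across both classes, so it scores zero and is dropped. The selected columns would then be passed to whichever classifier is being evaluated.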

Quick Start & Requirements

  • Installation: Requires Python 2.7 for Cuckoo Sandbox. Dependencies include numpy, scipy, scikit-learn, matplotlib, jupyter, pandas, xgboost, cython. For disassembly, binutils with multi-architecture support is needed.
  • Data: Large malware datasets (e.g., VirusShare.com archives, ~25GB each) are required for training.
  • Tools: Cuckoo Sandbox, IDA Pro, Volatility, TrID, ClamAV are used for analysis and feature extraction.
  • Setup: Significant setup time is expected due to the complexity of installing dependencies and acquiring/processing large datasets. Detailed setup instructions are provided for Debian/Ubuntu and Windows.
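Because the dependency list is long, a quick check of what is already installed can save time before starting. The following is a minimal sketch, assuming Python 3 and assuming the import names shown (scikit-learn is imported as `sklearn`, cython as `Cython`, and `jupyter_core` is assumed to be installed alongside jupyter). The helper `missing_modules` is hypothetical and not part of the repository.

```python
import importlib.util

# Import names assumed for the Python dependencies listed above.
REQUIRED_MODULES = [
    "numpy",
    "scipy",
    "sklearn",
    "matplotlib",
    "jupyter_core",
    "pandas",
    "xgboost",
    "Cython",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All listed Python dependencies were found.")
```

This only covers the Python packages. The external tools (Cuckoo Sandbox, IDA Pro, Volatility, TrID, ClamAV, binutils) have to be checked separately.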

Highlighted Details

  • Achieved 99.81% accuracy with XGBoost on a smaller dataset using 623 features.
  • Explores feature sets derived from ASM, file entropy, file size, call graphs, and behavioral analysis.
  • Investigates ensemble methods and stacked classifiers for improved robustness.
  • Includes detailed workflows for training label generation and feature engineering for PE/COFF, ELF, Java, Javascript, and PDF files.
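Two of the feature types mentioned above, file entropy and file size, are simple enough to sketch directly. The example below is an illustration in Python 3 using Shannon entropy over byte values. The repository may compute these features differently, and the function names are hypothetical.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def file_features(data: bytes) -> dict:
    """Return a small feature dictionary for one file's raw bytes."""
    return {"file_size": len(data), "entropy": byte_entropy(data)}


# Made-up inputs: a uniform byte distribution and a constant buffer.
uniform = file_features(bytes(range(256)))
constant = file_features(b"\x00" * 1024)
```

High entropy is often taken as a hint that a file is packed or encrypted, which is one reason it is a common feature in malware classification. On its own it is a weak signal and is normally combined with the other feature sets.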

Maintenance & Community

The repository appears to be a personal project with no explicit mention of active maintenance, contributors, or community channels.

Licensing & Compatibility

The repository does not explicitly state a license. The tools it uses (Cuckoo Sandbox, IDA Pro, etc.) have their own licenses, some of which may restrict commercial use or impose other compatibility requirements.

Limitations & Caveats

  • The project relies heavily on Python 2.7, which is end-of-life, for key components such as Cuckoo Sandbox.
  • Setup is complex, requiring significant system configuration and large data downloads.
  • Performance metrics are primarily from older benchmarks and may not reflect current state-of-the-art.
  • No explicit license is provided, raising questions about usage rights.

Health Check

  • Last commit: 8 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

4 stars in the last 90 days
