malware-detection by dchad

Malware detection via ML experiments

created 9 years ago

340 stars

Top 82.2% on sourcepulse

Project Summary

This repository explores malware detection and classification using machine learning, targeting security researchers and developers. It provides a framework for feature engineering, selection, and model training on large malware datasets, aiming to improve detection accuracy.

How It Works

The project employs a multi-stage approach to malware analysis. It begins with feature extraction from disassembled binaries (ASM), file metadata, and call graphs. Techniques like chi-squared tests are used for feature selection, prioritizing features with high variance and predictive power. Various classifiers, including ExtraTreesClassifier, XGBoost, and LightGBM, are evaluated, with ensemble methods and stacked models explored for enhanced performance.
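The selection-then-classification flow described above can be sketched roughly as follows, using scikit-learn. This is a minimal illustration, not the repository's code: the synthetic count data, the choice of k=10, and every variable name are assumptions. scikit-learn's chi2 scorer expects non-negative features (counts or frequencies) and ranks them by their dependence on the class label.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for opcode/feature counts (assumption, not real malware data).
rng = np.random.default_rng(0)
n_samples, n_features = 400, 50
y = rng.integers(0, 2, size=n_samples)
X = rng.poisson(lam=5.0, size=(n_samples, n_features)).astype(float)

# Make the first 5 columns informative: class 1 gets noticeably higher counts.
X[:, :5] += y[:, None] * rng.poisson(lam=10.0, size=(n_samples, 5))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Keep the 10 features with the highest chi-squared score, then fit the trees.
model = make_pipeline(
    SelectKBest(chi2, k=10),
    ExtraTreesClassifier(n_estimators=100, random_state=0),
)
model.fit(X_train, y_train)

# Column indices that survived selection, and held-out accuracy.
selected = model.named_steps["selectkbest"].get_support(indices=True)
accuracy = model.score(X_test, y_test)
print("selected feature indices:", selected)
print("held-out accuracy:", accuracy)
```

Wrapping selection and classifier in one pipeline keeps the chi-squared scores from being computed on held-out data. The same pipeline shape would accept an XGBoost or LightGBM classifier in place of ExtraTreesClassifier.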

Quick Start & Requirements

Installation: Cuckoo Sandbox requires Python 2.7. Python dependencies include numpy, scipy, scikit-learn, matplotlib, jupyter, pandas, xgboost, and cython. Disassembly requires binutils built with multi-architecture support.
Data: Large malware datasets (e.g., VirusShare.com archives, ~25GB each) are required for training.
Tools: Cuckoo Sandbox, IDA Pro, Volatility, TrID, ClamAV are used for analysis and feature extraction.
Setup: Significant setup time is expected due to the complexity of installing dependencies and acquiring/processing large datasets. Detailed setup instructions are provided for Debian/Ubuntu and Windows.

Highlighted Details

Reports 99.81% accuracy with XGBoost using 623 features on a smaller dataset.
Explores feature sets derived from ASM, file entropy, file size, call graphs, and behavioral analysis.
Investigates ensemble methods and stacked classifiers for improved robustness.
Includes detailed workflows for training label generation and feature engineering for PE/COFF, ELF, Java, Javascript, and PDF files.
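Of the feature families listed above, file entropy and file size are the simplest to show. The sketch below computes byte-level Shannon entropy in bits per byte using only the standard library. The function names and the returned dictionary layout are hypothetical, chosen for illustration rather than taken from the repository.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # frequency of each byte value
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def static_features(data: bytes) -> dict:
    """Bundle two simple static features for one file (hypothetical layout)."""
    return {"file_size": len(data), "entropy": shannon_entropy(data)}


if __name__ == "__main__":
    uniform = bytes(range(256))   # every byte value exactly once
    constant = b"\x00" * 1024     # a single repeated byte value
    print(static_features(uniform))
    print(static_features(constant))
```

Entropy near 8 bits per byte often indicates compressed or encrypted content such as a packed executable, while long runs of a single value push it toward 0. In practice the bytes would come from reading each sample in binary mode.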

Maintenance & Community

The repository appears to be a personal project with no explicit mention of active maintenance, contributors, or community channels.

Licensing & Compatibility

The repository does not explicitly state a license. The third-party tools it relies on (Cuckoo Sandbox, IDA Pro, etc.) carry their own licenses, some of which may restrict commercial use or impose other compatibility requirements.
