TrafficLLM by ZGC-LLM-Safety

LLM adaptation framework for network traffic analysis

Created 1 year ago
310 stars

Top 86.7% on SourcePulse

Project Summary

TrafficLLM is a framework for adapting open-source Large Language Models (LLMs) to network traffic analysis tasks, enabling robust traffic representation and generalization across detection and generation scenarios. It targets researchers and practitioners in cybersecurity and network analysis seeking to leverage LLMs for understanding and manipulating network data.

How It Works

TrafficLLM employs a three-pronged approach: traffic-domain tokenization to bridge the gap between natural language and network data, a dual-stage tuning pipeline for instruction understanding and task-specific pattern learning, and Extensible Adaptation with Parameter-Effective Fine-Tuning (EA-PEFT) to efficiently adapt models to new traffic environments with minimal parameter updates.
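The flow described above can be pictured with a toy sketch. Everything here is an assumption for illustration: the hex-byte tokenizer, the keyword-based instruction matcher, and the `AdapterRegistry` class are hypothetical stand-ins, not TrafficLLM's actual code. In the real framework, both stages are tuned LLM components, and the adapters are fine-tuned parameter sets rather than Python callables.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical keyword table; the task codes come from the summary's task list.
TASK_KEYWORDS = {
    "MTD": ("malware",),
    "BND": ("botnet",),
    "EVD": ("vpn",),
}


def traffic_to_tokens(payload: bytes) -> List[str]:
    """Toy traffic tokenizer: one two-digit hex token per byte."""
    return [f"{byte:02x}" for byte in payload]


def understand_instruction(instruction: str) -> str:
    """Stage 1 stand-in: map a natural-language instruction to a task code."""
    text = instruction.lower()
    for task, keywords in TASK_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return task
    raise ValueError(f"no known task matches: {instruction!r}")


@dataclass
class AdapterRegistry:
    """Stage 2 stand-in: one small task-specific adapter per task code.

    Registering a new adapter leaves the existing ones untouched, which is
    the property the summary attributes to EA-PEFT.
    """

    adapters: Dict[str, Callable[[List[str]], str]] = field(default_factory=dict)

    def register(self, task: str, adapter: Callable[[List[str]], str]) -> None:
        self.adapters[task] = adapter

    def run(self, instruction: str, payload: bytes) -> str:
        task = understand_instruction(instruction)
        if task not in self.adapters:
            raise KeyError(f"no adapter registered for task {task}")
        return self.adapters[task](traffic_to_tokens(payload))
```

Under these assumptions, supporting a new traffic environment means calling `register` with one more adapter. The shared tokenizer and instruction stage stay as they are.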

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n trafficllm python=3.9), activate it (conda activate trafficllm), and install dependencies (pip install -r requirements.txt). Additional packages (rouge_chinese, nltk, jieba, datasets) are needed for training.
  • Prerequisites: Base LLM checkpoints (e.g., ChatGLM2-6B, Llama2), raw traffic datasets for preprocessing. GPU acceleration is implied for training and inference.
  • Setup: Environment setup needs only the conda and pip steps above. Training runs in multiple stages (data preprocessing, then fine-tuning) and can be resource-intensive.
  • Resources: Preprint Paper, Tutorials, Adapt2GLM4.

Highlighted Details

  • Supports adaptation for various traffic analysis tasks including Malware Traffic Detection (MTD), Botnet Detection (BND), and Encrypted VPN Detection (EVD).
  • Provides over 0.4M traffic data samples and 9K human instructions for LLM fine-tuning.
  • Includes code for generating pcap files using Scapy for Wireshark compatibility.
  • Offers EA-PEFT for efficient, modular adaptation to new traffic patterns and tasks.
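To make the pcap point concrete, here is a minimal sketch of what a Wireshark-readable capture file contains. It is not the repository's code: the project uses Scapy, whereas this sketch swaps in Python's standard `struct` module so the classic pcap layout is visible byte by byte. The function name `build_pcap` and its parameters are hypothetical.

```python
import struct

PCAP_MAGIC = 0xA1B2C3D4   # classic pcap magic number (microsecond timestamps)
LINKTYPE_ETHERNET = 1     # link-layer type for Ethernet frames


def build_pcap(frames, snaplen=65535):
    """Serialize (ts_sec, ts_usec, frame_bytes) tuples into classic pcap bytes.

    Hypothetical helper: writes a 24-byte global header followed by one
    16-byte record header plus the captured bytes for each frame.
    """
    # Global header: magic, version 2.4, timezone offset, timestamp accuracy,
    # snapshot length, link type (all little-endian).
    parts = [struct.pack("<IHHiIII", PCAP_MAGIC, 2, 4, 0, 0,
                         snaplen, LINKTYPE_ETHERNET)]
    for ts_sec, ts_usec, frame in frames:
        captured = frame[:snaplen]  # truncate to the snapshot length
        # Record header: timestamp, captured length, original length.
        parts.append(struct.pack("<IIII", ts_sec, ts_usec,
                                 len(captured), len(frame)))
        parts.append(captured)
    return b"".join(parts)
```

Writing the returned bytes to a file opened in binary mode gives a capture in the classic pcap format. With Scapy, the same result comes from a single `wrpcap` call on a list of packets.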

Maintenance & Community

The project is actively developed, with recent updates including support for GLM4 and packet generation capabilities. Links to community resources are not explicitly provided in the README.

Licensing & Compatibility

The README does not specify a license. The project acknowledges ChatGLM2 and Llama2 as foundational models, so their respective licenses likely apply to any derived checkpoints. Terms for commercial use or closed-source linking are not stated.

Limitations & Caveats

The project is based on specific LLM versions (ChatGLM2, Llama2), and adapting other LLMs may require significant modifications. The README mentions optional training of a custom traffic-domain tokenizer, suggesting that default tokenization might not cover all use cases.

Health Check

  • Last Commit: 5 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 9 stars in the last 30 days
