domains by tb0hdan

Internet domains dataset for battling phishing attacks and research

Created 6 years ago

1,074 stars

Top 35.3% on SourcePulse

1 Expert Loves This Project

Luca Soldaini

Research Scientist at Ai2

Project Summary

This repository provides the world's largest publicly available dataset of Internet domains, aimed at researchers and security professionals. It offers a massive, sorted list of domains, processed from petabytes of internet traffic, so users can run large-scale analysis without collecting the data themselves.

How It Works

The project crawls the internet with the Scrapy and Colly frameworks, with partial robots.txt support and rate limiting. It collects domains and runs DNS checks through its "Freya" tool, which is still under development. The dataset is stored in Git LFS and compressed with XZ, so retrieving and unpacking it requires both tools.
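
As a rough illustration of what robots.txt support and rate limiting mean for a crawler, here is a minimal Python sketch that uses only the standard library. It is not the project's code (Scrapy and Colly provide their own mechanisms); the PoliteFetcher class, the example-crawler user agent, and the one-second delay are all assumptions made for illustration.

```python
import time
from urllib.robotparser import RobotFileParser


class PoliteFetcher:
    """Hypothetical helper: decides whether and when a URL may be fetched."""

    def __init__(self, robots_lines, user_agent="example-crawler",
                 min_delay=1.0, clock=time.monotonic):
        self.user_agent = user_agent
        self.min_delay = min_delay      # assumed minimum gap between requests, in seconds
        self.clock = clock              # injectable so the timing logic can be tested
        self.last_request = None
        self.rules = RobotFileParser()
        self.rules.parse(robots_lines)  # parse robots.txt content that was already downloaded

    def allowed(self, url):
        """True if the parsed robots.txt rules permit fetching this URL."""
        return self.rules.can_fetch(self.user_agent, url)

    def wait_time(self):
        """Seconds to sleep before the next request to respect min_delay."""
        if self.last_request is None:
            return 0.0
        elapsed = self.clock() - self.last_request
        return max(0.0, self.min_delay - elapsed)

    def mark_request(self):
        """Record that a request was just sent."""
        self.last_request = self.clock()


ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
""".splitlines()

fetcher = PoliteFetcher(ROBOTS_TXT)
print(fetcher.allowed("https://example.com/index.html"))    # True
print(fetcher.allowed("https://example.com/private/data"))  # False
```

A real crawl loop would call time.sleep(fetcher.wait_time()) before each request and mark_request() right after sending it, keeping one such object per host.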

Quick Start & Requirements

Install: git clone https://github.com/tb0hdan/domains.git && cd domains && git lfs install && ./unpack.sh
Prerequisites: Git LFS, XZ utilities.
Data Size: Unpacked data for 1.7 billion domains is ~49GB.
More Info: Domains Project
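
Because the unpacked data is large, it can be useful to stream a single compressed file instead of unpacking everything. The following Python sketch assumes each .xz file is a text file with one domain per line; that layout, the function names, and the file path in the usage comment are assumptions, not details confirmed by the README.

```python
import lzma
from itertools import islice


def iter_domains(path):
    """Lazily yield one domain per non-empty line of an XZ-compressed text file."""
    with lzma.open(path, mode="rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            domain = line.strip()
            if domain:
                yield domain


def first_matching(path, suffix, limit=10):
    """Return up to `limit` domains ending in `suffix`, stopping as soon as enough are found."""
    matches = (d for d in iter_domains(path) if d.endswith(suffix))
    return list(islice(matches, limit))


# Hypothetical usage; substitute a real file from the cloned repository:
# print(first_matching("path/to/some-list.txt.xz", ".example", limit=5))
```

Reading through a generator keeps memory use flat regardless of file size, at the cost of decompressing the file again on every pass.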

Highlighted Details

Contains 1.7 billion domains, with larger datasets available via Patreon.
Processes up to 8.1PB of internet traffic for data collection.
Offers additional features like TLD-only lists and a WebSocket for new domains.
Data is gathered using Scrapy and Colly, with ongoing development of custom crawlers (Idun) and DNS checkers (Freya).
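
The format of the project's TLD-only lists is not described here, so the following is only a sketch of how such a list could be derived from any iterable of domain names. The tld_of and tld_counts helpers are hypothetical, and "TLD" is taken to mean the last dot-separated label (so example.co.uk counts under uk, not co.uk).

```python
from collections import Counter


def tld_of(domain):
    """Return the last label of a domain, lowercased (e.g. 'com')."""
    return domain.strip().rstrip(".").rsplit(".", 1)[-1].lower()


def tld_counts(domains):
    """Count how many domains fall under each top-level label."""
    return Counter(tld_of(d) for d in domains if d.strip())


sample = ["example.com", "Example.ORG.", "sub.example.com", "example.co.uk"]
counts = tld_counts(sample)
print(sorted(counts))         # ['com', 'org', 'uk']
print(counts.most_common(1))  # [('com', 2)]
```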

Maintenance & Community

The project is actively maintained by tb0hdan. The maintainer asks for support in the form of links to the project, contributed datasets, published research, and sponsorships.

Licensing & Compatibility

The dataset is provided for research purposes. The README does not explicitly state licensing terms for the dataset itself; the project is hosted publicly on GitHub.

Limitations & Caveats

The larger, more recent datasets (2.4B and 2.5B domains) are available only to Patreon subscribers. The "Freya" DNS checking client is at an early stage and not yet stable enough for general public use.

Health Check

Last Commit

1 week ago

Responsiveness

1+ week

Star History

15 stars in the last 30 days