Internet domains dataset for battling phishing attacks and research
This repository provides the world's largest publicly available dataset of Internet domains, aimed at researchers and security professionals. It offers a massive, sorted list of domains, processed from petabytes of internet traffic, enabling large-scale analysis without requiring users to manage the data collection themselves.
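Because the published files are pre-sorted, membership lookups against a loaded slice of the data can use binary search rather than a linear scan. A minimal sketch (the sample domains are illustrative, not taken from the dataset):

```python
import bisect

def domain_in_sorted_list(domain: str, sorted_domains: list[str]) -> bool:
    """Binary-search a sorted list of domains.

    Because the dataset files are published pre-sorted, a membership
    check needs only O(log n) comparisons instead of a full scan.
    """
    i = bisect.bisect_left(sorted_domains, domain)
    return i < len(sorted_domains) and sorted_domains[i] == domain

# Tiny in-memory stand-in for one slice of a data file:
sample = ["example.com", "example.net", "example.org"]
print(domain_in_sorted_list("example.net", sample))  # True
print(domain_in_sorted_list("example.dev", sample))  # False
```

For lists too large to hold in memory, the same idea extends to seeking within the decompressed files, but that requires an index and is beyond this sketch.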
How It Works
The project crawls the internet using the Scrapy and Colly frameworks, with partial robots.txt support and rate limiting. It collects domains and performs DNS checks via its "Freya" tool, which is still under development. The dataset is stored with Git LFS and compressed with XZ, so retrieving and unpacking it requires Git LFS and an XZ decompressor.
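Since the data files are XZ-compressed, they can be streamed line by line without first decompressing to disk. A minimal sketch using Python's standard-library `lzma` module (the file path is hypothetical; actual names come from the unpacked repository):

```python
import lzma
from typing import Iterator

def iter_domains(path: str) -> Iterator[str]:
    """Stream domains from an XZ-compressed text file, one per line.

    lzma.open in text mode decompresses on the fly, so even very
    large files can be processed with constant memory.
    """
    with lzma.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            domain = line.strip()
            if domain:  # skip blank lines
                yield domain

# Hypothetical usage after unpacking:
# count = sum(1 for _ in iter_domains("data/some-shard.txt.xz"))
```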
Quick Start & Requirements
git clone https://github.com/tb0hdan/domains.git && cd domains && git lfs install && ./unpack.sh
Maintenance & Community
The project is actively maintained by tb0hdan. The maintainer asks for support in the form of backlinks, dataset contributions, published research, and sponsorships.
Licensing & Compatibility
The dataset is provided for research purposes. The README does not state an explicit license for the dataset itself, so terms for reuse beyond research are unclear; the repository is simply hosted publicly on GitHub.
Limitations & Caveats
Access to larger, more recent domain counts (2.4B, 2.5B) is restricted to Patreon subscribers. The "Freya" DNS checking client is in early stages and not yet stable for general public use.