MS-MARCO-Web-Search by microsoft

Web dataset for information retrieval research

Created 2 years ago

346 stars

Top 80.5% on SourcePulse


2 Experts Love This Project

Eugene Yan

AI Scientist at AWS

Philipp Schmid

DevRel at Google DeepMind

Project Summary

This repository provides the MS MARCO Web Search dataset, a large-scale, information-rich collection designed for evaluating web search and retrieval systems. It targets researchers and engineers working on information retrieval, natural language processing, and machine learning, offering millions of real user query-document click labels and a comprehensive web document corpus. The dataset enables benchmarking of embedding models, embedding retrieval systems, and end-to-end retrieval solutions at web scale.

How It Works

The dataset uses the ClueWeb22 corpus, which comprises approximately 10 billion web pages, to simulate real-world web search conditions. It includes millions of unique queries across 93 languages, paired with millions of query-document relevance labels derived from Microsoft Bing search logs. The dataset supports three retrieval tasks: embedding model ranking, embedding retrieval (which focuses on approximate nearest neighbor, or ANN, performance), and end-to-end retrieval. Together these allow evaluation of individual system components as well as overall performance.
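To make the embedding retrieval task concrete, here is a minimal sketch of how ANN quality is commonly measured: run an exact brute-force search to get the true nearest neighbors, then count how many of them an approximate index returns. The function names `exact_top_k` and `ann_recall`, the inner-product similarity, and the toy vectors are illustrative assumptions, not taken from the repository.

```python
import numpy as np


def exact_top_k(queries: np.ndarray, docs: np.ndarray, k: int) -> np.ndarray:
    """Return, per query, the indices of the k docs with highest inner product."""
    scores = queries @ docs.T  # shape: (num_queries, num_docs)
    # Sorting the negated scores ascending yields descending score order;
    # a stable sort keeps the lower index first when scores tie.
    return np.argsort(-scores, axis=1, kind="stable")[:, :k]


def ann_recall(approx: np.ndarray, exact: np.ndarray) -> float:
    """Fraction of the exact top-k neighbors that the approximate lists recover."""
    hits = sum(
        len(set(a) & set(e)) for a, e in zip(approx.tolist(), exact.tolist())
    )
    return hits / exact.size


# Toy data (assumed): 4 documents and 2 queries in 2 dimensions.
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
queries = np.array([[1.0, 0.0], [0.0, 2.0]])

exact = exact_top_k(queries, docs, k=2)   # [[0, 2], [1, 2]]
approx = np.array([[0, 3], [1, 2]])       # pretend output of an ANN index
print(ann_recall(approx, exact))          # 0.75
```

Brute-force search like this is only practical on small samples. At the 100M or 10B scale it serves as the ground truth for a subset of queries, while the ANN index handles the full corpus.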

Quick Start & Requirements

Dataset Access: Download links for the 100M and 10B datasets are provided, including document mappings, query files, relevance judgments (qrels), and pre-computed embedding vectors (e.g., SimANS).
Corpus: Requires access to the ClueWeb22 collection (https://lemurproject.org/clueweb22.php/).
Dependencies: Some tasks may require libraries for handling large datasets and performing retrieval; the vector file formats point to SPTAG for ANN search.
Resource Footprint: The 100M dataset includes ~290GB of document embedding vectors and ~678MB of query data. The 10B dataset includes ~2.43GB of query relevance data.
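Once the qrels are downloaded, a first step is usually to load them into a lookup from query ID to relevant document IDs. The sketch below assumes a TREC-style layout with four whitespace-separated columns (query ID, iteration, document ID, label). That layout is an assumption, so check the actual files before relying on it. The name `parse_qrels` and the sample IDs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def parse_qrels(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each query ID to the set of document IDs labeled relevant.

    Assumes TREC-style rows: <query_id> <iteration> <doc_id> <label>.
    """
    relevant: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed rows
        query_id, _iteration, doc_id, label = parts
        if int(label) > 0:  # keep only positive labels
            relevant[query_id].add(doc_id)
    return dict(relevant)


# Made-up sample rows; real IDs will look different.
sample = [
    "q1\t0\tdocA\t1",
    "q1\t0\tdocB\t0",
    "q2\t0\tdocC\t1",
    "",
]
print(parse_qrels(sample))  # {'q1': {'docA'}, 'q2': {'docC'}}
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, which streams the file instead of loading it into memory at once.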

Highlighted Details

Features millions of real query-document click labels from Microsoft Bing.
Uses ClueWeb22 (10 billion web pages) as the document corpus.
Supports three distinct retrieval tasks: embedding model ranking, embedding retrieval, and end-to-end retrieval.
Provides baseline performance metrics (MRR, Recall, QPS, Latency) for various retrieval methods.
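For readers unfamiliar with the quality metrics named above, the following sketch computes MRR@k and Recall@k using their common textbook definitions. This summary does not state the repository's exact cut-offs or how it treats queries with no results, so those details are assumptions here, as are the function names and toy data.

```python
from typing import Dict, List, Set


def mrr_at_k(run: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant doc within the top k.

    Queries with no relevant doc in the top k contribute 0.
    """
    if not qrels:
        return 0.0
    total = 0.0
    for query_id, relevant in qrels.items():
        for rank, doc_id in enumerate(run.get(query_id, [])[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(qrels)


def recall_at_k(run: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int = 10) -> float:
    """Mean fraction of each query's relevant docs found in the top k.

    Assumes every query in qrels has at least one relevant doc.
    """
    if not qrels:
        return 0.0
    total = 0.0
    for query_id, relevant in qrels.items():
        retrieved = set(run.get(query_id, [])[:k])
        total += len(retrieved & relevant) / len(relevant)
    return total / len(qrels)


# Toy data (assumed): ranked results per query and their relevant docs.
qrels = {"q1": {"d1"}, "q2": {"d5", "d6"}}
run = {"q1": ["d3", "d1", "d2"], "q2": ["d5", "d9", "d8"]}

print(mrr_at_k(run, qrels, k=3))     # 0.75  -> (1/2 + 1/1) / 2
print(recall_at_k(run, qrels, k=3))  # 0.75  -> (1/1 + 1/2) / 2
```

QPS and latency are system measurements rather than ranking metrics, so they depend on hardware and index configuration and are not covered by this sketch.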

Maintenance & Community

Developed and maintained by Microsoft.
Contributions are welcome via pull requests, subject to a Contributor License Agreement (CLA).
Follows the Microsoft Open Source Code of Conduct.

Licensing & Compatibility

Documentation: Licensed under Creative Commons Attribution 4.0 International Public License (CC BY 4.0).
Code: Licensed under the MIT License.
Dataset Usage: Intended for non-commercial research purposes only. Using the dataset carries risk because Microsoft may not own the underlying rights to every document.

Limitations & Caveats

The dataset is provided "as is" without warranty.
Usage is restricted to non-commercial research; commercial use requires independent legal review.
Users must not use Microsoft trademarks without permission.

Health Check

Last Commit

1 year ago

Responsiveness

Inactive

Star History

0 stars in the last 30 days