esci-data by amazon-science

Benchmark dataset for product search R&D

Created 3 years ago

346 stars

Top 80.5% on SourcePulse

Project Summary

This repository provides the "Shopping Queries Data Set," a large-scale, multilingual benchmark for improving product search and semantic matching. It is designed for researchers and engineers working on query-product ranking, multi-class product classification (Exact, Substitute, Complement, Irrelevant), and product substitute identification. The dataset enables the development and evaluation of new ranking strategies and identification of product relationships.

How It Works

The dataset consists of query-product pairs with ESCI relevance judgments, along with product details. It is provided in two versions: a reduced version for ranking tasks and a larger version for classification and identification tasks. The data is stratified into training and testing splits and includes queries in English, Japanese, and Spanish. The project also offers baseline implementations using BERT and MPNet models for the defined tasks.
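
As a rough illustration of the layout described above, the sketch below joins a few made-up query-product judgments with product details and then selects the rows for each version. It uses plain Python on toy records. The field names (query, product_id, locale, esci_label, in_reduced, title) are assumptions chosen for illustration, not the dataset's confirmed schema, and treating the reduced version as a subset of the larger one is likewise an assumption.

```python
# Toy records; every field name and value here is hypothetical.
judgments = [
    {"query": "usb c cable", "product_id": "P1", "locale": "us",
     "esci_label": "E", "in_reduced": True},
    {"query": "usb c cable", "product_id": "P2", "locale": "us",
     "esci_label": "S", "in_reduced": True},
    {"query": "usb c cable", "product_id": "P3", "locale": "us",
     "esci_label": "C", "in_reduced": False},
    {"query": "zapatos rojos", "product_id": "P4", "locale": "es",
     "esci_label": "I", "in_reduced": False},
]
products = {
    ("us", "P1"): {"title": "USB-C to USB-C cable, 2 m"},
    ("us", "P2"): {"title": "USB-C to Lightning cable"},
    ("us", "P3"): {"title": "USB-C wall charger"},
    ("es", "P4"): {"title": "Sandalias azules"},
}


def join_with_products(judgments, products):
    """Attach product details to each judgment, keyed by (locale, product_id)."""
    joined = []
    for row in judgments:
        details = products.get((row["locale"], row["product_id"]), {})
        joined.append({**row, **details})
    return joined


rows = join_with_products(judgments, products)

# Reduced version: the subset intended for the ranking task.
ranking_rows = [r for r in rows if r["in_reduced"]]

# Larger version: assumed here to be all toy rows, used for
# classification and substitute identification.
classification_rows = rows

# Substitute identification is binary: is the product a Substitute or not?
substitute_targets = [int(r["esci_label"] == "S") for r in classification_rows]
```

Keying products by both locale and product ID is a design choice in this sketch: it avoids assuming that a product ID is unique across languages.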

Quick Start & Requirements

Install dependencies using pip install -r requirements.txt.
Requires Python 3.6, numpy, pandas, transformers, scikit-learn, and sentence-transformers.
Baseline experiments are run with scripts named ./launch-experiments-taskK.sh and ./launch-predictions-taskK.sh, where K is the task number (1, 2, or 3).
Reproducing the Task 1 baseline also requires the Terrier source code (version 5.5).
Official documentation and baseline results are available.

Highlighted Details

Supports three distinct tasks: Query-Product Ranking, Multi-class Product Classification, and Product Substitute Identification.
Includes a reduced version for Task 1 (48,300 queries, 1.1M judgments) and a larger version for Tasks 2 and 3 (130,652 queries, 2.6M judgments).
Baseline results show nDCG of 0.83 for Task 1 and Macro F1 scores of 0.23 and 0.44 for Tasks 2 and 3, respectively.
Data includes query text, product details (title, description, brand, color), and ESCI labels.
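
The metrics quoted above can be made concrete with a small sketch. The functions below compute nDCG for one ranked list and macro-averaged F1 for a set of predictions, in plain Python. The gain assigned to each ESCI label (GAINS) is an assumption for illustration only; the text above does not state the values used for the baseline, and the repository's own evaluation may differ.

```python
import math

# Hypothetical gain per ESCI label; assumed for illustration only.
GAINS = {"E": 1.0, "S": 0.1, "C": 0.01, "I": 0.0}


def dcg(labels):
    """Discounted cumulative gain of labels in ranked order (best rank first)."""
    return sum(GAINS[label] / math.log2(i + 2) for i, label in enumerate(labels))


def ndcg(ranked_labels):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = sorted(ranked_labels, key=GAINS.get, reverse=True)
    best = dcg(ideal)
    return dcg(ranked_labels) / best if best > 0 else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over classes seen in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    if not classes:
        return 0.0
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# A ranking already in ideal order, and one with an Irrelevant product first.
perfect = ndcg(["E", "S", "C", "I"])
swapped = ndcg(["I", "E", "S", "C"])

# Four-class toy predictions with one Exact item mislabeled as Substitute.
f1 = macro_f1(["E", "E", "S", "I"], ["E", "S", "S", "I"])
```

Because macro F1 averages the per-class scores without weighting by class frequency, rare classes count as much as common ones, which is one reason a macro score can sit well below overall accuracy on imbalanced labels.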

Maintenance & Community

The project is associated with Amazon Science. Further community interaction details are not explicitly provided in the README.

Licensing & Compatibility

Licensed under the Apache-2.0 License.
Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The repository is primarily a data release: baseline scripts are provided, but further model training and fine-tuning are left to the user. Reproducing the Task 1 results requires external dependencies such as the Terrier source code.
