Benchmark dataset for product search R&D
This repository provides the "Shopping Queries Data Set," a large-scale, multilingual benchmark for improving product search and semantic matching. It is designed for researchers and engineers working on query-product ranking, multi-class product classification (Exact, Substitute, Complement, Irrelevant), and product substitute identification. The dataset enables the development and evaluation of new ranking strategies and identification of product relationships.
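The three uses named above (ranking, four-class classification, substitute identification) can all be derived from the same ESCI label. The following is a minimal sketch of that mapping, not the repository's own code: the `Targets` type, the `targets_for` helper, and the numeric gains in `ASSUMED_GAIN` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# The four ESCI classes named in the description above.
ESCI_CLASSES = ("Exact", "Substitute", "Complement", "Irrelevant")

# ASSUMPTION: illustrative graded gains for ranking evaluation.
# These values are not taken from the source text.
ASSUMED_GAIN = {
    "Exact": 1.0,
    "Substitute": 0.1,
    "Complement": 0.01,
    "Irrelevant": 0.0,
}


@dataclass(frozen=True)
class Targets:
    """Hypothetical container for the three per-pair training targets."""
    ranking_gain: float   # graded relevance for query-product ranking
    class_id: int         # index into ESCI_CLASSES for 4-class classification
    is_substitute: bool   # binary target for substitute identification


def targets_for(label: str) -> Targets:
    """Derive all three task targets from one ESCI label."""
    if label not in ESCI_CLASSES:
        raise ValueError(f"unknown ESCI label: {label!r}")
    return Targets(
        ranking_gain=ASSUMED_GAIN[label],
        class_id=ESCI_CLASSES.index(label),
        is_substitute=(label == "Substitute"),
    )
```

For example, `targets_for("Substitute")` yields a low ranking gain, class index 1, and a positive substitute flag, so one labeled pair can serve all three tasks.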
How It Works
The dataset consists of query-product pairs with ESCI relevance judgments, along with product details. It is provided in two versions: a reduced version for ranking tasks and a larger version for classification and identification tasks. The data is stratified into training and testing splits and includes queries in English, Japanese, and Spanish. The project also offers baseline implementations using BERT and MPNet models for the defined tasks.
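Selecting a working subset by split, language, and dataset version might look like the sketch below. It uses in-memory rows rather than the real files, and every field name (`locale`, `split`, `reduced`), every locale code, and the `select` helper are assumptions made for illustration, not the dataset's documented schema.

```python
from typing import Iterable, Optional

# ASSUMPTION: toy rows with hypothetical field names and locale codes.
ROWS = [
    {"query": "usb c cable", "product_id": "P1", "esci_label": "Exact",
     "locale": "en", "split": "train", "reduced": True},
    {"query": "usb c cable", "product_id": "P2", "esci_label": "Substitute",
     "locale": "en", "split": "train", "reduced": False},
    {"query": "ワイヤレスイヤホン", "product_id": "P3", "esci_label": "Irrelevant",
     "locale": "ja", "split": "test", "reduced": True},
    {"query": "funda movil", "product_id": "P4", "esci_label": "Complement",
     "locale": "es", "split": "train", "reduced": True},
]


def select(rows: Iterable[dict], *, split: str,
           locale: Optional[str] = None,
           reduced_only: bool = False) -> list:
    """Return rows in the given split, optionally limited to one locale
    and/or to the reduced version intended for ranking tasks."""
    return [
        row for row in rows
        if row["split"] == split
        and (locale is None or row["locale"] == locale)
        and (not reduced_only or row["reduced"])
    ]


# Ranking experiments would use the reduced version of the training split;
# classification and identification would use the larger version.
ranking_train = select(ROWS, split="train", reduced_only=True)
classification_train = select(ROWS, split="train")
```

Keeping the version flag as a filter over one table, rather than two separate tables, is a design choice of this sketch only.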
Quick Start & Requirements
Install dependencies: pip install -r requirements.txt
Run the baselines: ./launch-experiments-taskK.sh and ./launch-predictions-taskK.sh, where K is the task number.
Highlighted Details
Maintenance & Community
The project is associated with Amazon Science. Further community interaction details are not explicitly provided in the README.
Licensing & Compatibility
Limitations & Caveats
The repository is primarily a data release: baseline scripts are provided, but extensive model training and fine-tuning are left to the user. Reproducing the Task 1 results requires external dependencies such as the Terrier source code.