esci-data by amazon-science

Benchmark dataset for product search R&D

created 3 years ago
305 stars

Top 88.9% on sourcepulse

Project Summary

This repository provides the "Shopping Queries Data Set," a large-scale, multilingual benchmark for improving product search and semantic matching. It is designed for researchers and engineers working on query-product ranking, multi-class product classification (Exact, Substitute, Complement, Irrelevant), and product substitute identification. The dataset enables the development and evaluation of new ranking strategies and identification of product relationships.

How It Works

The dataset consists of query-product pairs with ESCI relevance judgments, along with product details. It is provided in two versions: a reduced version for ranking tasks and a larger version for classification and identification tasks. The data is stratified into training and testing splits and includes queries in English, Japanese, and Spanish. The project also offers baseline implementations using BERT and MPNet models for the defined tasks.
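As a rough illustration of how the two versions and the train/test splits might be selected, the sketch below filters a few toy rows in plain Python. The field names (`small_version`, `large_version`, `split`, `product_locale`, `esci_label`) and the `select_rows` helper are assumptions made for this example and are not stated in this summary; check the repository's data files for the actual schema.

```python
# Toy rows with an ASSUMED layout; all values are invented for illustration.
rows = [
    {"query": "usb c cable", "product_id": "P1", "product_locale": "us",
     "esci_label": "E", "small_version": 1, "large_version": 1, "split": "train"},
    {"query": "usb c cable", "product_id": "P2", "product_locale": "us",
     "esci_label": "S", "small_version": 1, "large_version": 1, "split": "test"},
    {"query": "funda movil", "product_id": "P3", "product_locale": "es",
     "esci_label": "C", "small_version": 0, "large_version": 1, "split": "train"},
    {"query": "イヤホン", "product_id": "P4", "product_locale": "jp",
     "esci_label": "I", "small_version": 0, "large_version": 1, "split": "test"},
]


def select_rows(rows, version, split, locale=None):
    """Return rows of one dataset version and split, optionally for one locale.

    `version` is "reduced" (ranking) or "larger" (classification and
    substitute identification). The flag column names are assumptions.
    """
    flag = {"reduced": "small_version", "larger": "large_version"}[version]
    return [
        r for r in rows
        if r[flag] == 1
        and r["split"] == split
        and (locale is None or r["product_locale"] == locale)
    ]


# Reduced version, training split: candidate input for the ranking task.
ranking_train = select_rows(rows, "reduced", "train")

# Larger version, test split: candidate input for the classification tasks.
classification_test = select_rows(rows, "larger", "test")
```

With real data, the same filter would typically be a boolean mask over a pandas DataFrame; plain lists are used here only to keep the sketch dependency-free.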

Quick Start & Requirements

  • Install dependencies using pip install -r requirements.txt.
  • Requires Python 3.6, numpy, pandas, transformers, scikit-learn, and sentence-transformers.
  • Baseline experiments are run with scripts such as ./launch-experiments-taskK.sh and ./launch-predictions-taskK.sh, where K is the task number.
  • Task 1 ranking models require the terrier source code (version 5.5).
  • Documentation and baseline results are provided in the repository README.

Highlighted Details

  • Supports three distinct tasks: Query-Product Ranking, Multi-class Product Classification, and Product Substitute Identification.
  • Includes a reduced dataset (48,300 queries, 1.1M judgments) and a larger dataset (130,652 queries, 2.6M judgments).
  • Baseline results show nDCG of 0.83 for Task 1 and Macro F1 scores of 0.23 and 0.44 for Tasks 2 and 3, respectively.
  • Data includes query text, product details (title, description, brand, color), and ESCI labels.
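The two baseline metrics quoted above can be sketched in a few lines of standard-library Python. This is a minimal sketch for orientation, not the repository's evaluation code: the per-label gains in `GAINS` are an assumption for illustration, and the functions `dcg`, `ndcg`, and `macro_f1` are hypothetical helpers written for this example.

```python
import math

# ASSUMED gain per ESCI label (Exact, Substitute, Complement, Irrelevant).
GAINS = {"E": 1.0, "S": 0.1, "C": 0.01, "I": 0.0}


def dcg(labels):
    """Discounted cumulative gain of labels listed in ranked order."""
    return sum(GAINS[label] / math.log2(rank + 2)
               for rank, label in enumerate(labels))


def ndcg(ranked_labels):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(ranked_labels, key=GAINS.get, reverse=True))
    return dcg(ranked_labels) / ideal if ideal > 0 else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# One query whose products were ranked with the irrelevant item first.
query_score = ndcg(["I", "E", "S"])

# Four judgments, one Substitute misclassified as Exact.
f1 = macro_f1(["E", "S", "E", "I"], ["E", "E", "E", "I"])
```

Macro F1 weights every class equally, so a class the model never predicts correctly pulls the average down sharply even when overall accuracy is high.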

Maintenance & Community

The project is associated with Amazon Science. Further community interaction details are not explicitly provided in the README.

Licensing & Compatibility

  • Licensed under the Apache-2.0 License.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The repository is primarily a data release; baseline scripts are provided, but extensive model training and fine-tuning are left to the user. Reproducing the Task 1 results requires external dependencies such as the terrier source code (version 5.5).

Health Check

  • Last commit: 10 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 20 stars in the last 90 days
