hyde  by texttron

Research paper code for zero-shot dense retrieval

created 2 years ago
539 stars

Top 59.7% on sourcepulse

Project Summary

HyDE (Hypothetical Document Embeddings) is a zero-shot dense retrieval method. GPT-3 generates a synthetic document for each query, and Contriever encodes that document for embedding-space search. The approach needs no human-labeled relevance data and is reported to improve retrieval performance across tasks and languages over the unsupervised Contriever baseline.

How It Works

HyDE instructs GPT-3 to generate a plausible but fictional document for a given query. The unsupervised Contriever model then encodes this synthetic document, and the resulting embedding is used to search the corpus embedding space, so retrieval requires no relevance judgments. The method relies on the generative capabilities of large language models to build a richer representation for retrieval.
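
The generate, encode, search flow described above can be sketched as follows. This is a minimal illustration, not the repository's code: `toy_generate` is a hypothetical stand-in for the GPT-3 call and returns a canned passage, `toy_encode` is a hypothetical hashed bag-of-words encoder standing in for Contriever, and the brute-force inner-product scan stands in for a FAISS index. All names, the embedding size, and the sample corpus are assumptions made for the example.

```python
import hashlib
import math
from typing import Callable, List, Tuple

DIM = 256  # toy embedding size, an arbitrary choice for this sketch

# Tiny illustrative corpus (made up for the example).
CORPUS = [
    "Wisdom teeth typically erupt between ages 17 and 25 and may take months to emerge through the gums.",
    "The Manhattan Project produced the first nuclear weapons during World War II.",
    "Python lists are mutable sequences that support indexing and slicing.",
]


def toy_encode(text: str) -> List[float]:
    """Stand-in for Contriever: hashed bag-of-words vector, L2-normalised."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        token = token.strip(".,;:!?\"'()")
        if not token:
            continue
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def toy_generate(query: str) -> str:
    """Stand-in for the GPT-3 call: ignores the query and returns a canned passage."""
    return ("Wisdom teeth usually erupt between ages 17 and 25 and can take "
            "several months to fully emerge through the gums.")


def hyde_search(
    query: str,
    corpus: List[str],
    generate: Callable[[str], str],
    encode: Callable[[str], List[float]],
    k: int = 1,
) -> List[Tuple[float, str]]:
    """Generate a hypothetical document, encode it, and rank the corpus by inner product."""
    hypothetical_doc = generate(query)    # step 1: generate a synthetic document
    query_vec = encode(hypothetical_doc)  # step 2: encode the document, not the query
    scored = [
        (sum(a * b for a, b in zip(query_vec, encode(doc))), doc)
        for doc in corpus
    ]                                     # step 3: score every corpus document
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    query = "how long does it take for wisdom teeth to come in"
    for score, doc in hyde_search(query, CORPUS, toy_generate, toy_encode, k=1):
        print(f"{score:.3f}  {doc}")
```

The point of the sketch is that the search vector comes from the generated passage rather than from the query string. In a real setup the two callables would presumably wrap an LLM API call and a Contriever encoder, and the linear scan would be replaced by an index lookup.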

Quick Start & Requirements

  • Install pyserini: https://github.com/castorini/pyserini
  • Download Contriever FAISS index: wget https://www.dropbox.com/s/dytqaqngaupp884/contriever_msmarco_index.tar.gz
  • Set GPT-3 API key: export OPENAI=<your key>
  • Run experiments: hyde-dl19.ipynb or hyde-demo.ipynb
  • Requires Python, OpenAI API access, and the Contriever FAISS index.
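
Before opening the notebooks, the prerequisites listed above can be checked with a short preflight script. This is a sketch under assumptions: the function name `missing_prerequisites` is hypothetical, and the index directory name `contriever_msmarco_index` is only a guess based on the archive name; adjust it to wherever the archive was actually extracted.

```python
import os
from pathlib import Path
from typing import List, Mapping


def missing_prerequisites(env: Mapping[str, str], index_dir: Path) -> List[str]:
    """Return human-readable problems; an empty list means the setup looks ready."""
    problems = []
    # The Quick Start exports the API key under the name OPENAI.
    if not env.get("OPENAI"):
        problems.append("environment variable OPENAI is not set")
    # The downloaded index archive must be extracted to a directory.
    if not index_dir.is_dir():
        problems.append(f"index directory not found: {index_dir}")
    return problems


if __name__ == "__main__":
    # Assumed location: the archive extracted into the current directory.
    for problem in missing_prerequisites(os.environ, Path("contriever_msmarco_index")):
        print("missing:", problem)
```

The script prints nothing when both checks pass. It does not verify that pyserini is installed or that the API key is valid.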

Highlighted Details

  • Outperforms Contriever across tasks and languages in zero-shot settings.
  • Eliminates the need for human-labeled relevance data.
  • Leverages GPT-3 for synthetic document generation and Contriever for retrieval.

Maintenance & Community

The project is associated with the paper "HyDE: Precise Zero-Shot Dense Retrieval without Relevance Labels" by Luyu Gao, Xueguang Ma, Jimmy Lin, and Jamie Callan. Further community engagement details are not provided in the README.

Licensing & Compatibility

The README does not state a license. The code is provided for research purposes; terms for commercial use or for linking from closed-source software are not specified.

Limitations & Caveats

The method relies on GPT-3, which requires API access and incurs costs. The performance is dependent on the quality of synthetic documents generated by GPT-3 and the effectiveness of the Contriever model.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 10 stars in the last 90 days
