Research paper code for zero-shot dense retrieval
HyDE is a zero-shot dense retrieval method that uses GPT-3 to generate a hypothetical document for each query, which is then encoded by Contriever for embedding-space search. The approach requires no human-labeled relevance data and substantially outperforms unsupervised baselines such as vanilla Contriever across tasks and languages.
How It Works
HyDE instructs GPT-3 to generate a plausible, yet fictional, document for a given query. This hypothetical document is then encoded with the unsupervised Contriever model, and the resulting embedding is used as the query vector for nearest-neighbor search over the document embedding space, so retrieval requires no relevance judgments. The method capitalizes on the generative capabilities of large language models to produce rich query representations for dense retrieval.
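The pipeline can be summarized in a few lines. The sketch below is illustrative rather than the repository's actual code: it assumes the OpenAI Python client, the public facebook/contriever checkpoint on Hugging Face, and a FAISS index already built over corpus embeddings. The model name and prompt are stand-ins, and the paper additionally averages several sampled documents together with the original query embedding.

```python
# Minimal sketch of the HyDE idea, not the repo's actual code.
import faiss
import torch
from openai import OpenAI
from transformers import AutoTokenizer, AutoModel

client = OpenAI()  # reads OPENAI_API_KEY from the environment
tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
encoder = AutoModel.from_pretrained("facebook/contriever")

def generate_hypothetical_document(query: str) -> str:
    """Ask the LLM for a plausible (possibly factually wrong) answer passage."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # stand-in; the paper used GPT-3 (InstructGPT)
        messages=[{
            "role": "user",
            "content": f"Please write a passage to answer the question.\nQuestion: {query}\nPassage:",
        }],
    )
    return response.choices[0].message.content

def contriever_embed(text: str) -> torch.Tensor:
    """Mean-pool Contriever token embeddings into a single dense vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = encoder(**inputs)
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    return (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

def hyde_search(query: str, index: faiss.Index, k: int = 10):
    """Encode the hypothetical document and use it as the query vector."""
    hypothetical_doc = generate_hypothetical_document(query)
    vector = contriever_embed(hypothetical_doc).numpy()
    scores, doc_ids = index.search(vector, k)
    return scores[0], doc_ids[0]
```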
Quick Start & Requirements
1. Download the prebuilt Contriever MS MARCO index:
   wget https://www.dropbox.com/s/dytqaqngaupp884/contriever_msmarco_index.tar.gz
2. Set your OpenAI API key:
   export OPENAI=<your key>
3. Run either hyde-dl19.ipynb or hyde-demo.ipynb.
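The downloaded archive must be unpacked before the notebooks can search it. The sketch below shows that preparation step under stated assumptions; the filename "index" inside the extracted directory is a hypothetical placeholder, so check the actual contents after extraction.

```python
# Illustrative only: exact filenames inside the prebuilt index tarball may differ.
import os
import tarfile

import faiss

archive = "contriever_msmarco_index.tar.gz"
extract_dir = "contriever_msmarco_index"

if not os.path.isdir(extract_dir):
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(".")  # unpack the prebuilt Contriever MS MARCO index

# Hypothetical filename; inspect the extracted directory for the real index file.
index = faiss.read_index(os.path.join(extract_dir, "index"))
print(index.ntotal, "documents indexed")
```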
Maintenance & Community
The project accompanies the paper "Precise Zero-Shot Dense Retrieval without Relevance Labels" by Luyu Gao, Xueguang Ma, Jimmy Lin, and Jamie Callan, which introduces HyDE. Further community engagement details are not provided in the README.
Licensing & Compatibility
The README does not explicitly state a license. The code is provided for research purposes, and commercial use or closed-source linking compatibility is not specified.
Limitations & Caveats
The method relies on GPT-3, which requires paid API access. Retrieval quality depends on the hypothetical documents the model generates and on the underlying Contriever encoder.