hyde by texttron

Research paper code for zero-shot dense retrieval

Created 3 years ago

565 stars

Top 56.9% on SourcePulse

Project Summary

HyDE is a zero-shot dense retrieval method. It prompts GPT-3 to generate a synthetic document for each query, encodes that document with Contriever, and uses the resulting vector for embedding-space search. The approach needs no human-labeled relevance data, and the accompanying paper reports better retrieval performance across tasks and languages than the unsupervised Contriever baseline.

How It Works

HyDE instructs GPT-3 to generate a plausible, yet fictional, document for a given query. This synthetic document is then encoded by the unsupervised Contriever model. The resulting embedding, rather than an embedding of the raw query, is used to search the corpus index for the nearest real documents, so retrieval requires no relevance judgments. The method relies on the generative capabilities of large language models to produce a richer representation for retrieval than the query alone.
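The generate, encode, and search steps described above can be sketched in a few lines. This is a minimal illustration, not the repository's code: `fake_llm` is a hypothetical stand-in for a GPT-3 call, `toy_encode` is a hashed bag-of-words stand-in for Contriever, and a brute-force scan replaces the FAISS index. All names here are assumptions made for the example.

```python
import math
import re
import zlib
from typing import Callable, List, Tuple

DIM = 256  # size of the toy embedding space (arbitrary choice for this sketch)


def toy_encode(text: str) -> List[float]:
    """Stand-in for Contriever: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def hyde_search(
    query: str,
    generate: Callable[[str], str],
    encode: Callable[[str], List[float]],
    corpus: List[str],
    k: int = 3,
) -> List[Tuple[float, str]]:
    """Generate a synthetic document, embed it, and rank real documents."""
    hypothetical = generate(query)   # step 1: LLM writes a plausible document
    q_vec = encode(hypothetical)     # step 2: embed the synthetic document
    # Step 3: inner-product search. A real setup would query a prebuilt
    # index instead of re-encoding the corpus on every call.
    scored = [
        (sum(a * b for a, b in zip(q_vec, encode(doc))), doc) for doc in corpus
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def fake_llm(query: str) -> str:
    """Hypothetical generator returning canned text instead of calling an API."""
    return "Wisdom teeth usually take a few weeks to fully emerge through the gums."


corpus = [
    "Wisdom teeth typically emerge through the gums between ages 17 and 25.",
    "The stock market closed higher on Friday.",
    "Contriever is trained with contrastive learning.",
]
results = hyde_search(
    "how long does it take wisdom teeth to come in", fake_llm, toy_encode, corpus, k=1
)
print(results[0][1])
```

The generated text never appears in the results; it only supplies the search vector, so any invented details in it affect ranking but are not returned to the user.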

Quick Start & Requirements

Install pyserini: https://github.com/castorini/pyserini
Download Contriever FAISS index: wget https://www.dropbox.com/s/dytqaqngaupp884/contriever_msmarco_index.tar.gz
Set GPT-3 API key: export OPENAI=<your key>
Run experiments: hyde-dl19.ipynb or hyde-demo.ipynb
Requires Python, OpenAI API access, and the Contriever FAISS index.
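Before opening the notebooks, it can help to confirm that the API key and the index are in place. The following is a small pre-flight sketch under stated assumptions: `missing_prerequisites` is a hypothetical helper, and `contriever_msmarco_index` is an assumed name for the directory produced by extracting the downloaded archive.

```python
import os
from typing import List, Mapping


def missing_prerequisites(env: Mapping[str, str], index_dir: str) -> List[str]:
    """Return a list of setup steps that still look incomplete."""
    problems = []
    # The quick start exports the key under the name OPENAI.
    if not env.get("OPENAI"):
        problems.append("OPENAI environment variable is not set")
    if not os.path.isdir(index_dir):
        problems.append(f"index directory not found: {index_dir}")
    return problems


if __name__ == "__main__":
    # Assumed extraction path; adjust to wherever the archive was unpacked.
    for problem in missing_prerequisites(os.environ, "contriever_msmarco_index"):
        print("missing:", problem)
```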

Highlighted Details

Outperforms Contriever across tasks and languages in zero-shot settings.
Eliminates the need for human-labeled relevance data.
Leverages GPT-3 for synthetic document generation and Contriever for retrieval.

Maintenance & Community

The project accompanies the paper "Precise Zero-Shot Dense Retrieval without Relevance Labels" by Luyu Gao, Xueguang Ma, Jimmy Lin, and Jamie Callan. The README provides no further details on community engagement.

Licensing & Compatibility

The README does not explicitly state a license. The code is provided for research purposes; terms for commercial use or for linking from closed-source software are not specified.

Limitations & Caveats

The method relies on GPT-3, which requires API access and incurs usage costs. Retrieval quality depends on both the quality of the synthetic documents GPT-3 generates and the effectiveness of the Contriever encoder.

Health Check

Last Commit

1 year ago

Responsiveness

Inactive

Star History

6 stars in the last 30 days