SearchPaperByEmbedding by gyj155

Semantic search for academic papers

Created 2 months ago

374 stars

Top 75.8% on SourcePulse

Project Summary

This repository provides a semantic search tool for academic papers: it uses embedding models to find papers similar to a given example or description. It targets researchers and engineers who need to discover relevant literature efficiently, and it supports both free local models and higher-quality embeddings from the OpenAI API.

How It Works

The system converts each paper's title and abstract into a vector embedding and caches the result so that later searches skip recomputation. A query, given either as example papers or as a text description, is embedded with the same model. Papers are then ranked by the cosine similarity between the query embedding and each cached paper embedding.
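
The ranking step can be illustrated with a minimal sketch. This is not the repository's code: the function names (cosine_similarity, rank_papers) and the toy 3-dimensional vectors are assumptions made for illustration, and real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def rank_papers(query_vec, paper_vecs, top_k=3):
    """Return the top_k (title, score) pairs, highest similarity first."""
    scored = [
        (title, cosine_similarity(query_vec, vec))
        for title, vec in paper_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for cached paper embeddings (hypothetical data).
paper_vecs = {
    "Paper A": [1.0, 0.0, 0.0],
    "Paper B": [0.7, 0.7, 0.0],
    "Paper C": [0.0, 0.0, 1.0],
}

# The query vector points mostly along the first axis, like "Paper A".
ranked = rank_papers([1.0, 0.1, 0.0], paper_vecs, top_k=2)
for title, score in ranked:
    print(f"{title}: {score:.3f}")
```

Because cosine similarity depends only on direction, not on vector length, two embeddings of different magnitude can still score as near-identical.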

Quick Start & Requirements

Installation: pip install -r requirements.txt
Prerequisites: A Python environment with the dependencies listed in requirements.txt. An OpenAI API key is required when using model_type='openai'.
Usage:
1. Crawl papers from sources like OpenReview using crawl_papers.
2. Initialize PaperSearcher with paper data and select model_type='local' or model_type='openai'.
3. Compute embeddings, then run searches via searcher.search().
4. Display the results or save them.
Links: No official documentation or demo links provided.

Highlighted Details

Supports fetching papers directly from OpenReview.
Offers choice between free, local embedding models (e.g., all-MiniLM-L6-v2) and paid, higher-fidelity OpenAI embeddings.
Features automatic caching of computed embeddings to accelerate subsequent searches.
Search can be initiated using one or more example papers or a descriptive text query.
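
The caching and multi-example behaviour can be sketched as follows. This is an illustrative sketch under stated assumptions, not the project's implementation: the helper names (cache_key, load_or_compute, mean_vector, toy_embed), the JSON cache format, and the choice to average the example embeddings into a single query vector are all hypothetical. The summary does not say how the project stores its cache or combines multiple examples.

```python
import hashlib
import json
import os
import tempfile


def cache_key(model_name, texts):
    """Derive a short key from the model name and the exact input texts."""
    h = hashlib.sha256()
    h.update(model_name.encode("utf-8"))
    for t in texts:
        h.update(b"\x00")  # separator so ["ab", "c"] differs from ["a", "bc"]
        h.update(t.encode("utf-8"))
    return h.hexdigest()[:16]


def load_or_compute(texts, model_name, embed_fn, cache_dir):
    """Return cached embeddings if present; otherwise compute and store them."""
    path = os.path.join(cache_dir, f"emb_{cache_key(model_name, texts)}.json")
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    vectors = [[float(x) for x in embed_fn(t)] for t in texts]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(vectors, f)
    return vectors


def mean_vector(vectors):
    """Average several embeddings into one query vector (an assumed strategy)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


calls = []  # records each real embedding computation


def toy_embed(text):
    """Hypothetical stand-in for a real embedding model."""
    calls.append(text)
    return [float(len(text)), float(text.count(" "))]


texts = ["graph neural networks", "diffusion models for images"]
with tempfile.TemporaryDirectory() as cache_dir:
    first = load_or_compute(texts, "toy-model", toy_embed, cache_dir)
    second = load_or_compute(texts, "toy-model", toy_embed, cache_dir)  # cache hit

query = mean_vector(first)
print(len(calls), query)
```

Including the model name in the cache key matters because embeddings from different models are not comparable; switching from a local model to OpenAI embeddings should miss the cache rather than reuse stale vectors.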

Maintenance & Community

No information regarding maintainers, community channels (like Discord/Slack), or project roadmap is present in the README.

Licensing & Compatibility

The README does not specify a software license. Without one, the terms of reuse are undefined, which is a significant adoption blocker, particularly for commercial or closed-source integration.

Limitations & Caveats

Search quality depends on the chosen embedding model. Paper data must be crawled or otherwise prepared before any search can run. The missing license also needs to be resolved before adoption.

Explore Similar Projects

ICLR26_Paper_Finder by wenhangao21

awesome-document-similarity by malteos

Semantic-Retrieval-Models by caiyinqiong

similarity-search-kit by ZachNagengast

ScholarXIV by dagmawibabi

awesome-pretrained-models-for-information-retrieval by ict-bigdatalab

fiftyone-docs-search by voxel51

rag-fusion by Raudaschl

semantra by freedmand

PageIndex by VectifyAI

pyserini by castorini

paper-qa by Future-House