SearchPaperByEmbedding  by gyj155

Semantic search for academic papers

Created 2 weeks ago

314 stars

Top 85.7% on SourcePulse

Project Summary

This repository provides a tool for semantic search over academic papers: users find related research by comparing embedding vectors rather than matching keywords. It targets researchers and engineers who need to discover relevant literature efficiently, and it supports both free local embedding models and the paid, higher-quality OpenAI embedding API.

How It Works

The system converts paper titles and abstracts into vector embeddings, which are then cached for performance. User queries (either example papers or text descriptions) are embedded using the same model. Cosine similarity is employed to calculate the relevance between the query embedding and the cached paper embeddings, ranking results by similarity score.
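The ranking step described above can be illustrated with a minimal sketch. It is not the repository's code: the function names `cosine_similarity` and `rank_papers` are hypothetical, and the three-dimensional vectors are toy stand-ins for real model embeddings, which typically have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rank_papers(query_vec, paper_vecs, top_k=3):
    """Return the top_k (title, score) pairs, highest similarity first."""
    scored = [
        (title, cosine_similarity(query_vec, vec))
        for title, vec in paper_vecs.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy embeddings (assumption: real ones would come from the embedding model).
papers = {
    "Paper A": [1.0, 0.0, 0.0],
    "Paper B": [0.6, 0.8, 0.0],
    "Paper C": [0.0, 0.0, 1.0],
}
results = rank_papers([1.0, 0.0, 0.0], papers, top_k=2)
for title, score in results:
    print(f"{title}: {score:.3f}")
```

Because cosine similarity depends only on direction, not vector length, a query and a paper score highly whenever their embeddings point the same way, regardless of magnitude.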

Quick Start & Requirements

  • Installation: pip install -r requirements.txt
  • Prerequisites: Python environment, listed dependencies. An OpenAI API key is required for the OpenAI model.
  • Usage:
    1. Crawl papers from sources like OpenReview using crawl_papers.
    2. Initialize PaperSearcher with paper data and select model_type='local' or model_type='openai'.
    3. Compute embeddings and perform searches via searcher.search().
    4. Results can be displayed or saved.
  • Links: No official documentation or demo links provided.

Highlighted Details

  • Supports fetching papers directly from OpenReview.
  • Offers choice between free, local embedding models (e.g., all-MiniLM-L6-v2) and paid, higher-fidelity OpenAI embeddings.
  • Features automatic caching of computed embeddings to accelerate subsequent searches.
  • Search can be initiated using one or more example papers or a descriptive text query.
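
The summary does not describe how the embedding cache is keyed or stored, so the following is only one plausible approach, sketched under stated assumptions: the cache file name is derived from a hash of the paper data plus the model name, vectors are stored as JSON, and `cache_key`, `load_or_compute`, and `fake_embed` are hypothetical names. `fake_embed` is a placeholder, not a real embedding model.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cache_key(papers, model_name):
    """Hash the paper data and model name so a change in either misses the cache."""
    payload = json.dumps({"model": model_name, "papers": papers}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def load_or_compute(papers, model_name, embed_fn, cache_dir):
    """Return (vectors, cache_hit). Embeddings are computed only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"embeddings_{cache_key(papers, model_name)}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8")), True
    # Embed title and abstract together, as the summary describes.
    vectors = [embed_fn(p["title"] + " " + p["abstract"]) for p in papers]
    path.write_text(json.dumps(vectors), encoding="utf-8")
    return vectors, False


def fake_embed(text):
    """Placeholder for a real model call; NOT a meaningful embedding."""
    return [float(len(text)), float(text.count(" "))]


demo_papers = [{"title": "T1", "abstract": "about graphs"}]
with tempfile.TemporaryDirectory() as tmp:
    first, hit_first = load_or_compute(demo_papers, "toy-model", fake_embed, tmp)
    second, hit_second = load_or_compute(demo_papers, "toy-model", fake_embed, tmp)
    print(hit_first, hit_second)
```

Including the model name in the key matters because vectors from different models are not comparable; reusing a cache built with one model for queries embedded by another would produce meaningless scores.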

Maintenance & Community

No information regarding maintainers, community channels (like Discord/Slack), or project roadmap is present in the README.

Licensing & Compatibility

The README does not specify a software license. This lack of clarity presents a significant adoption blocker, particularly for commercial or closed-source integration.

Limitations & Caveats

Search quality depends on the chosen embedding model. Paper data must be prepared (for example, crawled from OpenReview) before any search can run. The missing license, noted above, remains the main caveat when evaluating adoption.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 0
  • Star history: 316 stars in the last 15 days
