semantra by freedmand

CLI tool for semantic document search

created 2 years ago
2,634 stars

Top 18.3% on sourcepulse

Project Summary

Semantra is a command-line tool for semantic document search: it lets users query text and PDF files by meaning rather than by exact keyword match. It is aimed at journalists, researchers, students, and others who need to find information efficiently in large document sets, and it is designed to be private, configurable, and user-friendly.

How It Works

Semantra analyzes documents by converting text into numerical embeddings using transformer models. These embeddings capture semantic meaning, allowing for searches based on conceptual similarity. The tool then launches a local web interface for interactive querying, where results are ranked by relevance and can be refined using positive or negative feedback on specific snippets. This approach prioritizes direct interaction with source material over generative AI summaries.
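The ranking and feedback loop described above can be illustrated with a small sketch. This is not Semantra's code: `toy_embed` is a hypothetical stand-in (a hashed bag of words) for a real transformer embedding, and adding or subtracting snippet vectors from the query is only one plausible way to apply positive and negative feedback. The names `toy_embed`, `cosine`, `search`, and `DIM` are all assumptions made for illustration.

```python
import hashlib
import math
import re

DIM = 256  # toy vector size; real transformer embeddings are learned, not hashed


def toy_embed(text):
    """Hypothetical stand-in for a transformer model: a hashed bag of words."""
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        idx = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    return vec


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, snippets, positive=(), negative=()):
    """Rank snippets by similarity to the query, nudged by feedback snippets."""
    q = toy_embed(query)
    # Assumed feedback rule: move the query toward liked snippets...
    for text in positive:
        q = [x + y for x, y in zip(q, toy_embed(text))]
    # ...and away from disliked ones.
    for text in negative:
        q = [x - y for x, y in zip(q, toy_embed(text))]
    scored = [(cosine(q, toy_embed(s)), s) for s in snippets]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    docs = ["city budget", "rainfall totals", "budget hearing schedule"]
    for score, snippet in search("city budget", docs, negative=["rainfall totals"]):
        print(f"{score:.3f}  {snippet}")
```

With a real embedding model in place of `toy_embed`, snippets that share meaning but not vocabulary with the query would also score highly; the hashed version only matches shared words.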

Quick Start & Requirements

  • Install: python3 -m pipx install semantra (requires Python >= 3.9 and pipx).
  • Prerequisites: On first run, Semantra downloads a local machine learning model (several hundred MB).
  • Usage: semantra doc.pdf or semantra file1.txt file2.pdf.
  • Resources: Tutorial: https://github.com/freedmand/semantra#tutorial

Highlighted Details

  • Supports local embedding models (e.g., minilm, mpnet) or OpenAI's API.
  • Configurable embedding window sizes and overlap for text processing.
  • Optional Annoy indexing for faster approximate nearest neighbor search.
  • Web interface allows positive/negative feedback to sculpt future queries.
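To make the window size and overlap settings concrete, here is a minimal sketch of splitting a token sequence into overlapping windows before embedding. It assumes whitespace-separated tokens and a hypothetical helper named `make_windows`; Semantra's own tokenization and parameter names may differ.

```python
def make_windows(tokens, size, overlap):
    """Split tokens into windows of `size` items; neighbours share `overlap` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")
    step = size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        # Stop once a window reaches the end, so no redundant tail window is added.
        if start + size >= len(tokens):
            break
    return windows


if __name__ == "__main__":
    tokens = "semantic search finds passages by meaning not keywords".split()
    for window in make_windows(tokens, size=4, overlap=2):
        print(" ".join(window))
```

Larger windows give each embedding more context, while overlap reduces the chance that a relevant passage is cut in half at a window boundary. Both increase the amount of text that has to be embedded.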

Maintenance & Community

Contributions are welcome; issues and feature requests can be submitted via GitHub.

Licensing & Compatibility

The project is licensed under the MIT License, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

Semantra does not utilize generative AI models like ChatGPT, focusing solely on semantic search and presenting raw results. The initial document processing can be time-consuming.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 31 stars in the last 90 days
