semantra by freedmand

CLI tool for semantic document search

Created 2 years ago
2,654 stars

Top 17.8% on SourcePulse

Project Summary

Semantra is a command-line tool for semantic document search, enabling users to query text and PDF files by meaning rather than exact keyword matching. It's designed for individuals like journalists, researchers, and students who need to efficiently find information within large document sets, offering a private, configurable, and user-friendly experience.

How It Works

Semantra analyzes documents by converting text into numerical embeddings using transformer models. These embeddings capture semantic meaning, allowing for searches based on conceptual similarity. The tool then launches a local web interface for interactive querying, where results are ranked by relevance and can be refined using positive or negative feedback on specific snippets. This approach prioritizes direct interaction with source material over generative AI summaries.
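
The ranking and feedback steps described above can be illustrated with a minimal sketch. It is not Semantra's actual code: the three-dimensional vectors stand in for real transformer embeddings, and the function names (cosine, rank, apply_feedback), the snippet ids, and the feedback weight are all hypothetical. Feedback is modelled here as simple vector arithmetic on the query, which is an assumption about one plausible approach rather than a description of the tool's internals.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(query_vec, snippets):
    """Return snippet ids sorted from most to least similar to the query."""
    scored = [(cosine(query_vec, vec), sid) for sid, vec in snippets.items()]
    return [sid for _, sid in sorted(scored, reverse=True)]

def apply_feedback(query_vec, liked=(), disliked=(), weight=0.5):
    """Shift the query toward liked vectors and away from disliked ones.

    Assumption: a simple additive adjustment; the weight is illustrative.
    """
    adjusted = list(query_vec)
    for vec in liked:
        adjusted = [q + weight * v for q, v in zip(adjusted, vec)]
    for vec in disliked:
        adjusted = [q - weight * v for q, v in zip(adjusted, vec)]
    return adjusted

# Toy "embeddings" (hypothetical values; real ones have hundreds of dimensions).
snippets = {
    "budget_report": [0.9, 0.1, 0.0],
    "tax_memo": [0.7, 0.3, 0.1],
    "picnic_invite": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.0]

# Initial ranking by similarity to the query.
ranking = rank(query, snippets)

# Negative feedback on one snippet moves the query away from it.
refined = rank(apply_feedback(query, disliked=[snippets["budget_report"]]), snippets)
```

In this toy data, marking the top snippet as unwanted is enough to move a related snippet above it, while the unrelated snippet stays last. Real embeddings behave less predictably, so treat the numbers as illustrative only.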

Quick Start & Requirements

  • Install: python3 -m pipx install semantra (requires Python >= 3.9 and pipx).
  • Prerequisites: The first run downloads a local machine learning model (several hundred MB).
  • Usage: semantra doc.pdf or semantra file1.txt file2.pdf.
  • Resources: Tutorial: https://github.com/freedmand/semantra#tutorial

Highlighted Details

  • Supports local embedding models (e.g., minilm, mpnet) or OpenAI's API.
  • Configurable embedding window sizes and overlap for text processing.
  • Optional Annoy indexing for faster approximate nearest neighbor search.
  • Web interface allows positive/negative feedback to sculpt future queries.
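
The window size and overlap settings mentioned above can be pictured with a short sketch. It assumes whitespace-separated words as tokens, whereas an embedding model would use its own tokenizer; the function name windows and the values size=4 and overlap=2 are hypothetical and are not Semantra's defaults or API.

```python
def windows(tokens, size, overlap):
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far each window advances
    result = []
    for start in range(0, len(tokens), step):
        result.append(tokens[start:start + size])
        # Stop once a window has reached the end of the text.
        if start + size >= len(tokens):
            break
    return result

# Illustrative input: words stand in for model tokens.
tokens = "the quick brown fox jumps over the lazy dog".split()
chunks = windows(tokens, size=4, overlap=2)
```

Each window would then be embedded separately. Overlap means a phrase that straddles a window boundary still appears whole in at least one window, at the cost of more embeddings to compute and store.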

Maintenance & Community

Contributions are welcome; issues and feature requests can be submitted via GitHub.

Licensing & Compatibility

The project is licensed under the MIT License, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

Semantra does not utilize generative AI models like ChatGPT, focusing solely on semantic search and presenting raw results. The initial document processing can be time-consuming.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 16 stars in the last 30 days
