wikipedia-semantic-search by upstash

Semantic search engine and RAG chatbot using Wikipedia data

created 1 year ago
469 stars

Top 65.7% on sourcepulse

Project Summary

This project provides a semantic search engine and RAG chatbot built on Wikipedia data, targeting developers and researchers interested in vector databases and RAG applications. It demonstrates indexing millions of Wikipedia articles for efficient, cross-lingual semantic search and conversational AI.

How It Works

The system stores and queries millions of vector embeddings, generated from Wikipedia articles, in Upstash Vector. The embeddings come from the BGE-M3 model, which supports multilingual semantic search. The RAG chatbot is built with the Upstash RAG Chat SDK: chat sessions are persisted in Upstash Redis, and LLM calls go through the QStash LLM APIs to Meta-Llama-3-8B-Instruct.
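The retrieve-then-prompt flow described above can be illustrated with a toy, in-memory sketch. This is not the Upstash Vector or RAG Chat SDK API: `ToyIndex`, `cosineSimilarity`, and `buildRagPrompt` are hypothetical names, the vectors are hand-written stand-ins for BGE-M3 embeddings, and cosine similarity is assumed as the ranking metric purely for illustration.

```typescript
// Toy stand-in for a namespaced vector index. All names here are hypothetical;
// a real deployment would call Upstash Vector instead of keeping data in memory.
type Doc = { id: string; text: string; vector: number[] };
type Hit = { doc: Doc; score: number };

// Cosine similarity between two vectors of equal dimension.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

class ToyIndex {
  private namespaces = new Map<string, Doc[]>();

  // Insert a document, replacing any existing document with the same id.
  upsert(namespace: string, doc: Doc): void {
    const docs = (this.namespaces.get(namespace) ?? []).filter((d) => d.id !== doc.id);
    docs.push(doc);
    this.namespaces.set(namespace, docs);
  }

  // Return the topK most similar documents within one namespace.
  query(namespace: string, vector: number[], topK: number): Hit[] {
    const docs = this.namespaces.get(namespace) ?? [];
    return docs
      .map((doc) => ({ doc, score: cosineSimilarity(vector, doc.vector) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, topK);
  }
}

// Assemble a prompt that grounds the LLM in the retrieved passages.
function buildRagPrompt(question: string, hits: Hit[]): string {
  const context = hits.map((h, i) => `[${i + 1}] ${h.doc.text}`).join("\n");
  return `Answer using only the context below.\n\nContext:\n${context}\n\nQuestion: ${question}`;
}
```

In the real system the query vector would come from embedding the user's question with the same model used at indexing time, and the assembled prompt would be sent to the LLM rather than returned as a string.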

Quick Start & Requirements

  • Install dependencies: pnpm install
  • Run development server: pnpm dev
  • Prerequisites: Upstash Vector database (with BGE-M3 model), Upstash Redis database, QStash credentials.
  • Configuration: Requires a .env file with UPSTASH_VECTOR_REST_URL, UPSTASH_VECTOR_REST_TOKEN, UPSTASH_REDIS_REST_TOKEN, UPSTASH_REDIS_REST_URL, and QSTASH_TOKEN.
  • Data Indexing: Vectors must be upserted into appropriate namespaces (e.g., en for English).
  • Live Demo: https://wikipedia-semantic-search.upstash.dev/
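Because the app depends on five environment variables, a missing one is easy to overlook until a request fails. The sketch below shows one way to validate them at startup. `loadConfig` and `AppConfig` are hypothetical helpers, not part of the repository; only the variable names come from the configuration list above.

```typescript
// Names of the environment variables listed in the configuration step.
const REQUIRED_ENV_VARS = [
  "UPSTASH_VECTOR_REST_URL",
  "UPSTASH_VECTOR_REST_TOKEN",
  "UPSTASH_REDIS_REST_URL",
  "UPSTASH_REDIS_REST_TOKEN",
  "QSTASH_TOKEN",
] as const;

type RequiredEnvVar = (typeof REQUIRED_ENV_VARS)[number];
type AppConfig = Record<RequiredEnvVar, string>;

// Hypothetical helper: fail fast with one message naming every missing variable.
function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const missing = REQUIRED_ENV_VARS.filter((name) => !env[name]?.trim());
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  const config = {} as AppConfig;
  for (const name of REQUIRED_ENV_VARS) {
    config[name] = env[name] as string;
  }
  return config;
}
```

In a Node.js process this would typically be called once at startup as `loadConfig(process.env)`, after the .env file has been loaded.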

Highlighted Details

  • Indexed over 144 million vectors from Wikipedia articles across 11 languages.
  • Utilizes BGE-M3 embedding model for robust multilingual support.
  • Implements semantic search with cross-lingual querying capabilities.
  • Features a RAG chatbot powered by Upstash RAG Chat SDK and Meta-Llama-3-8B-Instruct.

Maintenance & Community

The project is maintained by Upstash. Contributions are welcome via issues and pull requests. Further contact information can be found in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the provided README. Compatibility for commercial use or closed-source linking would require clarification on the licensing terms.

Limitations & Caveats

The project relies heavily on Upstash services, potentially creating vendor lock-in. The setup requires obtaining and configuring credentials for multiple Upstash services. The README does not detail performance benchmarks or specific hardware requirements beyond the need for Upstash service access.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 8 stars in the last 90 days
