wikipedia-semantic-search by upstash

Semantic search engine and RAG chatbot using Wikipedia data

Created 1 year ago

473 stars

Top 64.4% on SourcePulse

Project Summary

This project provides a semantic search engine and RAG chatbot built on Wikipedia data, targeting developers and researchers interested in vector databases and RAG applications. It demonstrates indexing millions of Wikipedia articles for efficient, cross-lingual semantic search and conversational AI.

How It Works

The system stores and queries millions of vector embeddings of Wikipedia articles in Upstash Vector. The embeddings are generated with the BGE-M3 model, which supports multilingual semantic search. The RAG chatbot is built with the Upstash RAG Chat SDK: chat sessions are persisted in Upstash Redis, and LLM calls go through the QStash LLM APIs to Meta-Llama-3-8B-Instruct.
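
To make the retrieval step concrete, here is a minimal in-memory sketch of top-k similarity search within a namespace. It is an illustration of the idea only, not the Upstash Vector API: the `StoredVector` type, the `queryNamespace` helper, the toy 3-dimensional vectors, and the use of cosine similarity as the metric are all assumptions made for this example. In the actual project, embedding and search are handled by Upstash Vector.

```typescript
// Illustrative only: an in-memory stand-in for a namespaced vector index.
// The real project delegates storage and search to Upstash Vector.

type StoredVector = { id: string; values: number[]; title: string };
type Match = { id: string; title: string; score: number };

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical helper: return the topK most similar vectors in one namespace.
function queryNamespace(
  index: Map<string, StoredVector[]>,
  namespace: string,
  queryVector: number[],
  topK: number,
): Match[] {
  const vectors = index.get(namespace) ?? [];
  return vectors
    .map((v) => ({
      id: v.id,
      title: v.title,
      score: cosineSimilarity(queryVector, v.values),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

// Toy 3-dimensional embeddings; real BGE-M3 dense vectors are much larger.
const toyIndex = new Map<string, StoredVector[]>([
  [
    "en",
    [
      { id: "en-1", values: [1, 0, 0], title: "Solar System" },
      { id: "en-2", values: [0, 1, 0], title: "Jazz" },
      { id: "en-3", values: [0.9, 0.1, 0], title: "Planet" },
    ],
  ],
  ["de", [{ id: "de-1", values: [1, 0, 0], title: "Sonnensystem" }]],
]);

const matches = queryNamespace(toyIndex, "en", [1, 0, 0], 2);
console.log(matches.map((m) => m.title)); // [ 'Solar System', 'Planet' ]
```

Because a multilingual model places text from different languages in one shared embedding space, the same query vector could also be run against the `de` namespace, which is the intuition behind cross-lingual search.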

Quick Start & Requirements

Install dependencies: pnpm install
Run development server: pnpm dev
Prerequisites: Upstash Vector database (with BGE-M3 model), Upstash Redis database, QStash credentials.
Configuration: Requires a .env file with UPSTASH_VECTOR_REST_URL, UPSTASH_VECTOR_REST_TOKEN, UPSTASH_REDIS_REST_TOKEN, UPSTASH_REDIS_REST_URL, and QSTASH_TOKEN.
Data Indexing: Vectors must be upserted into appropriate namespaces (e.g., en for English).
Live Demo: https://wikipedia-semantic-search.upstash.dev/
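
Since the setup depends on five environment variables, a small startup check can catch a missing credential early. The sketch below uses the variable names listed under Configuration; the `findMissingEnvVars` helper and the placeholder values are hypothetical and not part of the project.

```typescript
// Variable names are taken from the Configuration line above.
const REQUIRED_ENV_VARS = [
  "UPSTASH_VECTOR_REST_URL",
  "UPSTASH_VECTOR_REST_TOKEN",
  "UPSTASH_REDIS_REST_URL",
  "UPSTASH_REDIS_REST_TOKEN",
  "QSTASH_TOKEN",
] as const;

// Hypothetical helper: list required variables that are unset or blank.
function findMissingEnvVars(
  env: Record<string, string | undefined>,
): string[] {
  return REQUIRED_ENV_VARS.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}

// Example with a partially filled environment (placeholder values only).
const exampleEnv: Record<string, string | undefined> = {
  UPSTASH_VECTOR_REST_URL: "https://example-vector.invalid",
  UPSTASH_VECTOR_REST_TOKEN: "placeholder-token",
  UPSTASH_REDIS_REST_URL: "",
};

console.log(findMissingEnvVars(exampleEnv));
// [ 'UPSTASH_REDIS_REST_URL', 'UPSTASH_REDIS_REST_TOKEN', 'QSTASH_TOKEN' ]
```

In a Node.js process the same helper could be called with `process.env` before the server starts, failing fast when the returned list is not empty.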

Highlighted Details

Indexed over 144 million vectors from Wikipedia articles across 11 languages.
Uses the BGE-M3 embedding model for multilingual support.
Implements semantic search with cross-lingual querying capabilities.
Features a RAG chatbot powered by Upstash RAG Chat SDK and Meta-Llama-3-8B-Instruct.
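
The retrieval-augmented step of a chatbot like this one can be pictured as: retrieve passages, then place them in the prompt sent to the LLM. The following is a generic sketch of that prompt assembly, not the Upstash RAG Chat SDK's implementation; the `Passage` type, the `buildRagPrompt` helper, the prompt wording, and the sample passages are all hypothetical.

```typescript
type Passage = { title: string; text: string; score: number };

// Hypothetical helper: build a grounded prompt from retrieved passages.
// Keeps the highest-scoring passages and numbers them for reference.
function buildRagPrompt(
  question: string,
  passages: Passage[],
  maxPassages = 3,
): string {
  const context = [...passages] // copy so the caller's array is not reordered
    .sort((a, b) => b.score - a.score)
    .slice(0, maxPassages)
    .map((p, i) => `[${i + 1}] ${p.title}: ${p.text}`)
    .join("\n");

  return [
    "Answer the question using only the context below.",
    "If the context is insufficient, say so.",
    "",
    "Context:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}

// Sample passages standing in for vector search results.
const retrieved: Passage[] = [
  { title: "Jazz", text: "Jazz is a music genre.", score: 0.12 },
  { title: "Planet", text: "A planet orbits a star.", score: 0.91 },
  { title: "Solar System", text: "The Solar System has eight planets.", score: 0.97 },
];

console.log(buildRagPrompt("How many planets are there?", retrieved, 2));
```

Capping the number of passages keeps the prompt within the model's context window; the cap of 3 used as the default here is an arbitrary choice for illustration.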

Maintenance & Community

The project is maintained by Upstash. Contributions are welcome via issues and pull requests. Further contact information can be found in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the provided README. Compatibility for commercial use or closed-source linking would require clarification on the licensing terms.

Limitations & Caveats

The project relies heavily on Upstash services, potentially creating vendor lock-in. The setup requires obtaining and configuring credentials for multiple Upstash services. The README does not detail performance benchmarks or specific hardware requirements beyond the need for Upstash service access.
