lunr-languages by MihaiValentin

Fast, multilingual search for AI and edge applications

Created 12 years ago

454 stars

Top 65.6% on SourcePulse

Project Summary

Lunr Languages provides a collection of language stemmers and stopword lists for the Lunr.js JavaScript search library, enabling fast, multilingual full-text search. It serves developers building search into AI, RAG, and local-first applications and static sites, offering a lightweight, zero-infrastructure retrieval layer for supplying context to LLMs.

How It Works

This project extends Lunr.js by integrating language-specific tokenization, stemming, and stopword filtering for over 30 languages. Its core advantage lies in delivering efficient, consistent lexical retrieval without requiring external databases or complex infrastructure, making it ideal for client-side or Node.js environments. Advanced Chinese tokenization leverages Intl.Segmenter for browser compatibility and offers optional integration with @node-rs/jieba in Node.js for improved segmentation quality.
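The three stages named above (tokenization, stopword filtering, stemming) can be illustrated with a deliberately tiny sketch. This is not the library's code: the stopword list, the `toyStem` suffix stripper, and the `analyze` function are hypothetical stand-ins, and real stemmers apply far more careful rules.

```javascript
// Hypothetical, tiny German stopword list, for illustration only.
const STOPWORDS = new Set(['der', 'die', 'das', 'und', 'ist']);

// Naive suffix stripper standing in for a real stemmer (an assumption,
// not the algorithm the library uses).
function toyStem(token) {
  for (const suffix of ['en', 'er', 'e', 's']) {
    // Only strip when a reasonably long stem would remain.
    if (token.length > suffix.length + 3 && token.endsWith(suffix)) {
      return token.slice(0, -suffix.length);
    }
  }
  return token;
}

// Tokenize on non-letter/non-digit characters, drop stopwords, then stem.
function analyze(text) {
  return text
    .toLowerCase()
    .split(/[^\p{L}\p{N}]+/u)
    .filter((token) => token.length > 0 && !STOPWORDS.has(token))
    .map(toyStem);
}
```

With this sketch, the singular "Katze" and the plural "Katzen" reduce to the same term, which is why stemming helps a query in one inflected form match documents containing another.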

Quick Start & Requirements

Installation: npm install lunr-languages
Prerequisites: Node.js or modern browser environment. Chinese tokenization in browsers requires Intl.Segmenter support. For enhanced Chinese segmentation in Node.js, install @node-rs/jieba.
Links: Usage examples provided in the README serve as a quick start guide.
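A minimal Node.js sketch of wiring one language into a Lunr index, following the pattern shown in the project's README. It assumes both `lunr` and `lunr-languages` are installed; the helper name `buildGermanIndex` and the `id`, `title`, and `body` field names are hypothetical choices for this example.

```javascript
// Assumes: npm install lunr lunr-languages
// buildGermanIndex is a hypothetical helper name, not part of either package.
function buildGermanIndex(docs) {
  const lunr = require('lunr');
  // The stemmer support module must be registered before any language module.
  require('lunr-languages/lunr.stemmer.support')(lunr);
  require('lunr-languages/lunr.de')(lunr);

  return lunr(function () {
    // Swap in the German stemmer and stopword filter.
    this.use(lunr.de);
    this.ref('id');
    this.field('title');
    this.field('body');
    // Arrow function keeps `this` bound to the Lunr builder.
    docs.forEach((doc) => this.add(doc));
  });
}
```

Calling `buildGermanIndex(docs).search('katzen')` then returns Lunr's usual list of scored matches, each carrying the `ref` of a matching document.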

Highlighted Details

Supports 30+ languages with dedicated stemmers and stopwords.
Functions as a lightweight retrieval layer for AI systems, including RAG and hybrid search.
Operates entirely client-side or in Node.js, requiring zero infrastructure.
Improves search recall and precision for non-English, inflected, or mixed-language datasets.
Offers robust Chinese tokenization options for different environments.
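For mixed-language datasets, the README describes a multi-language plugin. The sketch below assumes the `lunr.multi` module and the `lunr.multiLanguage` helper behave as documented there; `buildMixedIndex` and the field names are hypothetical.

```javascript
// Assumes: npm install lunr lunr-languages
// buildMixedIndex is a hypothetical helper name used only in this sketch.
function buildMixedIndex(docs) {
  const lunr = require('lunr');
  require('lunr-languages/lunr.stemmer.support')(lunr);
  require('lunr-languages/lunr.de')(lunr);
  // The multi plugin combines the pipelines of several languages.
  require('lunr-languages/lunr.multi')(lunr);

  return lunr(function () {
    // English is built into Lunr itself; German comes from lunr-languages.
    this.use(lunr.multiLanguage('en', 'de'));
    this.ref('id');
    this.field('body');
    docs.forEach((doc) => this.add(doc));
  });
}
```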

Maintenance & Community

Maintained as an open-source project for over a decade, the project seeks sponsorship or contributions to ensure continued stability and development. No specific community channels (e.g., Discord, Slack) are listed.

Licensing & Compatibility

The license type is not explicitly stated in the provided README content, which may impact commercial adoption or integration. The library is designed for browser and Node.js environments.

Limitations & Caveats

Chinese tokenization in browsers depends on Intl.Segmenter being available, and no fallback is bundled. In Node.js, falling back to Intl.Segmenter (when @node-rs/jieba is not installed) may segment Chinese text less precisely. The absence of a clearly stated license is a further caveat for adoption.
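Because no fallback is bundled, an application may want to check for Intl.Segmenter before relying on Chinese tokenization. The sketch below uses only the standard `Intl.Segmenter` API; the function name `segmentChineseWords` and the choice to return `null` when the API is missing are assumptions made for this example, not library behavior.

```javascript
// Hypothetical helper: returns an array of word-like segments, or null
// when Intl.Segmenter is unavailable so the caller can decide what to do.
function segmentChineseWords(text) {
  if (typeof Intl === 'undefined' || typeof Intl.Segmenter !== 'function') {
    return null;
  }
  const segmenter = new Intl.Segmenter('zh', { granularity: 'word' });
  const words = [];
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    // Skip punctuation and whitespace segments.
    if (isWordLike) words.push(segment);
  }
  return words;
}
```

A caller that receives `null` could disable Chinese search in that environment or show a notice, rather than indexing unsegmented text.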

Explore Similar Projects

yacy_expert by yacy

nixiesearch by nixiesearch

vectordb by kagisearch

mcp-local-rag by shinpr

chatWeb by SkywalkerDarren

rag-search by thinkany-ai

trieve by devflowinc

orama by oramasearch

searchkick by ankane

lancedb by lancedb

qdrant by qdrant

meilisearch by meilisearch