WordLlama by dleemiller

NLP toolkit for leveraging LLM token embeddings

created 1 year ago
1,445 stars

Top 28.9% on sourcepulse

Project Summary

WordLlama is a lightweight NLP toolkit for efficient text similarity, deduplication, ranking, and clustering, optimized for CPU usage. It leverages recycled token embeddings from large language models (LLMs) to provide fast, compact word representations, making it ideal for resource-constrained environments and rapid prototyping.

How It Works

WordLlama extracts token embedding codebooks from LLMs (e.g., LLaMA 2, LLaMA 3) and trains a small, context-less model using average pooling. This approach yields compact embeddings (e.g., 16MB for a 256-dimensional model) that outperform traditional models like GloVe on MTEB benchmarks. It supports Matryoshka Representations for flexible dimension truncation and binary embeddings with Hamming similarity for accelerated computations.
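
The truncation and binary modes described above can be sketched roughly as follows. This is a minimal sketch, not the definitive API: the trunc_dim and binary keyword arguments and the embed/similarity method names are assumptions based on the usage patterns the project describes, and should be verified against the current docs.

    from wordllama import WordLlama

    # Matryoshka truncation: load the model at a reduced embedding dimension.
    # trunc_dim is an assumed keyword argument; verify against the project docs.
    wl_64 = WordLlama.load(trunc_dim=64)
    vectors = wl_64.embed(["fast, compact sentence embeddings"])
    print(vectors.shape)  # expected (1, 64) if truncation applies as described

    # Binary mode: embeddings are packed into bits and compared with Hamming
    # similarity, trading some accuracy for faster comparisons.
    wl_bin = WordLlama.load(trunc_dim=64, binary=True)
    print(wl_bin.similarity("i went to the car", "i went to the vehicle"))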

Quick Start & Requirements

  • Install via pip: pip install wordllama
  • Load default model: from wordllama import WordLlama; wl = WordLlama.load()
  • CPU optimized, no GPU required.
  • Official Docs: https://github.com/dleemiller/wordllama
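
A short end-to-end sketch expanding the steps above. The rank, deduplicate, and cluster methods and their parameters mirror the usage the project advertises, but they are assumptions here; check names and signatures against the repository README before relying on them.

    from wordllama import WordLlama

    # Load the default 256-dimension model (weights download on first use).
    wl = WordLlama.load()

    # Pairwise similarity between two strings.
    print(wl.similarity("i went to the car", "i went to the park"))

    # Rank candidate documents against a query.
    query = "i went to the car"
    candidates = ["i went to the park", "i went to the shop", "i went to the truck"]
    print(wl.rank(query, candidates))

    # Fuzzy deduplication and k-means clustering over the same list;
    # the threshold and k values are illustrative only.
    print(wl.deduplicate(candidates, threshold=0.8))
    print(wl.cluster(candidates, k=2))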

Highlighted Details

  • Achieves competitive performance on MTEB benchmarks, outperforming GloVe 300d with significantly smaller model sizes.
  • Features Matryoshka Representations for adjustable embedding dimensions.
  • Supports binary embeddings for fast Hamming distance calculations.
  • Numpy-only inference pipeline for easy deployment.
  • Includes functionality for semantic text splitting.
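
For the semantic text splitting noted in the last bullet, a hedged sketch using a split method in the spirit of the project's README; the target_size parameter name and its unit (approximate characters per chunk) are assumptions.

    from wordllama import WordLlama

    wl = WordLlama.load()

    long_text = (
        "WordLlama recycles token embeddings from large language models. "
        "It pools them into small, fast sentence vectors. "
        "Semantic splitting groups sentences into coherent chunks for retrieval."
    )

    # Split the text into semantically coherent chunks; target_size is an
    # assumed parameter controlling the approximate chunk size.
    for chunk in wl.split(long_text, target_size=120):
        print(chunk)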

Maintenance & Community

  • Active development with recent updates in early 2025.
  • Community support via a Hugging Face Space and Gradio demo.
  • Citation provided for research use.

Licensing & Compatibility

  • MIT License.
  • Permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

The project is primarily focused on CPU performance; GPU acceleration is not explicitly detailed. While MTEB results are provided, direct comparisons to the latest state-of-the-art embedding models are not always present.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 10 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Didier Lopes (founder of OpenBB), and 11 more.

sentence-transformers by UKPLab

Framework for text embeddings, retrieval, and reranking
Top 0.2% on sourcepulse · 17k stars · created 6 years ago · updated 3 days ago