KeyBERT by MaartenGr

Keyword extraction tool using BERT embeddings

created 4 years ago
3,957 stars

Top 12.6% on sourcepulse

Project Summary

KeyBERT is a minimal, easy-to-use library for extracting keywords and keyphrases from documents using BERT embeddings. It targets beginners and researchers who want a straightforward method that requires little setup.

How It Works

KeyBERT uses BERT embeddings to find the sub-phrases of a document that are most similar to the document as a whole. It first computes a document-level embedding, then computes embeddings for candidate N-grams (single words or short phrases) taken from the text. The N-grams with the highest cosine similarity to the document embedding are returned as keywords. Two optional techniques, Max Sum Distance and Maximal Marginal Relevance (MMR), diversify the results so that the returned keywords are not near-duplicates of each other.
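The pipeline described above can be sketched in plain Python. This is an illustrative toy, not KeyBERT's implementation: a bag-of-words count vector stands in for the BERT embedding, and the names `embed`, `candidate_ngrams`, `rank_by_similarity`, `mmr`, and the `STOP_WORDS` list are all hypothetical. The MMR step follows the usual formulation, trading relevance to the document against similarity to keywords already chosen.

```python
import math
import re
from collections import Counter

# Hypothetical mini stop-word list, just enough for the demo sentence.
STOP_WORDS = {"a", "an", "and", "from", "is", "of", "the", "to", "that", "using"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def embed(text):
    """Stand-in for a BERT embedding: a sparse bag-of-words count vector."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def candidate_ngrams(text, n_min=1, n_max=2):
    """Collect unique word N-grams in order of first appearance."""
    tokens = tokenize(text)
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            if gram not in grams:
                grams.append(gram)
    return grams


def rank_by_similarity(doc, candidates):
    """Score each candidate by cosine similarity to the document, best first."""
    doc_vec = embed(doc)
    scored = [(c, cosine(embed(c), doc_vec)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def mmr(doc, candidates, top_n=3, diversity=0.5):
    """Greedy MMR: balance relevance against similarity to already-picked keywords."""
    if not candidates:
        return []
    doc_vec = embed(doc)
    vecs = {c: embed(c) for c in candidates}
    relevance = {c: cosine(vecs[c], doc_vec) for c in candidates}
    selected = [max(candidates, key=relevance.get)]
    remaining = [c for c in candidates if c != selected[0]]
    while remaining and len(selected) < top_n:
        def score(c):
            redundancy = max(cosine(vecs[c], vecs[s]) for s in selected)
            return (1 - diversity) * relevance[c] - diversity * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


doc = ("Supervised learning is the machine learning task of learning a function "
       "that maps an input to an output from labeled training data.")
candidates = candidate_ngrams(doc)
print(rank_by_similarity(doc, candidates)[:3])       # plain cosine ranking
print(mmr(doc, candidates, top_n=3, diversity=0.7))  # diversified selection
```

With `diversity=0.0` the MMR selection reduces to the plain cosine ranking; raising it pushes later picks away from the keywords already selected.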

Quick Start & Requirements

  • Install via pip: pip install keybert
  • Optional backends are installed as pip extras: keybert[flair], keybert[gensim], keybert[spacy], keybert[use]
  • For a lightweight installation without PyTorch: pip install keybert --no-deps scikit-learn model2vec
  • Recommended embedding models: all-MiniLM-L6-v2 (English), paraphrase-multilingual-MiniLM-L12-v2 (multilingual).
  • Full documentation: https://github.com/MaartenGr/KeyBERT
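After installation, basic use might look like the sketch below. It assumes the `keybert` package and its default Sentence-Transformers backend are installed and that the model can be downloaded on first use. The wrapper function `extract_diverse_keywords` and its default values are hypothetical; `KeyBERT`, `extract_keywords`, and the keyword arguments shown are taken from the library's public interface.

```python
def extract_diverse_keywords(doc, top_n=5, diversity=0.7):
    """Return (keyword, score) pairs for one document, diversified with MMR."""
    # Imported lazily so this snippet can be loaded even where keybert is absent.
    from keybert import KeyBERT

    # all-MiniLM-L6-v2 is the English model recommended in the list above.
    kw_model = KeyBERT(model="all-MiniLM-L6-v2")
    return kw_model.extract_keywords(
        doc,
        keyphrase_ngram_range=(1, 2),  # consider single words and two-word phrases
        stop_words="english",
        use_mmr=True,                  # Maximal Marginal Relevance
        diversity=diversity,           # higher values favor more varied keywords
        top_n=top_n,
    )
```

Calling `extract_diverse_keywords("some document text")` returns a list of keyword and score pairs. For multilingual text, swapping in paraphrase-multilingual-MiniLM-L12-v2 as the model name is the natural change.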

Highlighted Details

  • Supports several embedding backends, including Sentence-Transformers, Flair, Gensim, spaCy, and Universal Sentence Encoder (USE).
  • Offers Max Sum Distance and Maximal Marginal Relevance (MMR) for result diversification.
  • Integrates with Large Language Models (LLMs) for keyword extraction, for example via OpenAI.
  • Minimal usage: install with pip install keybert, then extract keywords in about three lines of code.
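The Max Sum Distance option in the list above can be illustrated with a small sketch. It assumes the commonly described formulation: keep the `nr_candidates` phrases most similar to the document, then choose the `top_n` of them whose pairwise similarities sum to the smallest value. The function name, the similarity numbers, and the candidate phrases are all made up for illustration; this is not KeyBERT's own code.

```python
from itertools import combinations


def max_sum_distance(doc_sim, pair_sim, top_n=2, nr_candidates=4):
    """Pick top_n mutually dissimilar keywords from the best nr_candidates."""
    # Step 1: keep only the candidates most similar to the document.
    pool = sorted(doc_sim, key=doc_sim.get, reverse=True)[:nr_candidates]

    # Step 2: among all top_n-sized subsets, take the one whose members
    # are least similar to each other (smallest sum of pairwise similarity).
    def redundancy(combo):
        return sum(pair_sim[frozenset(pair)] for pair in combinations(combo, 2))

    return list(min(combinations(pool, top_n), key=redundancy))


# Made-up similarity of each candidate phrase to the document.
doc_sim = {
    "machine learning": 0.90,
    "learning algorithm": 0.85,
    "training data": 0.70,
    "labeled examples": 0.60,
    "the": 0.20,
}

# Made-up pairwise similarities between the four strongest candidates.
pair_sim = {
    frozenset({"machine learning", "learning algorithm"}): 0.80,
    frozenset({"machine learning", "training data"}): 0.30,
    frozenset({"machine learning", "labeled examples"}): 0.40,
    frozenset({"learning algorithm", "training data"}): 0.35,
    frozenset({"learning algorithm", "labeled examples"}): 0.45,
    frozenset({"training data", "labeled examples"}): 0.50,
}

print(max_sum_distance(doc_sim, pair_sim))
# → ['machine learning', 'training data']
```

A plain similarity ranking would return the two near-synonymous phrases "machine learning" and "learning algorithm"; the subset search replaces the second with a less redundant phrase. The search is exhaustive over subsets, so it only stays cheap while `nr_candidates` is small.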

Maintenance & Community

  • Developed by Maarten Grootendorst.
  • BibTeX citation available for academic use.
  • The maintainer welcomes suggestions of related papers or repositories to reference.

Licensing & Compatibility

  • License: MIT.
  • Compatible with commercial use and closed-source projects.

Limitations & Caveats

The library relies on pre-trained models, so keyword quality depends on how good the chosen embedding model is and how well it suits the input text. LLM integration through OpenAI requires an API key and incurs usage costs.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 129 stars in the last 90 days
