KeyBERT by MaartenGr

Keyword extraction tool using BERT embeddings

Created 5 years ago

4,074 stars

Top 12.0% on SourcePulse

1 Expert Loves This Project

Philipp Schmid

DevRel at Google DeepMind

Project Summary

KeyBERT provides a minimal and easy-to-use library for extracting keywords and keyphrases from documents using BERT embeddings. It is designed for beginners and researchers looking for a straightforward, powerful method that requires minimal setup.

How It Works

KeyBERT uses BERT embeddings to find the sub-phrases in a document that are most similar to the document as a whole. It first generates a single embedding for the document, then embeds the candidate N-grams (words and phrases) drawn from the text. The N-grams with the highest cosine similarity to the document embedding are returned as the keywords. Two optional strategies, Max Sum Distance and Maximal Marginal Relevance (MMR), diversify the results so that near-duplicate phrases do not crowd the top of the list.
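The ranking step can be illustrated with a minimal sketch. It assumes a toy bag-of-words count vector in place of real BERT embeddings so that it runs with only the standard library, and it does not filter stop words. The helper names (toy_embed, candidate_ngrams, rank_keywords) are illustrative and are not part of KeyBERT.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for a BERT encoder (assumption): a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def candidate_ngrams(text, n_min=1, n_max=2):
    """Collect the unique word n-grams of length n_min..n_max."""
    words = re.findall(r"[a-z]+", text.lower())
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(words) - n + 1):
            grams.add(" ".join(words[i:i + n]))
    return sorted(grams)


def rank_keywords(text, top_n=3, embed=toy_embed):
    """Rank candidate n-grams by cosine similarity to the whole document."""
    doc_vec = embed(text)
    scored = [(g, cosine(embed(g), doc_vec)) for g in candidate_ngrams(text)]
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


doc = (
    "Supervised learning maps inputs to outputs. "
    "Supervised learning needs labeled examples."
)
keywords = rank_keywords(doc)
for phrase, score in keywords:
    print(f"{phrase}: {score:.3f}")
```

Swapping toy_embed for a function that returns dense sentence embeddings (and adjusting cosine to work on dense vectors) would bring the sketch closer to the approach described above.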

Quick Start & Requirements

Install via pip: pip install keybert
Optional backends: keybert[flair], keybert[gensim], keybert[spacy], keybert[use]
For PyTorch-free installation: pip install keybert --no-deps scikit-learn model2vec
Recommended embedding models: all-MiniLM-L6-v2 (English), paraphrase-multilingual-MiniLM-L12-v2 (multilingual).
Full documentation: https://github.com/MaartenGr/KeyBERT
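
A short usage sketch, assuming keybert is installed and the all-MiniLM-L6-v2 model can be downloaded on first use. The wrapper name keywords_for and its defaults are hypothetical; the KeyBERT class and its extract_keywords method come from the library. The import sits inside the function only so the snippet can be loaded without the package present.

```python
def keywords_for(doc, top_n=5, diversify=False):
    """Return (keyphrase, score) pairs for a document using KeyBERT.

    keywords_for is a hypothetical helper for illustration only.
    """
    # Imported here so that merely defining the helper does not require keybert.
    from keybert import KeyBERT

    kw_model = KeyBERT(model="all-MiniLM-L6-v2")
    return kw_model.extract_keywords(
        doc,
        keyphrase_ngram_range=(1, 2),  # consider single words and two-word phrases
        stop_words="english",          # drop English stop words from candidates
        top_n=top_n,
        use_mmr=diversify,             # switch on MMR diversification
        diversity=0.5,                 # only used when use_mmr is True
    )
```

Calling keywords_for("some document text") should return a list of (keyphrase, score) tuples. The exact phrases and scores depend on the embedding model and library version.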

Highlighted Details

Supports various embedding models including Sentence-Transformers, Flair, SpaCy, and USE.
Offers Max Sum Distance and Maximal Marginal Relevance (MMR) for result diversification.
Integrates with Large Language Models (LLMs) for keyword extraction via OpenAI.
Minimal usage: pip install keybert and 3 lines of code.
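
To make the diversification idea concrete, here is a minimal sketch of MMR-style selection over precomputed similarity scores. It is a generic illustration under stated assumptions, not KeyBERT's own implementation: the function name mmr_select, the candidate phrases, and all similarity values are made up for the example.

```python
def mmr_select(doc_sims, pair_sims, top_n=2, diversity=0.5):
    """Pick up to top_n candidate indices, trading relevance against redundancy.

    doc_sims[i]      similarity of candidate i to the document
    pair_sims[i][j]  similarity between candidates i and j
    diversity        0.0 = pure relevance, 1.0 = pure novelty
    """
    if top_n <= 0 or not doc_sims:
        return []

    remaining = list(range(len(doc_sims)))
    # Seed with the single most relevant candidate.
    first = max(remaining, key=lambda i: doc_sims[i])
    selected = [first]
    remaining.remove(first)

    while remaining and len(selected) < top_n:
        def mmr_score(i):
            # Penalise a candidate by its closest already-selected neighbour.
            redundancy = max(pair_sims[i][j] for j in selected)
            return (1 - diversity) * doc_sims[i] - diversity * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Made-up candidates and scores, for illustration only.
candidates = ["supervised learning", "supervised learning algorithm", "training data"]
doc_sims = [0.80, 0.78, 0.60]
pair_sims = [
    [1.00, 0.95, 0.30],
    [0.95, 1.00, 0.35],
    [0.30, 0.35, 1.00],
]

diverse = [candidates[i] for i in mmr_select(doc_sims, pair_sims, top_n=2, diversity=0.5)]
plain = [candidates[i] for i in mmr_select(doc_sims, pair_sims, top_n=2, diversity=0.0)]
print(diverse)
print(plain)
```

With diversity set to 0.0 the selection falls back to plain relevance ranking and keeps both near-duplicate phrases. A higher value lets the less similar "training data" displace the second one.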

Maintenance & Community

Developed by Maarten Grootendorst.
BibTeX citation available for academic use.
Open to suggestions for new papers or repositories.

Licensing & Compatibility

License: MIT.
Compatible with commercial use and closed-source projects.

Limitations & Caveats

KeyBERT relies on pre-trained models, so keyword quality depends on how well the chosen embedding model suits the input text. LLM integration through OpenAI requires an API key and incurs usage costs.

Health Check

Last Commit: 2 months ago
Responsiveness: 1 day

Star History

13 stars in the last 30 days