BERTopic by MaartenGr

Topic modeling with transformers and c-TF-IDF

created 4 years ago
6,929 stars

Top 7.5% on sourcepulse

Project Summary

BERTopic is a Python library for topic modeling that combines transformer embeddings with class-based TF-IDF (c-TF-IDF) to generate easily interpretable topics. It is aimed at researchers and data scientists who need to uncover themes and patterns in large text corpora. Its modular architecture lets users customize each component of the topic modeling pipeline.

How It Works

BERTopic chains several stages. By default it uses Sentence-Transformers to create document embeddings, applies UMAP to reduce their dimensionality, and clusters the reduced embeddings into topics with HDBSCAN. It then applies class-based TF-IDF (c-TF-IDF), which treats all documents in a cluster as a single document, to extract the most representative words for each topic. This multi-stage approach aims to produce more coherent and meaningful topics than traditional methods.
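
To make the last stage concrete, here is a minimal pure-Python sketch of c-TF-IDF. It assumes the weighting tf(x, c) * log(1 + A / f(x)), where tf(x, c) is the normalized frequency of word x in cluster c, f(x) is the frequency of x across all clusters, and A is the average number of words per cluster. The function names, the whitespace tokenizer, and the toy clusters are illustrative assumptions, not BERTopic's own implementation.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """Score words per topic; docs_per_topic maps topic id -> list of strings."""
    # Treat all documents of a cluster as one "class document" and count its words.
    # Whitespace tokenization is a simplification for this sketch.
    class_counts = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_per_topic.items()
    }

    # f(x): frequency of each word across all classes.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)

    # A: average number of words per class.
    avg_words = sum(total_counts.values()) / len(class_counts)

    scores = {}
    for topic, counts in class_counts.items():
        n_words = sum(counts.values())
        scores[topic] = {
            word: (count / n_words) * math.log(1 + avg_words / total_counts[word])
            for word, count in counts.items()
        }
    return scores


def top_words(scores, n=3):
    """Return the n highest-scoring words per topic (ties broken alphabetically)."""
    return {
        topic: [w for w, _ in sorted(ws.items(), key=lambda kv: (-kv[1], kv[0]))[:n]]
        for topic, ws in scores.items()
    }


# Hypothetical toy clusters standing in for the output of the clustering stage.
clusters = {
    0: ["cat chased mouse", "cat sat on mat"],
    1: ["market fell on news", "market rallied", "stock market"],
}
scores = c_tf_idf(clusters)
print(top_words(scores, n=1))
```

In this toy data the word "on" appears in both clusters, so it scores lower than a word such as "mat" that appears equally often but only in one cluster. That is the property that makes the extracted words representative of a single topic.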

Quick Start & Requirements
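
The original quick-start content is not available here, so the following is only a minimal sketch of typical usage. It assumes the package has been installed with `pip install bertopic`; the helper name `fit_topics` is hypothetical.

```python
def fit_topics(docs):
    """Fit BERTopic with default settings on a list of strings.

    Returns the fitted model and the topic id assigned to each document.
    """
    # Imported lazily so this module still loads where bertopic is not installed.
    from bertopic import BERTopic

    topic_model = BERTopic()
    # fit_transform returns one topic id per document plus assignment probabilities.
    topics, _probs = topic_model.fit_transform(docs)
    return topic_model, topics
```

Calling `fit_topics(docs)` needs a reasonably large corpus (hundreds of documents or more) for the clustering step to find meaningful groups. Afterwards `topic_model.get_topic_info()` summarizes the discovered topics, and `topic_model.get_topic(0)` lists the top words of topic 0 with their scores.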

Highlighted Details

  • Supports a wide range of topic modeling variations including supervised, semi-supervised, dynamic, hierarchical, multimodal, and zero-shot approaches.
  • Offers extensive visualization tools to explore topics, documents, and their relationships.
  • Allows fine-tuning topic representations with models such as KeyBERT or large language models such as GPT.
  • Modular design enables swapping or removing components like embedding models, dimensionality reduction, and clustering algorithms.
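
The modular design can be sketched as follows: each stage is built separately and passed to the `BERTopic` constructor. The helper name `build_custom_topic_model`, the chosen embedding model, and all hyperparameter values are illustrative assumptions rather than recommended settings.

```python
def build_custom_topic_model(min_cluster_size=15):
    """Assemble a BERTopic model from explicitly chosen components."""
    # Imported lazily so this module still loads where the packages are missing.
    from bertopic import BERTopic
    from bertopic.representation import KeyBERTInspired
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from sklearn.feature_extraction.text import CountVectorizer
    from umap import UMAP

    # Stage 1: document embeddings (any Sentence-Transformers model could go here).
    embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

    # Stage 2: dimensionality reduction; a fixed seed makes runs repeatable.
    umap_model = UMAP(
        n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine", random_state=42
    )

    # Stage 3: clustering of the reduced embeddings.
    hdbscan_model = HDBSCAN(
        min_cluster_size=min_cluster_size,
        metric="euclidean",
        cluster_selection_method="eom",
        prediction_data=True,
    )

    # Stage 4: tokenization used by c-TF-IDF; stop words are removed here.
    vectorizer_model = CountVectorizer(stop_words="english")

    # Optional stage: fine-tune topic words with a KeyBERT-inspired representation.
    representation_model = KeyBERTInspired()

    return BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        vectorizer_model=vectorizer_model,
        representation_model=representation_model,
    )
```

Swapping a component means replacing one of these objects, for example passing a different clustering model as `hdbscan_model`, while the rest of the pipeline stays unchanged.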

Maintenance & Community

  • Developed by Maarten Grootendorst.
  • Active development with frequent updates and new features.
  • Community support via GitHub issues.

Licensing & Compatibility

  • License: MIT.
  • Compatible with commercial use and closed-source applications.

Limitations & Caveats

The library's flexibility can lead to a steep learning curve. Performance is heavily dependent on the choice of embedding model and the quality of the underlying clustering. Some advanced features, like LLM integration for topic representation, may require additional API keys and setup.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 5
  • Issues (30d): 8

Star History

  • 235 stars in the last 90 days
