topicGPT by chtmp223

Prompt-based framework for topic modeling research

created 1 year ago
330 stars

Project Summary

TopicGPT is a prompt-based framework for topic modeling. It uses large language models to generate hierarchical topics, refine them, and assign them to documents. It targets researchers and practitioners in natural language processing and data analysis who want to uncover thematic structure in text data.

How It Works

TopicGPT performs topic modeling through a sequence of prompts to a large language model (LLM). It first generates high-level topics, then generates more specific, low-level topics within each high-level one. It also includes functions to refine the topic list by merging similar topics and removing irrelevant ones, and to assign topics to documents together with supporting quotes. Because topics come from prompting, the approach needs no predefined vocabulary or extensive parameter tuning, and it can potentially give a more intuitive and nuanced view of the text.
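
The generate, refine, and assign steps described above can be illustrated with a minimal sketch. It is not the topicgpt_python API: every function name here (generate_topics, refine_topics, assign_topic, stub_llm) and every prompt string is hypothetical, and a keyword-matching stub stands in for a real LLM call so the example runs offline. Refinement is reduced to merging labels that differ only in case or whitespace and dropping rarely seen labels, which is a much cruder rule than an LLM-driven merge.

```python
from collections import Counter
from typing import Callable

# Hypothetical stand-in for an LLM client: takes a prompt, returns a reply.
LLM = Callable[[str], str]


def stub_llm(prompt: str) -> str:
    """Keyword-matching stub used instead of a real model (assumption)."""
    doc = prompt.split("Document: ", 1)[1].split("\n", 1)[0]
    topic = "Agriculture" if "farm" in doc.lower() else "Trade"
    if "Answer as" in prompt:
        # Assignment prompts get a "topic | quote" reply.
        return f"{topic} | {doc}"
    return topic


def generate_topics(docs: list[str], llm: LLM) -> list[str]:
    """Ask for one topic label per document, showing the labels seen so far."""
    labels: list[str] = []
    for doc in docs:
        known = ", ".join(dict.fromkeys(labels)) or "none"
        prompt = f"Known topics: {known}\nDocument: {doc}\nTopic:"
        labels.append(llm(prompt).strip())
    return labels


def refine_topics(labels: list[str], min_count: int = 2) -> list[str]:
    """Merge labels differing only in case/whitespace; drop rare labels."""
    counts = Counter(label.strip().title() for label in labels)
    return [topic for topic, n in counts.items() if n >= min_count]


def assign_topic(doc: str, topics: list[str], llm: LLM) -> tuple[str, str]:
    """Ask the model to pick a topic and give a supporting quote."""
    prompt = (
        f"Topics: {', '.join(topics)}\n"
        f"Document: {doc}\n"
        "Answer as 'topic | quote':"
    )
    topic, _, quote = llm(prompt).partition("|")
    return topic.strip(), quote.strip()


docs = [
    "Farm subsidies rose this year.",
    "New tariffs hit steel imports.",
    "Dairy farms face drought.",
    "Export quotas were lifted.",
]
labels = generate_topics(docs, stub_llm)
topics = refine_topics(labels, min_count=2)
assignments = [assign_topic(doc, topics, stub_llm) for doc in docs]
```

A second, low-level pass would repeat the generation step once per refined topic, restricted to the documents assigned to that topic. Swapping stub_llm for a function that calls a hosted or local model is the only change needed to try the same flow against a real LLM.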

Quick Start & Requirements

  • Install via pip: pip install topicgpt_python
  • Requires Python 3.9+.
  • Supports OpenAI API, VertexAI, Azure API, Gemini API, and vLLM.
  • API keys and project/location details must be set as environment variables.
  • Input data should be a .jsonl file in which each record has a "text" field.
  • See: demo.ipynb
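
As a rough illustration of the data requirement above, the following sketch writes a small .jsonl file with one JSON object per line and checks that every record has a string "text" field. It uses only the Python standard library. The helper names (write_jsonl, check_jsonl) and the file name sample.jsonl are made up for this example and are not part of topicgpt_python.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path: Path, texts: list[str]) -> None:
    """Write one JSON object per line, each with a "text" field."""
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")


def check_jsonl(path: Path) -> int:
    """Return the record count; raise ValueError if a record lacks "text"."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            if not isinstance(record, dict) or not isinstance(record.get("text"), str):
                raise ValueError(f"line {line_no}: missing string 'text' field")
            count += 1
    return count


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.jsonl"  # hypothetical file name
    write_jsonl(sample_path, ["First document.", "Second document."])
    n_records = check_jsonl(sample_path)
```

Running a check like this before calling any paid API catches malformed input early.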

Highlighted Details

  • Offers five core functions: generate_topic_lvl1, generate_topic_lvl2, refine_topics, assign_topics, and correct_topics.
  • Includes metric calculation functions for evaluating topic alignment with ground-truth labels (ARI, Purity, NMI).
  • Supports multiple LLM providers and local inference via vLLM.
  • Paper accepted at NAACL'24.
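
To make the alignment metrics concrete, here is a standalone sketch of two of them, purity and the adjusted Rand index (ARI), written from their standard definitions with the Python standard library only. It does not call the package's own metric functions, whose names and signatures are not shown in this summary. The function names and the toy labels below are assumptions for illustration, and NMI is left out to keep the example short.

```python
from collections import Counter
from math import comb


def purity(truth: list, pred: list) -> float:
    """Fraction of documents that carry the majority true label of their cluster."""
    cells = Counter(zip(pred, truth))
    best_per_cluster: dict = {}
    for (cluster, _label), n in cells.items():
        best_per_cluster[cluster] = max(best_per_cluster.get(cluster, 0), n)
    return sum(best_per_cluster.values()) / len(truth)


def adjusted_rand_index(truth: list, pred: list) -> float:
    """Pair-counting agreement between two labelings, corrected for chance."""
    n = len(truth)
    if n < 2:
        return 1.0  # no pairs to compare
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (sum_cells - expected) / (maximum - expected)


# Toy data (made up): ground-truth labels vs. assigned topic ids.
truth = ["a", "a", "a", "b", "b", "b"]
pred = [0, 0, 1, 1, 1, 1]
example_purity = purity(truth, pred)
example_ari = adjusted_rand_index(truth, pred)
perfect_ari = adjusted_rand_index(["a", "a", "b", "b"], [1, 1, 0, 0])
```

Both scores depend only on how documents are grouped, not on the label names, so a topic assignment that matches the ground truth under a different naming still scores 1.0.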

Maintenance & Community

  • The primary author is Chau Minh Pham.
  • The project is associated with the NAACL'24 conference.

Licensing & Compatibility

  • The repository does not explicitly state a license. The code is provided for research purposes related to the NAACL'24 paper.

Limitations & Caveats

  • Performance depends on the capabilities of the underlying LLM and on the quality of the prompts.
  • API usage incurs costs, and the README advises testing with cheaper models first.
  • No explicit license is mentioned, which may impact commercial use or integration into closed-source projects.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 3
  • Star history: 42 stars in the last 90 days
