scattertext  by JasonKessler

Tool for visualizing language differences across document types

created 9 years ago
2,307 stars

Top 20.2% on sourcepulse

Project Summary

Scattertext is a Python library for visualizing how language differs across document categories. It enables users to identify distinguishing terms and phrases, presenting them in interactive HTML scatter plots with intelligent label placement to avoid overlap. The library is suitable for researchers, data scientists, and anyone needing to explore linguistic differences in text corpora.

How It Works

Scattertext builds interactive visualizations by mapping terms or phrases to a 2D scatter plot. The position of each point is determined by its association scores with different categories, often calculated using metrics like PMI, Scaled F-Score, or Cohen's d. The library leverages spaCy for robust text processing and offers extensive customization for plot appearance, term scoring, and data integration.
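To make the idea of association-based positions concrete, here is a minimal standard-library sketch that gives each term an (x, y) coordinate from its pointwise mutual information (PMI) with two categories. It is an illustration only, not Scattertext's implementation: the function name pmi_coordinates, the whitespace tokenizer, and the add-one smoothing are all assumptions made for this example.

```python
import math
from collections import Counter


def pmi_coordinates(docs, x_category, y_category):
    """Map each term to (PMI with x_category, PMI with y_category).

    docs is a list of (category, text) pairs. Tokenization is a plain
    lowercase whitespace split, and counts use add-one smoothing so a
    term absent from one category still gets a finite coordinate.
    """
    term_cat = Counter()    # (term, category) -> count
    term_total = Counter()  # term -> count over all categories
    cat_total = Counter()   # category -> token count
    for category, text in docs:
        for token in text.lower().split():
            term_cat[(token, category)] += 1
            term_total[token] += 1
            cat_total[category] += 1

    vocab_size = len(term_total)
    n_categories = len(cat_total)
    n_tokens = sum(cat_total.values())

    def pmi(term, category):
        # Smoothed P(term | category) and the matching smoothed P(term).
        p_term_given_cat = (term_cat[(term, category)] + 1) / (
            cat_total[category] + vocab_size
        )
        p_term = (term_total[term] + n_categories) / (
            n_tokens + n_categories * vocab_size
        )
        return math.log2(p_term_given_cat / p_term)

    return {t: (pmi(t, x_category), pmi(t, y_category)) for t in term_total}


# Hypothetical toy corpus, invented for illustration.
toy_docs = [
    ("democrat", "health care jobs health"),
    ("republican", "tax cut jobs tax"),
]
for term, (x, y) in sorted(pmi_coordinates(toy_docs, "democrat", "republican").items()):
    print(f"{term:8s} x={x:+.3f} y={y:+.3f}")
```

Terms used more in one category than the other land away from the diagonal, which is the intuition behind reading distinguishing terms off the plot.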

Quick Start & Requirements

  • Install via pip: pip install scattertext
  • Recommended dependencies: spacy, pandas, numpy, matplotlib, scikit-learn, gensim, umap-learn, pytextrank, empath, jieba.
  • Python 3.7+ is recommended.
  • HTML outputs are best viewed in Chrome and Safari.
  • Official documentation and tutorials are available in the README.
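
After installing, a first plot can be produced roughly as follows. This sketch follows the pattern shown in the project's README, using the bundled 2012 convention speeches and the library's whitespace tokenizer so that no spaCy model download is needed. The wrapper function build_explorer_html and its default argument are hypothetical names chosen for this example, and the scattertext import is placed inside the function only so the snippet can be loaded before the package is installed.

```python
def build_explorer_html(minimum_term_frequency: int = 5) -> str:
    """Return the HTML for an interactive Democratic-vs-Republican term plot."""
    import scattertext as st  # requires: pip install scattertext

    # Sample data shipped with the library: columns 'party', 'speaker', 'text'.
    df = st.SampleCorpora.ConventionData2012.get_data()

    # Lightweight regex-based parser bundled with Scattertext (no spaCy model).
    df = df.assign(parse=df["text"].apply(st.whitespace_nlp_with_sentences))

    corpus = st.CorpusFromParsedDocuments(
        df, category_col="party", parsed_col="parse"
    ).build()

    return st.produce_scattertext_explorer(
        corpus,
        category="democrat",
        category_name="Democratic",
        not_category_name="Republican",
        minimum_term_frequency=minimum_term_frequency,
        width_in_pixels=1000,
        metadata=df["speaker"],
    )
```

Calling build_explorer_html() and writing the returned string to a file such as convention_plot.html (an arbitrary name) gives a page that can be opened in a browser.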

Highlighted Details

  • Supports visualization of unigrams, bigrams, noun chunks, and custom phrases via libraries like PyTextRank.
  • Offers a wide array of term scoring methods, including Scaled F-Score, Cohen's d, Cliff's Delta, BNS, and custom correlations.
  • Includes features for visualizing topic models, word embeddings (via Gensim, UMAP, SVD), and emoji usage.
  • Provides a command-line interface for basic analysis without Python scripting.
  • Allows for custom term positioning and axis scaling.
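
As a rough illustration of one of the scoring methods named above, the sketch below computes the textbook Cohen's d for a term, comparing its per-document usage rate in two groups of documents. It is not taken from Scattertext's code. The helper names term_rate and cohens_d, the whitespace tokenizer, and the choice to return 0.0 when the pooled variance is zero are assumptions for this example.

```python
import math
import statistics


def term_rate(text, term):
    """Fraction of whitespace-separated tokens in text equal to term."""
    tokens = text.lower().split()
    return tokens.count(term) / len(tokens) if tokens else 0.0


def cohens_d(sample_a, sample_b):
    """Standardized mean difference using the pooled sample standard deviation.

    Each sample needs at least two values so a sample variance exists.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    if pooled_var == 0:
        return 0.0  # assumption: treat identical, constant samples as no effect
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(pooled_var)


# Hypothetical toy documents, invented for illustration.
group_a = ["tax cut tax plan", "tax relief for jobs now"]
group_b = ["health care for all", "health plan and tax credit"]
rates_a = [term_rate(doc, "tax") for doc in group_a]
rates_b = [term_rate(doc, "tax") for doc in group_b]
print(cohens_d(rates_a, rates_b))
```

A positive value means the term is used at a higher average rate in the first group, scaled by how much that rate varies from document to document.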

Maintenance & Community

The project is actively maintained by Jason Kessler and has seen contributions from various community members. Links to community resources like Discord or Slack are not explicitly mentioned in the README.

Licensing & Compatibility

The library is released under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Scattertext can be resource-intensive on very large corpora, and the resulting HTML plots may load slowly in the browser. Some advanced features may require specific spaCy models or other external libraries. The documentation is extensive but described as a work in progress.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 13 stars in the last 90 days
