scattertext by JasonKessler

Tool for visualizing language differences across document types

Created 9 years ago

2,328 stars

Top 19.4% on SourcePulse

4 Experts Love This Project

Jeff Hammerbacher

Cofounder of Cloudera

Luis Capelo

Cofounder of Lightning AI

Project Summary

Scattertext is a Python library for visualizing how language differs across document categories. It enables users to identify distinguishing terms and phrases, presenting them in interactive HTML scatter plots with intelligent label placement to avoid overlap. The library is suitable for researchers, data scientists, and anyone needing to explore linguistic differences in text corpora.

How It Works

Scattertext builds interactive visualizations by mapping terms or phrases to a 2D scatter plot. The position of each point is determined by its association scores with different categories, often calculated using metrics like PMI, Scaled F-Score, or Cohen's d. The library leverages spaCy for robust text processing and offers extensive customization for plot appearance, term scoring, and data integration.
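
To make the idea of a per-category association score concrete, here is a minimal sketch that computes pointwise mutual information (PMI) for each term and category using only the standard library. It is not Scattertext's implementation: the function name `pmi_scores`, the whitespace tokenization, and the toy documents are assumptions made for illustration.

```python
import math
from collections import Counter


def pmi_scores(docs):
    """Return {term: {category: PMI}} for a list of (category, text) pairs.

    Hypothetical helper for illustration only; tokenization is a plain
    lowercase whitespace split rather than spaCy parsing.
    """
    term_cat = Counter()  # joint counts of (term, category)
    term_tot = Counter()  # total count of each term
    cat_tot = Counter()   # total token count of each category
    for category, text in docs:
        for term in text.lower().split():
            term_cat[(term, category)] += 1
            term_tot[term] += 1
            cat_tot[category] += 1
    n = sum(cat_tot.values())  # total number of tokens in the corpus
    scores = {}
    for (term, category), joint in term_cat.items():
        # PMI = log2( P(term, category) / (P(term) * P(category)) )
        pmi = math.log2(joint * n / (term_tot[term] * cat_tot[category]))
        scores.setdefault(term, {})[category] = pmi
    return scores


# Toy two-category corpus (made-up data).
docs = [
    ("a", "jobs economy jobs"),
    ("b", "economy freedom freedom"),
]
scores = pmi_scores(docs)
```

In this toy corpus, a term used only in one category ("jobs", "freedom") scores above zero for that category, while a term used equally in both ("economy") scores zero. Pairs of scores like these are the kind of values that could serve as a point's coordinates.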

Quick Start & Requirements

Install via pip: pip install scattertext
Recommended dependencies: spacy, pandas, numpy, matplotlib, scikit-learn (imported as sklearn), gensim, umap-learn, pytextrank, empath, jieba.
Python 3.7+ is recommended.
HTML outputs are best viewed in Chrome and Safari.
Official documentation and tutorials are available in the README.

Highlighted Details

Supports visualization of unigrams, bigrams, noun chunks, and custom phrases via libraries like PyTextRank.
Offers a wide array of term scoring methods, including Scaled F-Score, Cohen's d, Cliff's Delta, BNS, and custom correlations.
Includes features for visualizing topic models, word embeddings (via Gensim, UMAP, SVD), and emoji usage.
Provides a command-line interface for basic analysis without Python scripting.
Allows for custom term positioning and axis scaling.
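
As an illustration of one of the scoring methods listed above, the following sketch computes Cohen's d for a term by comparing its per-document relative frequency in two groups of documents. It uses the textbook pooled-standard-deviation formula and is not taken from the library's code; the helper names `term_rates` and `cohens_d` and the sample documents are assumptions.

```python
import math
from statistics import mean, variance


def term_rates(docs, term):
    """Relative frequency of `term` in each whitespace-tokenized document."""
    rates = []
    for text in docs:
        tokens = text.lower().split()
        rates.append(tokens.count(term) / len(tokens) if tokens else 0.0)
    return rates


def cohens_d(xs, ys):
    """Cohen's d using the pooled sample standard deviation.

    Each group needs at least two observations. Returning 0.0 when the
    pooled variance is zero is a simplification chosen for this sketch.
    """
    nx, ny = len(xs), len(ys)
    pooled_var = ((nx - 1) * variance(xs) + (ny - 1) * variance(ys)) / (nx + ny - 2)
    if pooled_var == 0:
        return 0.0
    return (mean(xs) - mean(ys)) / math.sqrt(pooled_var)


# Made-up documents for two categories.
group_a = ["jobs jobs tax", "jobs plan now"]
group_b = ["tax plan now", "jobs tax cut"]

d_jobs = cohens_d(term_rates(group_a, "jobs"), term_rates(group_b, "jobs"))
```

A positive value means the term is relatively more frequent in the first group; the magnitude expresses the difference in units of the pooled standard deviation.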

Maintenance & Community

The project is maintained by Jason Kessler, with contributions from community members. The README does not link to community channels such as Discord or Slack.

Licensing & Compatibility

The library is released under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Scattertext can be resource-intensive on very large corpora, and the resulting HTML plots may load slowly in browsers. Some advanced features require specific spaCy models or other external libraries. The documentation is extensive but is described as a work in progress.

Health Check

Last Commit

8 months ago

Responsiveness

Inactive

Star History

3 stars in the last 30 days