Tool for visualizing language differences across document types
Scattertext is a Python library for visualizing how language differs across document categories. It enables users to identify distinguishing terms and phrases, presenting them in interactive HTML scatter plots with intelligent label placement to avoid overlap. The library is suitable for researchers, data scientists, and anyone needing to explore linguistic differences in text corpora.
How It Works
Scattertext builds interactive visualizations by plotting each term or phrase as a point on a 2D scatter plot. A point's position is determined by the term's association scores with the categories being compared, often calculated with metrics such as pointwise mutual information (PMI), Scaled F-Score, or Cohen's d. The library uses spaCy for text processing and offers extensive customization of plot appearance, term scoring, and data integration.
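To make the idea of association scores as plot coordinates concrete, here is a minimal sketch that computes a smoothed PMI for each term against two categories and uses the pair of scores as (x, y). It is an illustration only, not Scattertext's implementation or API: the function name `pmi_coordinates`, the whitespace tokenizer, and the add-one smoothing are all assumptions made for this example.

```python
import math
from collections import Counter


def pmi_coordinates(docs, alpha=1.0):
    """Map each term to (PMI with first category, PMI with second category).

    docs  : iterable of (category, text) pairs covering exactly two categories.
    alpha : additive smoothing (an assumption of this sketch) so that a term
            absent from one category still gets a finite score.
    """
    joint = Counter()      # (term, category) -> count
    term_total = Counter() # term -> count
    cat_total = Counter()  # category -> token count

    for category, text in docs:
        for term in text.lower().split():  # naive whitespace tokenizer
            joint[(term, category)] += 1
            term_total[term] += 1
            cat_total[category] += 1

    categories = sorted(cat_total)
    if len(categories) != 2:
        raise ValueError("this sketch expects exactly two categories")

    n_terms = len(term_total)
    # Smoothed total, chosen so the smoothed marginals stay consistent.
    denom = sum(cat_total.values()) + alpha * n_terms * len(categories)

    def pmi(term, category):
        p_joint = (joint[(term, category)] + alpha) / denom
        p_term = (term_total[term] + alpha * len(categories)) / denom
        p_cat = (cat_total[category] + alpha * n_terms) / denom
        return math.log2(p_joint / (p_term * p_cat))

    cat_x, cat_y = categories
    return {term: (pmi(term, cat_x), pmi(term, cat_y)) for term in term_total}


# Hypothetical toy corpus: two categories, one short document each.
toy_docs = [("a", "tax cut tax"), ("b", "health care cut")]
coords = pmi_coordinates(toy_docs)
for term, (x, y) in sorted(coords.items()):
    print(f"{term:>6}  x={x:+.3f}  y={y:+.3f}")
```

Terms that lean toward one category get a positive score on that axis and a negative score on the other, while terms used equally in both sit near the origin. A real plot would also need label placement and interactivity, which this sketch leaves out.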
Quick Start & Requirements
pip install scattertext
spacy, pandas, numpy, matplotlib, sklearn, gensim, umap-learn, pytextrank, empath, jieba
Highlighted Details
Maintenance & Community
The project is actively maintained by Jason Kessler and has seen contributions from various community members. Links to community resources like Discord or Slack are not explicitly mentioned in the README.
Licensing & Compatibility
The library is released under the MIT License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
Scattertext can be resource-intensive for very large corpora, which may lead to slow loading times in browsers. Some advanced features may require specific spaCy models or other external libraries. The documentation is extensive but is noted as a work in progress.