wtpsplit by segment-any-text

Text segmentation toolkit for robust sentence splitting

Created 5 years ago
1,128 stars

Top 34.0% on SourcePulse

Project Summary

This library provides robust, efficient, and adaptable text segmentation into sentences or other semantic units, targeting NLP researchers and developers. It offers state-of-the-art performance across 85 languages with its SaT models, significantly outperforming previous methods and existing tools.

How It Works

The core is the SaT (Segment Any Text) model, a universal approach to sentence segmentation. It leverages a transformer architecture, offering improved performance and reduced computational cost compared to the previous WtP (Where's the Point?) model. SaT models can be further adapted to specific domains or languages using LoRA, enabling highly customized segmentation.
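As a rough illustration of the LoRA adaptation described above, the sketch below loads a SaT model either plain or with an adapter. It assumes the `SaT` constructor accepts `style_or_domain` and `language` keyword arguments, as shown in the project README at the time of writing; the helper name `load_sat` and the default model name are illustrative choices, not part of the library.

```python
from typing import Optional


def load_sat(model_name: str = "sat-3l-sm",
             style_or_domain: Optional[str] = None,
             language: Optional[str] = None):
    """Load a SaT model, optionally with a LoRA adapter (hypothetical helper)."""
    # Imported lazily so this module can be imported without wtpsplit installed.
    from wtpsplit import SaT

    if style_or_domain is None:
        # Plain model, no adaptation.
        return SaT(model_name)

    # Assumption: adapters are selected by a style/domain name plus a
    # language code, e.g. load_sat("sat-12l", "ud", "en").
    return SaT(model_name, style_or_domain=style_or_domain, language=language)
```

Which adapter names and base models are actually available should be checked against the repository; the values in the comment are examples only.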

Quick Start & Requirements

  • Install: pip install wtpsplit. For ONNX support, use pip install wtpsplit[onnx-gpu] (GPU) or pip install wtpsplit[onnx-cpu] (CPU).
  • Requirements: Python, PyTorch. GPU with CUDA is recommended for performance.
  • Docs: https://github.com/segment-any-text/wtpsplit
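After installation, basic sentence splitting might look like the sketch below. It assumes the `SaT` class and its `split` method behave as in the project README; `split_sentences` and `demo` are hypothetical helpers introduced here, and stripping whitespace from each segment is a choice made for this example, not library behavior.

```python
from typing import List


def split_sentences(text: str, splitter) -> List[str]:
    """Split text using any object with a wtpsplit-style .split(text) method."""
    # Segments may carry trailing whitespace; strip it and drop empty pieces.
    return [s.strip() for s in splitter.split(text) if s.strip()]


def demo() -> None:
    """Run against a real model (downloads weights on first use)."""
    from wtpsplit import SaT  # requires `pip install wtpsplit`

    sat = SaT("sat-3l-sm")
    # Optional, on a CUDA GPU: sat.half().to("cuda")
    for sentence in split_sentences("This is a test This is another test.", sat):
        print(sentence)
```

Calling `demo()` needs the package installed and network access for the first model download.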

Highlighted Details

  • State-of-the-art performance across 85 languages, outperforming spaCy, PySBD, and others.
  • ONNX support, with inference up to about 50% faster on GPUs.
  • LoRA-based adaptation to domains and styles such as legal documents, code-switching, and tweets, available for 81 languages.
  • Supports paragraph segmentation by leveraging newline prediction capabilities.
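The ONNX and paragraph features listed above can be sketched as follows. This assumes `split` accepts a `do_paragraph_segmentation` flag and that `SaT` accepts an `ort_providers` list, as shown in the project README at the time of writing; the helper names are hypothetical.

```python
from typing import List


def split_paragraphs(text: str, splitter) -> List[List[str]]:
    """Return paragraphs, each as a list of sentences."""
    # Assumption: with this flag the splitter yields one list of
    # sentences per paragraph, using its newline predictions.
    paragraphs = splitter.split(text, do_paragraph_segmentation=True)
    return [list(p) for p in paragraphs]


def load_onnx_sat(model_name: str = "sat-3l-sm"):
    """Load a SaT model on ONNX Runtime (needs an onnx-gpu or onnx-cpu extra)."""
    from wtpsplit import SaT

    # Providers are tried in order, so the CPU entry acts as a fallback.
    return SaT(
        model_name,
        ort_providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    )
```

A model returned by `load_onnx_sat()` could then be passed to `split_paragraphs` as the `splitter` argument.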

Licensing & Compatibility

  • Apache 2.0 License.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

The library includes previous WtP models for reproducibility, but SaT is the recommended and actively developed model. Some advanced features like LoRA export for ONNX are experimental.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 24 stars in the last 30 days
