jieba-php by fukuball

PHP library for Chinese text segmentation

Created 10 years ago

1,370 stars

Top 29.3% on SourcePulse

Project Summary

This PHP library provides Chinese text segmentation (word breaking) for developers who need to process Chinese text for analysis, search, or other NLP tasks. It offers multiple segmentation modes, Traditional Chinese support, custom dictionaries, TF-IDF keyword extraction, and part-of-speech (POS) tagging.

How It Works

The library combines three algorithms for word segmentation: a trie (prefix tree) for efficient word-graph scanning, dynamic programming to find the maximum-probability path based on word frequency, and a Hidden Markov Model (HMM) decoded with the Viterbi algorithm to handle unknown words. The dictionary-based steps cover known words quickly, while the HMM covers words missing from the dictionary.
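The word-graph and dynamic-programming steps described above can be sketched as follows. This is a minimal illustration in Python, not the library's PHP code: the `FREQ` dictionary and its counts are made up, a plain dictionary lookup stands in for the trie, and the function names (`build_dag`, `best_path`, `segment`) are hypothetical.

```python
import math

# Toy dictionary: word -> frequency. Hypothetical counts, for illustration only.
FREQ = {
    "我": 50, "来": 40, "到": 40, "来到": 30,
    "北": 5, "京": 5, "北京": 60,
    "清": 3, "华": 3, "清华": 20, "清华大学": 25,
    "大": 30, "学": 20, "大学": 40,
}
TOTAL = sum(FREQ.values())


def build_dag(sentence):
    """Map each start index to the end indices of dictionary words starting there."""
    dag = {}
    n = len(sentence)
    for i in range(n):
        ends = [j for j in range(i, n) if sentence[i:j + 1] in FREQ]
        dag[i] = ends or [i]  # unknown character: fall back to a one-character word
    return dag


def best_path(sentence, dag):
    """Right-to-left DP: route[i] = (best log-probability from i, end index of first word)."""
    n = len(sentence)
    log_total = math.log(TOTAL)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j + 1], 1)) - log_total + route[j + 1][0], j)
            for j in dag[i]
        )
    return route


def segment(sentence):
    """Follow the best route from left to right and collect the words."""
    route = best_path(sentence, build_dag(sentence))
    words, i = [], 0
    while i < len(sentence):
        j = route[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words
```

With this toy dictionary, `segment("我来到北京清华大学")` returns `["我", "来到", "北京", "清华大学"]`: longer dictionary words win because every extra word multiplies in another probability below one.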

Quick Start & Requirements

Installation: composer require fukuball/jieba-php
Usage: Include the Composer autoloader (require_once "/path/to/your/vendor/autoload.php";), then initialize the library's classes, for example with Jieba::init().
Dependencies: PHP.

Highlighted Details

Supports three segmentation modes: Accurate, Full, and Search Engine.
Offers Traditional Chinese segmentation and custom dictionary support.
Provides TF-IDF keyword extraction and part-of-speech (POS) tagging.
Handles multi-language CJK (Chinese, Japanese, Korean) text processing.
Includes memory management features for handling large datasets.
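
As a rough illustration of the TF-IDF keyword extraction mentioned in the list above, the sketch below ranks the words of one already-segmented document against a small corpus. It is written in Python rather than PHP, the function name `tfidf_keywords` and the smoothed IDF formula are assumptions chosen for this example, and the IDF values are computed from the toy corpus itself rather than taken from any file the library may ship.

```python
import math
from collections import Counter


def tfidf_keywords(doc_tokens, corpus, top_k=3):
    """Return the top_k tokens of doc_tokens ranked by TF-IDF against corpus."""
    n_docs = len(corpus)

    # Document frequency: in how many documents each token appears.
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))

    term_freq = Counter(doc_tokens)
    total = len(doc_tokens)

    scores = {}
    for word, count in term_freq.items():
        # Smoothed IDF (an assumption of this sketch) so unseen words do not divide by zero.
        idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1
        scores[word] = (count / total) * idf

    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:top_k]]
```

For example, with `corpus = [["分词", "算法", "算法"], ["分词", "词典"], ["分词", "搜索"]]`, calling `tfidf_keywords(corpus[0], corpus, top_k=1)` returns `["算法"]`, because "分词" appears in every document and is therefore down-weighted.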

Maintenance & Community

The project is maintained by fukuball. The README does not describe community channels or other ways to get involved.

Licensing & Compatibility

The software is released under the MIT License, allowing for commercial use and integration into closed-source projects.

Limitations & Caveats

The README acknowledges that LLM-based segmentation may yield better results and positions the library as a fast, cost-effective alternative. It also notes that some words missing from the dictionary may still be recognized by the Viterbi algorithm, so segmentation of out-of-vocabulary text depends on the HMM rather than on dictionary entries and may be less predictable.
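
To show how a Viterbi decoder can propose a word that is not in any dictionary, here is a minimal Python sketch that tags each character as B (begin), M (middle), E (end), or S (single) and then groups the tags into words. All probabilities in `START`, `TRANS`, and `EMIT` are hypothetical values invented for this example; a real model learns them from a corpus, and this is not the library's PHP implementation.

```python
import math

STATES = "BMES"  # Begin, Middle, End of a word, or Single-character word
NEG_INF = float("-inf")

# Hypothetical toy parameters, stored as log-probabilities.
START = {"B": math.log(0.6), "M": NEG_INF, "E": NEG_INF, "S": math.log(0.4)}
TRANS = {
    "B": {"M": math.log(0.3), "E": math.log(0.7)},
    "M": {"M": math.log(0.4), "E": math.log(0.6)},
    "E": {"B": math.log(0.5), "S": math.log(0.5)},
    "S": {"B": math.log(0.5), "S": math.log(0.5)},
}
EMIT = {
    "B": {"杭": math.log(0.2)},
    "M": {},
    "E": {"研": math.log(0.2)},
    "S": {"的": math.log(0.3)},
}
DEFAULT_EMIT = math.log(1e-3)  # score for characters missing from EMIT


def viterbi(text):
    """Return the most likely B/M/E/S tag sequence for text."""
    if not text:
        return []
    scores = [{s: START[s] + EMIT[s].get(text[0], DEFAULT_EMIT) for s in STATES}]
    back = [{}]
    for t in range(1, len(text)):
        scores.append({})
        back.append({})
        for s in STATES:
            # Best previous state for reaching s; disallowed transitions score -inf.
            prev = max(STATES, key=lambda p: scores[t - 1][p] + TRANS[p].get(s, NEG_INF))
            scores[t][s] = (scores[t - 1][prev] + TRANS[prev].get(s, NEG_INF)
                            + EMIT[s].get(text[t], DEFAULT_EMIT))
            back[t][s] = prev
    # A sentence must end on a complete word, so only E or S may be last.
    last = max("ES", key=lambda s: scores[-1][s])
    tags = [last]
    for t in range(len(text) - 1, 0, -1):
        tags.append(back[t][tags[-1]])
    return tags[::-1]


def tags_to_words(text, tags):
    """Cut the text after every E or S tag."""
    words, start = [], 0
    for i, tag in enumerate(tags):
        if tag in "ES":
            words.append(text[start:i + 1])
            start = i + 1
    return words
```

With these toy parameters, `viterbi("杭研")` yields the tags `["B", "E"]`, so the two characters are grouped into a single word even though no dictionary is consulted. Different parameters would give different groupings, which is why results for out-of-vocabulary text can vary.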

Health Check

Last Commit

3 weeks ago

Responsiveness

Inactive

Star History

2 stars in the last 30 days