rakutenma  by rakuten-nlp

JS library for Chinese/Japanese morphological analysis (segmentation + PoS tagging)

created 11 years ago
472 stars

Top 65.5% on sourcepulse

Project Summary

Rakuten MA is a pure JavaScript library for morphological analysis (word segmentation and part-of-speech tagging) of Chinese and Japanese text. It runs in both browsers and Node.js, supports online learning so that models can be updated incrementally, and offers several options for shrinking model size, which makes it suitable for web-based NLP applications.

How It Works

Rakuten MA implements a language-independent character tagging model using the Soft Confidence Weighted (SCW) learning algorithm. It supports customizable feature sets, including character unigrams, bigrams, and character type features, with optional feature hashing and quantization for model size reduction. The library allows for incremental training, enabling users to adapt pre-trained models or build new ones from scratch.
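
To make the character tagging idea concrete, here is a minimal sketch of how per-character tags can be merged back into words with part-of-speech labels. It assumes a common "<position>-<label>" tag scheme with positions B (begin), I (inside), E (end), and S (single-character word). The function name tagsToTokens, the tag format, and the labels N, P, and A are illustrative assumptions, not the library's internals.

```javascript
// Hypothetical helper: merge per-character tags into [surface, label] tokens.
// Assumed tag format: "<position>-<label>", with position in {B, I, E, S}.
function tagsToTokens(chars, tags) {
  if (chars.length !== tags.length) {
    throw new Error("chars and tags must have the same length");
  }
  const tokens = [];
  let surface = "";
  let label = null;
  for (let i = 0; i < chars.length; i++) {
    // Split on the first hyphen only, so labels may contain hyphens themselves.
    const sep = tags[i].indexOf("-");
    const position = tags[i].slice(0, sep);
    const tagLabel = tags[i].slice(sep + 1);
    // B and S open a new word, so flush any word still being built.
    if ((position === "B" || position === "S") && surface !== "") {
      tokens.push([surface, label]);
      surface = "";
    }
    if (surface === "") label = tagLabel; // a word takes the label of its first character
    surface += chars[i];
    // E and S close the current word.
    if (position === "E" || position === "S") {
      tokens.push([surface, label]);
      surface = "";
    }
  }
  if (surface !== "") tokens.push([surface, label]); // unterminated final word
  return tokens;
}

// Placeholder labels: N (noun), P (particle), A (adjective).
const chars = Array.from("彼は新しい");
const tags = ["S-N", "S-P", "B-A", "I-A", "E-A"];
const tokens = tagsToTokens(chars, tags);
console.log(JSON.stringify(tokens)); // [["彼","N"],["は","P"],["新しい","A"]]
```

Framing segmentation and tagging as one label per character is what lets a single classifier handle both tasks, and it is also why the approach is not tied to one language.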

Quick Start & Requirements

  • Install: npm install rakutenma
  • Usage: Load with require('rakutenma') in Node.js, or include rakutenma.js in an HTML page for browser use.
  • Prerequisites: Node.js for server-side usage. Browser compatibility confirmed for IE 8+, Chrome 35+, Firefox 16+, Safari 6.1+.
  • Resources: Bundled models are available for Chinese and Japanese.
  • Docs: API Documentation

Highlighted Details

  • Pure JavaScript implementation for cross-platform compatibility.
  • Supports online learning (SCW) for incremental model updates.
  • Offers customizable feature sets, feature hashing, and quantization to reduce model size.
  • Includes pre-trained models for Chinese and Japanese, trained on general and e-commerce corpora.
  • Offers a demo page for interactive testing.
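
The feature set and feature hashing items above can be illustrated with a short sketch. It builds character unigram, bigram, and character type features around one position, then folds each feature string into a 15-bit index. Everything here is an assumption made for illustration: the function names, the feature name prefixes, the coarse character type classes, and the FNV-1a style hash are not taken from the library.

```javascript
// Coarse character classes based on Unicode ranges (illustrative only).
function charType(ch) {
  if (ch === "_") return "pad"; // "_" is the edge padding symbol used below
  const cp = ch.codePointAt(0);
  if (cp >= 0x3040 && cp <= 0x309f) return "hiragana";
  if (cp >= 0x30a0 && cp <= 0x30ff) return "katakana";
  if (cp >= 0x4e00 && cp <= 0x9fff) return "kanji";
  if (cp >= 0x30 && cp <= 0x39) return "digit";
  if ((cp >= 0x41 && cp <= 0x5a) || (cp >= 0x61 && cp <= 0x7a)) return "latin";
  return "other";
}

// Unigram, bigram, and character type features around position i.
// The "w0:" / "b0:" / "c0:" prefixes are made-up names for this sketch.
function extractFeatures(chars, i) {
  const at = (k) => (k >= 0 && k < chars.length ? chars[k] : "_");
  return [
    "w-1:" + at(i - 1),
    "w0:" + at(i),
    "w1:" + at(i + 1),
    "b-1:" + at(i - 1) + at(i),
    "b0:" + at(i) + at(i + 1),
    "c0:" + charType(at(i)),
  ];
}

// FNV-1a style hash over UTF-16 code units, masked down to `bits` bits.
// With bits = 15 every feature maps to an index in [0, 32768).
function hashFeature(feature, bits) {
  let h = 0x811c9dc5;
  for (let j = 0; j < feature.length; j++) {
    h ^= feature.charCodeAt(j);
    h = Math.imul(h, 0x01000193); // 32-bit multiply without float rounding
  }
  return (h >>> 0) & ((1 << bits) - 1);
}

const chars = Array.from("楽天市場");
const features = extractFeatures(chars, 1);
const indices = features.map((f) => hashFeature(f, 15));
console.log(features, indices);
```

Hashing trades a small amount of accuracy (distinct features can collide) for a weight table of fixed size. It also means a model is only usable with the same hash function and bit width it was built with.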

Maintenance & Community

  • Developed by the Rakuten NLP Project, sponsored by Rakuten, Inc. and Rakuten Institute of Technology.
  • Acknowledgements list several key contributors.
  • Bug reports and pull requests are handled through GitHub.
  • Contact email: prj-rakutenma [at] mail.rakuten.com.

Licensing & Compatibility

  • Licensed under the Apache License version 2.0.
  • Commercial use is permitted under the terms of the Apache License.

Limitations & Caveats

  • Chinese support is limited to Simplified Chinese.
  • Model files for browser use require conversion from JSON to JS format.
  • The bundled models must be used with 15-bit feature hashing and the matching feature set; any other configuration produces analysis errors.
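
As a rough illustration of the JSON-to-JS conversion mentioned above, the sketch below wraps model JSON in a variable assignment so that a plain script tag can load it. The function name jsonToJsSource, the variable name model, and the toy model contents are assumptions for this example. The project may provide its own tooling for this step, which should be preferred when available.

```javascript
// Hypothetical converter: turn model JSON text into JavaScript source that
// defines a global variable. The default variable name is an assumption.
function jsonToJsSource(jsonText, varName = "model") {
  // Parse and re-serialize so malformed JSON fails here, not in the browser.
  const data = JSON.parse(jsonText);
  const literal = JSON.stringify(data)
    // U+2028 and U+2029 are valid in JSON but break string literals in older
    // JavaScript engines, so emit them as escapes.
    .replace(/\u2028/g, "\\u2028")
    .replace(/\u2029/g, "\\u2029");
  return "var " + varName + " = " + literal + ";\n";
}

// Toy placeholder data, not a real model file.
const source = jsonToJsSource('{"mu": {"w0:楽": 0.5}}');
console.log(source);
```

In Node.js the result could be written out with fs.writeFileSync and then referenced from a script tag placed before the code that uses the model.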

Health Check

  • Last commit: 6 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 90 days