fast-langdetect  by LlmKira

Fast language detection powered by FastText

Created 1 year ago
254 stars

Top 99.1% on SourcePulse

Project Summary

fast-langdetect is a language detection library built on Facebook's FastText. It targets developers who need efficient language identification, claiming speedups of up to 80x over conventional methods, up to 95% accuracy, and offline operation suited to high-throughput applications and resource-constrained environments.

How It Works

The library uses pre-trained FastText models and reports up to 95% accuracy. A memory-friendly 'lite' model (~45-60 MB RSS) supports offline use, while a more accurate 'full' model needs ~170-210 MB RSS. In 'auto' mode, the library falls back to the 'lite' model if loading the full model raises MemoryError.

Quick Start & Requirements

  • Installation: pip install fast-langdetect
  • Python Support: 3.9 to 3.13.
  • Dependencies: No NumPy required.
  • Resource Footprint: Lite model: ~45-60 MB RSS; Full model: ~170-210 MB RSS. Models are downloaded to the system temp directory by default; the location can be changed via FTLANG_CACHE or LangDetectConfig(cache_dir=...).
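Because a user-provided cache directory must exist beforehand, it can help to create it before pointing the library at it. The sketch below uses only the standard library and assumes FTLANG_CACHE is read as an environment variable; prepare_cache_dir and the directory name are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def prepare_cache_dir(path: Path) -> Path:
    """Create the cache directory if needed and export it via FTLANG_CACHE."""
    # The directory has to exist before the library tries to use it.
    path.mkdir(parents=True, exist_ok=True)
    # Assumption: FTLANG_CACHE is an environment variable that should be set
    # before the library is first used.
    os.environ["FTLANG_CACHE"] = str(path)
    return path


# Example location under the system temp directory (illustrative name).
cache_dir = prepare_cache_dir(Path(tempfile.gettempdir()) / "ftlang-cache-example")
print(cache_dir.is_dir())
```

The same path could instead be passed as LangDetectConfig(cache_dir=...), the other option mentioned above.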

Highlighted Details

  • Up to 80x faster than conventional methods.
  • Up to 95% accuracy.
  • Offline detection via memory-friendly 'lite' model.
  • Utilities for mapping BCP-47 codes to display names (using langcodes/pycountry).
  • Supports loading custom FastText language identification models.

Maintenance & Community

Builds on zafercavdar/fasttext-langdetect, adding packaging enhancements. Credits contributions from @dalf and github@JackyHe398. No community channels or explicit maintenance signals are documented.

Licensing & Compatibility

  • Code License: MIT License.
  • Model License: CC BY-SA 3.0. Redistribution/modification of models requires CC BY-SA 3.0 compliance. Inference usage is unaffected.
  • Compatibility: MIT license is permissive for commercial use and closed-source linking.

Limitations & Caveats

Accuracy may drop for very short inputs and for long inputs, which are truncated: the default max_input_length is 80 characters, and truncation logs a warning. In 'auto' mode, only MemoryError triggers the fallback; other errors propagate. A user-provided cache directory must already exist.
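To make the truncation behaviour concrete, here is a minimal sketch of clipping input to 80 characters and logging a warning when text is cut. It illustrates the idea only and is not the library's implementation; clip_for_detection and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("langdetect_example")

# Mirrors the default max_input_length quoted above.
MAX_INPUT_LENGTH = 80


def clip_for_detection(text: str, max_len: int = MAX_INPUT_LENGTH) -> str:
    """Return text unchanged if it fits, otherwise its first max_len characters."""
    if len(text) > max_len:
        logger.warning(
            "Input of %d characters truncated to %d", len(text), max_len
        )
        return text[:max_len]
    return text


clipped = clip_for_detection("word " * 40)  # 200 characters in
print(len(clipped))
```

For long documents, choosing a representative passage before detection may give better results than relying on the first 80 characters.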

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 9 stars in the last 30 days
