fast-langdetect  by LlmKira

Fast language detection powered by FastText

Created 2 years ago
293 stars

Top 90.4% on SourcePulse

Project Summary

fast-langdetect is a fast, accurate language detection library built on Facebook's FastText. It targets developers who need efficient language identification, and it can run offline, which suits high-throughput applications and resource-constrained environments.

How It Works

The library uses pre-trained FastText models and reports up to 95% accuracy. It provides a memory-friendly 'lite' model for offline use (~45-60 MB RSS) and a more accurate 'full' model (~170-210 MB RSS). An 'auto' mode tries the full model first and falls back to 'lite' if loading the full model raises MemoryError.
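The 'auto' behaviour described above can be sketched roughly as follows. This is an illustration of the fallback pattern only, not the library's actual code: load_with_fallback and both loader callables are hypothetical names introduced for the example.

```python
from typing import Callable, Tuple


def load_with_fallback(
    load_full: Callable[[], object],
    load_lite: Callable[[], object],
) -> Tuple[str, object]:
    """Try the full model first; fall back to the lite model on MemoryError.

    Both loaders are hypothetical stand-ins for whatever actually loads
    a FastText model; they are passed in so the sketch stays self-contained.
    """
    try:
        return "full", load_full()
    except MemoryError:
        # Only MemoryError triggers the fallback. Any other exception
        # (missing file, corrupt model, network failure) propagates.
        return "lite", load_lite()
```

Passing the loaders in as callables keeps the fallback rule separate from model loading, so the rule can be exercised without a real model file.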

Quick Start & Requirements

  • Installation: pip install fast-langdetect
  • Python Support: 3.9 to 3.13.
  • Dependencies: No NumPy required.
  • Resource Footprint: Lite model: ~45-60 MB RSS; Full model: ~170-210 MB RSS. Models download to the system temp directory by default; the location is configurable via the FTLANG_CACHE environment variable or LangDetectConfig(cache_dir=...).
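As a rough illustration of how the cache location might be resolved, the sketch below uses only the standard library. The function name resolve_cache_dir, the precedence of an explicit argument over FTLANG_CACHE, and the "fasttext-langdetect" subdirectory name are all assumptions made for this example; the source states only that the default is the system temp directory and that FTLANG_CACHE or cache_dir can override it.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def resolve_cache_dir(explicit: Optional[str] = None) -> Path:
    """Pick a model cache directory (hypothetical helper).

    Assumed precedence: explicit argument, then the FTLANG_CACHE
    environment variable, then a folder under the system temp directory.
    """
    if explicit:
        return Path(explicit)
    env_value = os.environ.get("FTLANG_CACHE")
    if env_value:
        return Path(env_value)
    # The subdirectory name is an assumption for this sketch.
    return Path(tempfile.gettempdir()) / "fasttext-langdetect"
```

Because the system temp directory may be cleared between reboots, a persistent path is usually the safer choice for offline deployments.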

Highlighted Details

  • Up to 80x faster than conventional methods.
  • Up to 95% accuracy.
  • Offline detection via memory-friendly 'lite' model.
  • Utilities for mapping BCP-47 codes to display names (using langcodes/pycountry).
  • Supports loading custom FastText language identification models.

Maintenance & Community

Builds on zafercavdar/fasttext-langdetect with packaging enhancements. The project credits contributions from @dalf and @JackyHe398. No community channels or active-maintenance signals are documented.

Licensing & Compatibility

  • Code License: MIT License.
  • Model License: CC BY-SA 3.0. Redistribution/modification of models requires CC BY-SA 3.0 compliance. Inference usage is unaffected.
  • Compatibility: MIT license is permissive for commercial use and closed-source linking.

Limitations & Caveats

Accuracy may decrease for very short inputs and for excessively long ones: inputs beyond max_input_length (default 80 characters) are truncated, and the truncation logs a warning. 'Auto' mode falls back only on MemoryError; other errors propagate. A user-provided cache directory must already exist.
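Two of these caveats can be handled on the caller's side. The sketch below shows one possible approach under stated assumptions: prepare_text mimics the described truncation so that callers can see what would be cut, and ensure_cache_dir creates the directory before it is handed to the library. Both function names and the logger name are hypothetical and are not part of fast-langdetect.

```python
import logging
from pathlib import Path

logger = logging.getLogger("langdetect_sketch")  # hypothetical logger name


def prepare_text(text: str, max_input_length: int = 80) -> str:
    """Truncate text to max_input_length characters, logging a warning."""
    if len(text) > max_input_length:
        logger.warning(
            "Input truncated from %d to %d characters",
            len(text),
            max_input_length,
        )
        return text[:max_input_length]
    return text


def ensure_cache_dir(path: str) -> Path:
    """Create the cache directory (and any parents) if it is missing."""
    cache_dir = Path(path)
    cache_dir.mkdir(parents=True, exist_ok=True)
    return cache_dir
```

Truncating explicitly before detection makes the cut-off visible in the calling code instead of leaving it to a log message.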

Health Check

  • Last Commit: 5 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 11 stars in the last 30 days
