chardet by chardet

Python character encoding and language detection

Created 13 years ago
2,590 stars

Top 17.7% on SourcePulse

Summary

chardet 7 is a character encoding and language detector for Python. It is aimed at developers who need to process text of unknown encoding, and it is designed as a drop-in replacement for earlier chardet versions with substantially better accuracy and speed. It is released under the permissive 0BSD license.

How It Works

The library runs a 13-stage detection pipeline whose stages include byte order mark (BOM) detection, magic number identification, structural probing, byte validity filtering, and bigram statistical models. Optional mypyc compilation speeds up processing further. The project reports higher accuracy and speed than earlier chardet releases and competing detectors.
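To make two of the named stages concrete, here is a minimal sketch of BOM detection and byte validity filtering using only the Python standard library. It is not chardet's code: the function names `detect_bom` and `filter_valid` are hypothetical, and the real pipeline has many more stages.

```python
import codecs
from typing import List, Optional

# UTF-32 BOMs must be checked before UTF-16 ones, because the UTF-32 LE
# BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_bom(data: bytes) -> Optional[str]:
    """Return a Python codec name if the data starts with a known BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def filter_valid(data: bytes, candidates: List[str]) -> List[str]:
    """Keep only the candidate encodings that decode the data without error."""
    valid = []
    for name in candidates:
        try:
            data.decode(name)
        except (UnicodeDecodeError, LookupError):
            continue  # invalid byte sequence or unknown codec name
        valid.append(name)
    return valid
```

A BOM gives a definitive answer when present, so it is a natural first stage. Validity filtering only narrows the candidate set, which is why a detector still needs statistical models to choose among the survivors.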

Quick Start & Requirements

  • Primary install: pip install chardet
  • Prerequisites: Python 3.10+; zero runtime dependencies.
  • Documentation: chardet.readthedocs.io
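After installation, basic use might look like the sketch below. It assumes that `chardet.detect` keeps the interface of earlier releases, taking bytes and returning a dict with an `"encoding"` key, which is what the drop-in claim suggests. The helper `decode_unknown` and its `detect` and `fallback` parameters are hypothetical names introduced for illustration.

```python
from typing import Callable, Optional


def decode_unknown(
    data: bytes,
    detect: Optional[Callable[[bytes], dict]] = None,
    fallback: str = "utf-8",
) -> str:
    """Decode bytes of unknown encoding using a chardet-style detector."""
    if detect is None:
        # Imported lazily so the helper can also be exercised with a stub.
        import chardet
        detect = chardet.detect
    result = detect(data)
    # Assumption: the detector reports None when it cannot decide.
    encoding = result.get("encoding") or fallback
    # errors="replace" avoids an exception if the guess is wrong.
    return data.decode(encoding, errors="replace")
```

Calling `decode_unknown(raw_bytes)` with no other arguments uses the installed chardet. Passing a custom `detect` callable makes it easy to compare detectors or to test without the library.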

Highlighted Details

  • Accuracy: 99.3% on 2,517 files, a +11.1 percentage point improvement over chardet 6.0.0.
  • Speed: 47x faster than chardet 6.0.0, 1.5x faster than charset-normalizer 3.4.6.
  • Language Detection: 95.7% accuracy across 49 languages.
  • MIME Type Detection: Identifies 40+ binary formats and common text types.
  • Streaming: Supports incremental processing for large data.
  • Filtering: Candidates can be restricted by encoding era or by explicit include/exclude lists of encodings.
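The streaming feature can be sketched as follows. This assumes chardet 7 still exposes the `UniversalDetector` class from earlier releases, with `feed()`, `close()`, `done`, and `result`; check the documentation before relying on it. The helper `detect_chunks` and its `make_detector` parameter are hypothetical.

```python
from typing import Callable, Iterable, Optional


def detect_chunks(
    chunks: Iterable[bytes],
    make_detector: Optional[Callable[[], object]] = None,
) -> dict:
    """Feed data to an incremental detector chunk by chunk."""
    if make_detector is None:
        # Assumption: this import path is kept in chardet 7.
        from chardet import UniversalDetector
        make_detector = UniversalDetector
    detector = make_detector()
    for chunk in chunks:
        detector.feed(chunk)
        if detector.done:
            break  # the detector is confident; stop reading early
    detector.close()  # finalizes the result
    return detector.result
```

For a large file, `chunks` could be a generator that reads fixed-size blocks, so the whole file never has to be held in memory.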

Maintenance & Community

chardet 7.x is a 2026 ground-up rewrite by Dan Blanchard, distinct from earlier codebases. Historical commits from the original author are preserved in a separate branch. No specific community channels are listed.

Licensing & Compatibility

chardet 7 is licensed under the permissive 0BSD license, which allows unrestricted commercial and closed-source use.

Limitations & Caveats

chardet 7.x is a complete rewrite and is not derived from pre-7.0 code. Although the API remains compatible, the break in code lineage may matter to users with strict provenance requirements.

Health Check

  • Last commit: 15 hours ago
  • Responsiveness: Inactive
  • Pull requests (30d): 15
  • Issues (30d): 8
  • Star history: 127 stars in the last 30 days
