libpostal by openvenues

C library for parsing/normalizing street addresses using statistical NLP

created 10 years ago
4,500 stars

Top 11.1% on sourcepulse

Project Summary

libpostal is a C library for parsing and normalizing international street addresses. It uses statistical Natural Language Processing (NLP) and open geospatial data to handle the ways addresses vary across languages and conventions, so that addresses become suitable for machine processing, indexing, and comparison. The library benefits applications that deal with location data, such as mapping services, delivery platforms, and geocoding systems, by making address handling simpler and more consistent.

How It Works

The library parses addresses with Conditional Random Fields (CRFs) trained on over a billion addresses sourced from OpenStreetMap (OSM) and OpenAddresses. Address-formatting templates are used to generate tagged training examples for every country. To cope with messy real-world input, normalization combines abbreviation expansion, numeric expression parsing, and transliteration. A language classifier, also trained on OSM data, identifies the language of an address so that the appropriate normalization rules can be applied.
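Abbreviation expansion, mentioned above, is easiest to see on a small example. The following is a minimal sketch of the idea only, not libpostal's implementation or API: the `ABBREVIATIONS` table and the `expand_candidates` function are hypothetical names invented for illustration, and the four-entry dictionary stands in for the much larger, language-specific dictionaries a real normalizer would use. It shows why an ambiguous token such as "St" leads to more than one normalized candidate.

```python
from itertools import product

# Hypothetical, tiny abbreviation dictionary for illustration only.
# A real normalizer would select a dictionary per detected language.
ABBREVIATIONS = {
    "st": ["street", "saint"],
    "ave": ["avenue"],
    "n": ["north"],
    "apt": ["apartment"],
}


def expand_candidates(address: str) -> list:
    """Return every normalized string obtained by expanding known abbreviations."""
    # Lowercase and treat commas and periods as separators before tokenizing.
    tokens = address.lower().replace(",", " ").replace(".", " ").split()
    # Each token maps to its possible expansions, or to itself when unknown.
    options = [ABBREVIATIONS.get(token, [token]) for token in tokens]
    # The Cartesian product enumerates one candidate per combination of choices.
    return [" ".join(choice) for choice in product(*options)]


if __name__ == "__main__":
    for candidate in expand_candidates("120 N. Main St"):
        print(candidate)
```

Returning every candidate rather than guessing a single one lets a downstream index or matcher compare two addresses by checking whether their candidate sets overlap.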

Quick Start & Requirements

  • Installation: Requires build tools (gcc, autoconf, etc.), curl, and pkg-config. Clone the repository, then run ./bootstrap.sh, ./configure, make, and sudo make install, in that order.
  • Data: Requires downloading several gigabytes of data files.
  • Platform: Primarily targets Linux and macOS. Windows support requires MSYS2/MinGW. On M1 Macs, passing --disable-sse2 to ./configure may be needed for compilation.
  • Resources: Expect the build and data download to take significant time, and plan for several GB of free disk space.
  • Documentation: The official blog posts and the GitHub repository provide detailed information.

Highlighted Details

  • Achieves 99.45% full parse accuracy on held-out test data.
  • Supports over 60 languages for normalization and over 30 for numeric expression parsing.
  • Offers an alternative Senzing model for improved parsing of US, UK, and Singapore addresses.
  • Provides official language bindings for Python, Ruby, Go, Java, PHP, and Node.js.
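The 99.45% figure in the list above is a full parse accuracy, which is a stricter measure than per-token accuracy. The sketch below assumes that "full parse" means every token of an address received the correct label. That reading is an assumption made here, and `parse_accuracy` is a hypothetical helper, not part of libpostal. The label names are used only as sample data.

```python
def parse_accuracy(predicted, expected):
    """Return (full_parse_accuracy, token_accuracy).

    Both arguments are parallel lists; each element is the list of labels
    assigned to the tokens of one address.
    """
    full_correct = 0
    token_correct = 0
    token_total = 0
    for pred, gold in zip(predicted, expected):
        # An address counts as a full parse only if every label matches.
        if pred == gold:
            full_correct += 1
        token_correct += sum(p == g for p, g in zip(pred, gold))
        token_total += len(gold)
    return full_correct / len(expected), token_correct / token_total


if __name__ == "__main__":
    expected = [["house_number", "road", "city"], ["house_number", "road"]]
    predicted = [["house_number", "road", "city"], ["house_number", "city"]]
    print(parse_accuracy(predicted, expected))
```

In this toy data one mislabelled token out of five leaves token accuracy at 0.8 but drops full parse accuracy to 0.5, which is why the full parse number is the harder one to score well on.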

Maintenance & Community

The project is sponsored by various organizations and individuals. Contributions are welcomed via GitHub issues and pull requests.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: Permissive license allows for commercial use and integration into closed-source applications.

Limitations & Caveats

libpostal does not verify address validity or perform geocoding to latitude/longitude coordinates. It is focused solely on parsing and normalization. The Senzing model, while offering improved accuracy for specific regions, is larger than the default model.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull requests (30d): 3
  • Issues (30d): 1
  • Star history: 274 stars in the last 90 days
