C library for parsing/normalizing street addresses using statistical NLP
libpostal is a C library for parsing and normalizing international street addresses. It uses statistical Natural Language Processing (NLP) and open geospatial data to handle address variations across languages and conventions, producing output suitable for machine processing, indexing, and comparison. Applications that deal with location data, such as mapping services, delivery platforms, and geocoding systems, can use it to make address handling simpler and more consistent.
How It Works
The library employs Conditional Random Fields (CRFs) for address parsing, trained on over a billion addresses sourced from OpenStreetMap and OpenAddresses. It uses address-formatting templates to generate tagged training examples for every country, incorporating techniques like abbreviation expansion, numeric expression parsing, and transliteration to robustly handle messy real-world input. A language classifier, trained on OSM data, identifies address languages to apply appropriate normalization rules.
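To make the abbreviation-expansion idea concrete, here is a minimal, self-contained sketch in plain C. It is not libpostal's implementation or API: the `abbrev_entry` type, the `ABBREVS` table, and the `lookup` and `expand_address` functions are all hypothetical names invented for illustration, and the five-entry table stands in for the much larger, language-specific dictionaries a real normalizer would need.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy dictionary (assumption: English only, one expansion
   per abbreviation). Not taken from libpostal. */
typedef struct {
    const char *abbrev;
    const char *expansion;
} abbrev_entry;

static const abbrev_entry ABBREVS[] = {
    {"st", "street"},
    {"ave", "avenue"},
    {"rd", "road"},
    {"n", "north"},
    {"apt", "apartment"},
};

/* Return the expansion for a lowercase token, or the token itself. */
static const char *lookup(const char *token) {
    for (size_t i = 0; i < sizeof ABBREVS / sizeof ABBREVS[0]; i++) {
        if (strcmp(token, ABBREVS[i].abbrev) == 0) {
            return ABBREVS[i].expansion;
        }
    }
    return token;
}

/* Lowercase the input, split it on whitespace, commas, and periods,
   expand known abbreviations, and join the tokens with single spaces.
   Returns 0 on success, -1 if a token or the output does not fit. */
int expand_address(const char *input, char *out, size_t out_size) {
    char token[64];
    size_t tlen = 0, olen = 0;

    if (out_size == 0) return -1;
    out[0] = '\0';

    for (const char *p = input;; p++) {
        unsigned char c = (unsigned char)*p;
        int is_sep = (c == '\0' || isspace(c) || c == ',' || c == '.');

        if (!is_sep) {
            if (tlen + 1 >= sizeof token) return -1;
            token[tlen++] = (char)tolower(c);
        } else if (tlen > 0) {
            token[tlen] = '\0';
            const char *word = lookup(token);
            size_t wlen = strlen(word);
            size_t need = wlen + (olen > 0 ? 1 : 0);

            if (olen + need + 1 > out_size) return -1;
            if (olen > 0) out[olen++] = ' ';
            memcpy(out + olen, word, wlen);
            olen += wlen;
            out[olen] = '\0';
            tlen = 0;
        }
        if (c == '\0') break;
    }
    return 0;
}
```

Calling `expand_address("781 Franklin Ave., Brooklyn", buf, sizeof buf)` with a sufficiently large `buf` fills it with a lowercase, punctuation-free string in which `ave` has become `avenue`. The sketch always picks a single expansion, so an ambiguous token such as `st` (street or saint) is resolved blindly; handling that ambiguity is one reason real-world normalization needs language-aware rules rather than a flat lookup table.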
Quick Start & Requirements
Build from source with ./bootstrap.sh, ./configure, make, and sudo make install.
Use --disable-sse2 for compilation on platforms without SSE2 support, such as ARM CPUs.
Highlighted Details
Maintenance & Community
The project is sponsored by various organizations and individuals. Contributions are welcomed via GitHub issues and pull requests.
Licensing & Compatibility
Limitations & Caveats
libpostal does not verify address validity or perform geocoding to latitude/longitude coordinates. It is focused solely on parsing and normalization. The Senzing model, while offering improved accuracy for specific regions, is larger than the default model.