Entity matching solution using pre-trained language models
Ditto is an open-source library for deep entity matching (EM) built on pre-trained language models (LMs) such as BERT. It identifies records that refer to the same entity across datasets by framing EM as a sequence-pair classification task and fine-tuning the LM on that task. The library is aimed primarily at researchers and practitioners who need entity resolution over structured, dirty, or textual data.
How It Works
Ditto serializes data entries into text sequences, using the special tokens COL and VAL to mark the start of each attribute name and attribute value. The serialized text is then fed into a pre-trained language model that is fine-tuned for matching. Because the LM captures context, it can make nuanced comparisons between entries. Three optimizations further improve performance and adaptability: data augmentation (MixDA), injection of domain-specific knowledge, and TF-IDF-based summarization.
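The serialization and summarization steps described above can be sketched in a few lines of Python. This is a minimal illustration, not Ditto's own code: the function names (serialize_entry, serialize_pair, summarize), the tab-separated pair layout, and the exact TF-IDF formula are assumptions made for the example, and attribute names are assumed to be single tokens.

```python
import math
from collections import Counter

SPECIAL = {"COL", "VAL"}  # marker tokens named in the text above


def serialize_entry(entry):
    """Flatten a record (dict) into 'COL <name> VAL <value> ...' text."""
    return " ".join(f"COL {name} VAL {value}" for name, value in entry.items())


def serialize_pair(left, right, label=None):
    """Join two serialized entries (and an optional 0/1 label) with tabs.

    The tab-separated layout is an assumption of this sketch.
    """
    parts = [serialize_entry(left), serialize_entry(right)]
    if label is not None:
        parts.append(str(label))
    return "\t".join(parts)


def summarize(text, corpus, max_len):
    """Keep at most max_len tokens, dropping the lowest TF-IDF tokens first.

    Marker tokens and the attribute name right after COL are always kept.
    Hypothetical simplification of TF-IDF based summarization.
    """
    tokens = text.split()
    if len(tokens) <= max_len:
        return text

    n_docs = len(corpus)
    # Document frequency: in how many corpus entries each token appears.
    doc_freq = Counter(tok for doc in corpus for tok in set(doc.split()))
    term_freq = Counter(tokens)

    def score(i):
        tok = tokens[i]
        if tok in SPECIAL or (i > 0 and tokens[i - 1] == "COL"):
            return math.inf  # never drop structure tokens
        idf = math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1
        return term_freq[tok] * idf

    # Rank positions by score, keep the best max_len, restore original order.
    ranked = sorted(range(len(tokens)), key=lambda i: (-score(i), i))
    keep = sorted(ranked[:max_len])
    return " ".join(tokens[i] for i in keep)
```

With a small corpus of serialized entries, summarize drops common words such as "the" before rarer, more discriminative tokens, which is the intent of summarizing long values so they fit the LM's input length.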
Quick Start & Requirements
conda install -c conda-forge nvidia-apex
pip install -r requirements.txt
python -m spacy download en_core_web_lg
Requirements: Python 3.7.7, PyTorch 1.9, spaCy with the en_core_web_lg model, and NVIDIA Apex (for fp16 training).
Maintenance & Community
The project is associated with megagonlabs. Further community or maintenance details are not explicitly provided in the README.
Licensing & Compatibility
The README does not explicitly state the license. Users should verify licensing for commercial or closed-source use.
Limitations & Caveats
The project requires specific older versions of Python (3.7.7) and PyTorch (1.9), which may pose compatibility challenges with newer environments. NVIDIA Apex is a required dependency for fp16 training, adding a hardware-specific requirement.