ditto by megagonlabs

Entity matching solution using pre-trained language models

created 5 years ago
284 stars

Top 93.1% on sourcepulse

View on GitHub
Project Summary

Ditto is an open-source library for deep entity matching (EM) that leverages pre-trained language models (LMs) like BERT. It addresses the challenge of identifying duplicate records across datasets by framing EM as a sequence-pair classification task, offering fine-tuning capabilities for enhanced accuracy. The library is primarily aimed at researchers and practitioners working with structured, dirty, or textual data requiring robust entity resolution.

How It Works

Ditto serializes data entries into text sequences, incorporating special tokens (COL, VAL) to delineate attribute names and values. This structured text is then fed into a fine-tuned pre-trained language model. The approach is advantageous due to the powerful contextual understanding provided by LMs, enabling nuanced comparisons between data entries. Novel optimizations like data augmentation (MixDA), domain-specific knowledge injection, and TF-IDF based summarization further boost performance and adaptability.
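The serialization scheme described above can be sketched in a few lines. The format (a `COL <attr> VAL <value>` token stream per entry, with the two entries joined as a sequence pair) follows the Ditto paper; the helper name and the `[SEP]` joiner shown here are illustrative, not Ditto's actual API.

```python
# Illustrative sketch of Ditto-style entry serialization (not the library's own code).
def serialize(entry: dict) -> str:
    """Turn a record into a 'COL <attr> VAL <value>' token sequence."""
    return " ".join(f"COL {attr} VAL {val}" for attr, val in entry.items())

left = {"title": "instant immersion spanish deluxe 2.0", "price": "49.99"}
right = {"title": "instant immers spanish dlux 2", "price": "36.11"}

# The serialized pair is what the fine-tuned LM classifies as match / no-match.
pair = f"{serialize(left)} [SEP] {serialize(right)}"
print(pair)
```

Because attribute names survive serialization, the LM can learn that a mismatch in `price` matters differently from a mismatch in `title`.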

Quick Start & Requirements

  • Install:
    conda install -c conda-forge nvidia-apex
    pip install -r requirements.txt
    python -m spacy download en_core_web_lg
    
  • Prerequisites: Python 3.7.7, PyTorch 1.9, HuggingFace Transformers 4.9.2, spaCy (en_core_web_lg), NVIDIA Apex (for fp16 training).
  • Resources: Requires GPU for training and inference.
  • Demo: A Colab notebook is available for training and prediction.

Highlighted Details

  • Supports multiple pre-trained LMs: BERT, DistilBERT, and ALBERT.
  • Offers three key optimizations: Data Augmentation (MixDA), Domain Knowledge Injection, and Summarization.
  • Evaluated on ER_Magellan and WDC product matching benchmarks, demonstrating performance across various data characteristics.
  • Includes utilities for serializing data entries and managing configurations.
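The summarization optimization listed above keeps only the most informative tokens of a long entry so the serialized pair fits within the LM's maximum sequence length. A minimal sketch of the TF-IDF idea, assuming a plain token-level scorer (this is an illustration of the technique, not Ditto's implementation):

```python
# Sketch: retain the top-k tokens of a long entry by TF-IDF score (illustrative).
import math
from collections import Counter

def tfidf_summarize(doc: str, corpus: list, k: int = 5) -> str:
    """Keep the k highest-TF-IDF tokens of `doc`, preserving their order."""
    docs_tokens = [d.split() for d in corpus]
    n = len(corpus)
    # document frequency: in how many corpus entries each token appears
    df = Counter(tok for toks in docs_tokens for tok in set(toks))
    tokens = doc.split()
    tf = Counter(tokens)
    scores = {t: tf[t] * math.log(n / (1 + df[t])) for t in set(tokens)}
    top = set(sorted(scores, key=scores.get, reverse=True)[:k])
    return " ".join(t for t in tokens if t in top)

corpus = ["apple phone case red", "apple phone charger", "banana stand"]
print(tfidf_summarize(corpus[0], corpus, k=2))  # rare tokens survive, common ones drop
```

Tokens shared by many entries (like a ubiquitous brand name) score low and are dropped first, which is exactly the behavior you want before truncating to the LM's input limit.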

Maintenance & Community

The project is associated with megagonlabs. Further community or maintenance details are not explicitly provided in the README.

Licensing & Compatibility

The README does not explicitly state the license. Users should verify licensing for commercial or closed-source use.

Limitations & Caveats

The project requires specific older versions of Python (3.7.7) and PyTorch (1.9), which may pose compatibility challenges with newer environments. NVIDIA Apex is a required dependency for fp16 training, adding a hardware-specific requirement.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 13 stars in the last 90 days

