ditto by megagonlabs

Entity matching solution using pre-trained language models

created 5 years ago
284 stars

Top 93.1% on sourcepulse

View on GitHub
Project Summary

Ditto is an open-source library for deep entity matching (EM) that leverages pre-trained language models (LMs) like BERT. It addresses the challenge of identifying duplicate records across datasets by framing EM as a sequence-pair classification task, offering fine-tuning capabilities for enhanced accuracy. The library is primarily aimed at researchers and practitioners working with structured, dirty, or textual data requiring robust entity resolution.

How It Works

Ditto serializes data entries into text sequences, incorporating special tokens (COL, VAL) to delineate attribute names and values. This structured text is then fed into a fine-tuned pre-trained language model. The approach is advantageous due to the powerful contextual understanding provided by LMs, enabling nuanced comparisons between data entries. Novel optimizations like data augmentation (MixDA), domain-specific knowledge injection, and TF-IDF based summarization further boost performance and adaptability.
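The serialization scheme described above can be sketched in a few lines. The format (a `COL <attr> VAL <value>` token stream per entry, with the two entries joined as a sequence pair) follows the Ditto paper; the helper name and the `[SEP]` joiner shown here are illustrative, not Ditto's actual API.

```python
# Illustrative sketch of Ditto-style entry serialization (not the library's own code).
def serialize(entry: dict) -> str:
    """Turn a record into a 'COL <attr> VAL <value>' token sequence."""
    return " ".join(f"COL {attr} VAL {val}" for attr, val in entry.items())

left = {"title": "instant immersion spanish deluxe 2.0", "price": "49.99"}
right = {"title": "instant immers spanish dlux 2", "price": "36.11"}

# The serialized pair is what the fine-tuned LM classifies as match / no-match.
pair = f"{serialize(left)} [SEP] {serialize(right)}"
print(pair)
```

Because attribute names survive serialization, the LM can learn that a mismatch in `price` matters differently from a mismatch in `title`.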

Quick Start & Requirements

  • Install:
    conda install -c conda-forge nvidia-apex
    pip install -r requirements.txt
    python -m spacy download en_core_web_lg
    
  • Prerequisites: Python 3.7.7, PyTorch 1.9, HuggingFace Transformers 4.9.2, spaCy (en_core_web_lg), NVIDIA Apex (for fp16 training).
  • Resources: Requires GPU for training and inference.
  • Demo: A Colab notebook is available for training and prediction.

Highlighted Details

  • Supports multiple pre-trained LMs: BERT, DistilBERT, and ALBERT.
  • Offers three key optimizations: Data Augmentation (MixDA), Domain Knowledge Injection, and Summarization.
  • Evaluated on ER_Magellan and WDC product matching benchmarks, demonstrating performance across various data characteristics.
  • Includes utilities for serializing data entries and managing configurations.
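The summarization optimization listed above keeps only the most informative tokens of a long entry so the serialized pair fits within the LM's maximum sequence length. A minimal sketch of the TF-IDF idea, assuming a plain token-level scorer (this is an illustration of the technique, not Ditto's implementation):

```python
# Sketch: retain the top-k tokens of a long entry by TF-IDF score (illustrative).
import math
from collections import Counter

def tfidf_summarize(doc: str, corpus: list, k: int = 5) -> str:
    """Keep the k highest-TF-IDF tokens of `doc`, preserving their order."""
    docs_tokens = [d.split() for d in corpus]
    n = len(corpus)
    # document frequency: in how many corpus entries each token appears
    df = Counter(tok for toks in docs_tokens for tok in set(toks))
    tokens = doc.split()
    tf = Counter(tokens)
    scores = {t: tf[t] * math.log(n / (1 + df[t])) for t in set(tokens)}
    top = set(sorted(scores, key=scores.get, reverse=True)[:k])
    return " ".join(t for t in tokens if t in top)

corpus = ["apple phone case red", "apple phone charger", "banana stand"]
print(tfidf_summarize(corpus[0], corpus, k=2))  # rare tokens survive, common ones drop
```

Tokens shared by many entries (like a ubiquitous brand name) score low and are dropped first, which is exactly the behavior you want before truncating to the LM's input limit.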

Maintenance & Community

The project is associated with megagonlabs. Further community or maintenance details are not explicitly provided in the README.

Licensing & Compatibility

The README does not explicitly state the license. Users should verify licensing for commercial or closed-source use.

Limitations & Caveats

The project requires specific older versions of Python (3.7.7) and PyTorch (1.9), which may pose compatibility challenges with newer environments. NVIDIA Apex is a required dependency for fp16 training, adding a hardware-specific requirement.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 13 stars in the last 90 days

