awesome-document-similarity by malteos

Curated list of resources on document similarity measures

created 5 years ago
251 stars

Top 99.8% on sourcepulse

Project Summary

This repository is a curated list of resources on document similarity measures, targeting students and researchers in Natural Language Processing (NLP) and Information Retrieval (IR). It provides a comprehensive overview of papers, tutorials, and code for applications like recommender systems, clustering, and plagiarism detection, with a focus on long-form and rich content documents.

How It Works

The repository categorizes document similarity into lexical, structural, and semantic dimensions. It details various document representation techniques, from traditional Bag-of-Words and TF-IDF to learned embeddings such as Word2Vec, GloVe, BERT, and Sentence Transformers. Similarity is typically computed over vector representations with measures such as cosine similarity or Euclidean distance, or directly over strings with edit distance.
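To make the vector-based approach concrete, here is a minimal sketch of TF-IDF weighting followed by cosine similarity, written with the Python standard library only. It is not taken from the repository: the function names (tokenize, tfidf_vectors, cosine_similarity), the whitespace tokenizer, and the smoothed IDF formula are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed smoothed IDF: log((1 + N) / (1 + df)) + 1.
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply",
]
vecs = tfidf_vectors(docs)
print(cosine_similarity(vecs[0], vecs[1]))  # shared terms, so greater than zero
print(cosine_similarity(vecs[0], vecs[2]))  # no shared terms, so 0.0
```

A lexical representation like this only rewards exact term overlap. The embedding methods the list covers aim to capture semantic similarity between documents that share few or no terms.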

Quick Start & Requirements

This is a curated list, not a runnable application. No installation or execution commands are provided.

Highlighted Details

  • Covers a wide range of similarity dimensions: lexical, structural, and semantic.
  • Details numerous document representation methods, including traditional and modern embedding techniques (e.g., TF-IDF, Word2Vec, BERT, SPECTER).
  • Lists various similarity and distance measures (e.g., Cosine Similarity, Jaccard, Levenshtein, Word Mover's Distance).
  • Includes resources on Siamese Networks for learning sentence similarity.
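Two of the lexical measures named above are simple enough to sketch without a library. The following is an illustrative implementation, not code from the repository; the function names and the choice of whitespace tokens for Jaccard are assumptions.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity of the token sets of two strings: |A & B| / |A | B|."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # convention assumed here: two empty documents are identical
    return len(set_a & set_b) / len(set_a | set_b)


def levenshtein_distance(s, t):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(s) < len(t):
        s, t = t, s  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


print(jaccard_similarity("a b c", "b c d"))       # 0.5
print(levenshtein_distance("kitten", "sitting"))  # 3
```

Jaccard ignores word order and frequency, while Levenshtein is sensitive to character-level edits, so the two suit different tasks (for example, near-duplicate detection versus typo-tolerant matching).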

Maintenance & Community

The repository is community-driven, encouraging contributions via pull requests. Links to related "Awesome" lists and specific GitHub repositories for implementations are provided.

Licensing & Compatibility

The repository itself is a list of links and does not have a specific license. The linked resources may have various licenses.

Limitations & Caveats

As a curated list, it does not offer direct functionality or code to run. The effectiveness of specific methods depends on the underlying research papers and implementations linked.

Health Check
Last commit: 3 years ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0
Star History
6 stars in the last 90 days
