awesome-document-similarity by malteos

Curated list of resources on document similarity measures

Created 6 years ago

256 stars

Top 98.5% on SourcePulse

Project Summary

This repository is a curated list of resources on document similarity measures, targeting students and researchers in Natural Language Processing (NLP) and Information Retrieval (IR). It provides a comprehensive overview of papers, tutorials, and code for applications like recommender systems, clustering, and plagiarism detection, with a focus on long-form and rich content documents.

How It Works

The repository categorizes document similarity into lexical, structural, and semantic dimensions. It details various document representation techniques, from traditional Bag-of-Words and TF-IDF to dense embeddings like Word2Vec, GloVe, BERT, and Sentence Transformers. Similarity is typically computed by comparing vector representations with measures such as cosine similarity or Euclidean distance, while string-level measures such as edit distance operate directly on the text.
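To make the vector-based approach concrete, here is a minimal sketch of TF-IDF weighting followed by cosine similarity, using only the Python standard library. It is not taken from the repository. The whitespace tokenizer, the smoothed IDF formula, and the function names (`tokenize`, `tfidf_vectors`, `cosine_similarity`) are assumptions chosen for illustration, and the linked libraries implement these steps with more care.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, no stemming or stop words.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: count * idf[t] for t, count in tf.items()})
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply today",
]
vecs = tfidf_vectors(docs)
sim_related = cosine_similarity(vecs[0], vecs[1])    # shares several terms
sim_unrelated = cosine_similarity(vecs[0], vecs[2])  # shares no terms
```

Because the first two documents share several terms and the third shares none with the first, `sim_related` comes out higher than `sim_unrelated`. Dense embeddings follow the same pattern, with a learned vector replacing the TF-IDF dict.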

Quick Start & Requirements

This is a curated list, not a runnable application. No installation or execution commands are provided.

Highlighted Details

Covers a wide range of similarity dimensions: lexical, structural, and semantic.
Details numerous document representation methods, including traditional and modern embedding techniques (e.g., TF-IDF, Word2Vec, BERT, SPECTER).
Lists various similarity and distance measures (e.g., Cosine Similarity, Jaccard, Levenshtein, Word Mover's Distance).
Includes resources on Siamese Networks for learning sentence similarity.
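Two of the lexical measures named above are simple enough to sketch directly. The following is an illustrative, standard-library-only sketch rather than code from any linked resource. The function names and the whitespace tokenization used for Jaccard are assumptions.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity over word sets: |A ∩ B| / |A ∪ B|."""
    # Assumption: lowercase whitespace tokenization.
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # Convention chosen here: two empty texts count as identical.
    return len(set_a & set_b) / len(set_a | set_b)


def levenshtein_distance(s, t):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # Dynamic programming over two rows of the edit-distance table.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]
```

Jaccard ignores word order and frequency, while Levenshtein is sensitive to every character, so the two often disagree on long documents. That gap is one reason the list also covers semantic measures.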

Maintenance & Community

The repository is community-driven, encouraging contributions via pull requests. Links to related "Awesome" lists and specific GitHub repositories for implementations are provided.

Licensing & Compatibility

The repository itself is a list of links and does not have a specific license. The linked resources may have various licenses.

Limitations & Caveats

As a curated list, it offers no runnable code of its own. How well any given method performs depends on the linked research papers and implementations.

Explore Similar Projects

Luotuo-Text-Embedding by LC1332

awesome-metric-learning by qdrant

Semantic-Retrieval-Models by caiyinqiong

awesome-semantic-search by Agrover112

SearchPaperByEmbedding by gyj155

similarity-search-kit by ZachNagengast

similarities by shibing624

text_similarity by adsieg

Chinese-LangChain by yanqiangmiffy

typesense by typesense

PageIndex by VectifyAI

sentence-transformers by huggingface