sentence-transformers by UKPLab

Framework for text embeddings, retrieval, and reranking

created 6 years ago
17,229 stars

Top 2.7% on sourcepulse

Project Summary

This framework provides an easy method to compute state-of-the-art text embeddings and perform semantic search, similarity, and reranking. It is designed for researchers and developers working with Natural Language Processing (NLP) tasks, offering access to over 10,000 pre-trained models and the ability to train custom models.

How It Works

The library leverages Siamese and triplet network structures (Sentence-BERT) to generate fixed-size embeddings for sentences or paragraphs. This approach allows for efficient similarity calculations using cosine similarity on these embeddings. For improved relevance in search scenarios, it also incorporates Cross-Encoder models that directly score pairs of texts, yielding higher accuracy at a greater computational cost, since every pair must pass through the model.
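The bi-encoder side of this design reduces similarity to a matrix operation over precomputed vectors. A minimal NumPy sketch of that step, with toy 4-dimensional vectors standing in for real model embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between two sets of row vectors."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

# Toy "embeddings"; a real model would produce e.g. 384-dim vectors.
emb = np.array([
    [0.1, 0.9, 0.2, 0.0],   # "A man is eating food."
    [0.1, 0.8, 0.3, 0.1],   # "A man is eating a piece of bread."
    [0.9, 0.0, 0.1, 0.4],   # "A cheetah is running behind its prey."
])
sims = cosine_similarity(emb, emb)
# The diagonal is 1.0, and the two food sentences score higher
# with each other than either does with the cheetah sentence.
```

Because the embeddings are fixed-size and computed once per sentence, comparing a new query against a large corpus is a single matrix product rather than a model forward pass per pair.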

Quick Start & Requirements

  • Install with pip: pip install -U sentence-transformers or conda install -c conda-forge sentence-transformers.
  • Requires Python 3.9+, PyTorch 1.11.0+, and transformers v4.34.0+.
  • GPU/CUDA support requires PyTorch installation with matching CUDA version.
  • Official documentation: www.SBERT.net.

Highlighted Details

  • Access to over 10,000 pre-trained models from Hugging Face, including state-of-the-art models from the MTEB leaderboard.
  • Supports training and fine-tuning of custom embedding and reranker models.
  • Offers extensive training options: various transformer architectures (BERT, RoBERTa, etc.), multilingual and multi-task learning, and over 30 loss functions.
  • Enables diverse applications: semantic search, textual similarity, clustering, paraphrase mining, and more.
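Semantic search, the first of those applications, is at its core a nearest-neighbour lookup in embedding space. A hypothetical top-k retrieval sketch in plain NumPy (a real deployment would use the library's own search utilities or an approximate-nearest-neighbour index):

```python
import numpy as np

def top_k_search(query_emb: np.ndarray, corpus_emb: np.ndarray, k: int = 2):
    """Return indices and scores of the k corpus vectors most
    cosine-similar to the query vector."""
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarity per document
    order = np.argsort(-scores)[:k]     # best k, highest first
    return order, scores[order]

# Toy corpus of three pre-embedded documents.
corpus = np.array([
    [0.9, 0.1, 0.0],   # doc 0
    [0.1, 0.9, 0.1],   # doc 1
    [0.8, 0.2, 0.1],   # doc 2
])
query = np.array([1.0, 0.0, 0.0])
idx, top_scores = top_k_search(query, corpus, k=2)
# Docs 0 and 2 point in nearly the same direction as the query,
# so they are retrieved ahead of doc 1.
```

Clustering and paraphrase mining follow the same pattern: embed once, then operate on the vectors with standard geometric tools.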

Maintenance & Community

  • Maintained by Tom Aarsen at UKPLab, TU Darmstadt.
  • Encourages opening issues for problems or questions.
  • Citable publications are provided for the Sentence-BERT and multilingual models.

Licensing & Compatibility

  • The repository is published for research purposes.
  • Licensing details are not explicitly stated in the README, but usage of underlying models may vary.

Limitations & Caveats

The README states the repository contains "experimental software" and is published for "additional background details on the respective publication," suggesting it may not be intended for production use without further evaluation.

Health Check

  • Last commit: 3 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 25
  • Issues (30d): 36
  • Star History: 682 stars in the last 90 days
