text2vec by shibing624

Text embeddings tool for vectorizing text

created 5 years ago
4,812 stars

Top 10.5% on sourcepulse

Project Summary

This repository provides a toolkit for converting text into vector representations, targeting developers and researchers working with natural language processing tasks like semantic similarity and text matching. It offers implementations of various text embedding models, including Word2Vec, Sentence-BERT, and CoSENT, enabling users to efficiently represent and compare textual data.
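As a rough illustration of what representing and comparing text as vectors means, the sketch below averages toy word vectors into a sentence vector and compares two sentences by cosine similarity. The vector table, the function names, and the whitespace tokenizer are all hypothetical and written for this example; none of it is the repository's API, and real models use learned, much higher-dimensional vectors.

```python
import math

# Toy 3-dimensional word vectors. The values are invented for illustration
# and do not come from any trained model.
TOY_VECTORS = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "sleeps": [0.1, 0.9, 0.2],
    "naps": [0.2, 0.8, 0.3],
    "stock": [0.0, 0.1, 0.9],
    "falls": [0.1, 0.3, 0.8],
}


def sentence_vector(sentence):
    """Average the vectors of the words found in the toy table."""
    words = [w for w in sentence.lower().split() if w in TOY_VECTORS]
    if not words:
        raise ValueError("no known words in sentence")
    dim = len(next(iter(TOY_VECTORS.values())))
    return [sum(TOY_VECTORS[w][i] for w in words) / len(words) for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


close = cosine_similarity(sentence_vector("cat sleeps"), sentence_vector("kitten naps"))
far = cosine_similarity(sentence_vector("cat sleeps"), sentence_vector("stock falls"))
print(close > far)  # the related pair scores higher than the unrelated pair
```

Averaging discards word order, which is one reason sentence-level models are usually preferred for matching tasks.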

How It Works

The library implements several text embedding strategies: Word2Vec for word-level embeddings, which are averaged to form a sentence vector; Sentence-BERT (SBERT) for sentence embeddings trained with supervision; and CoSENT, which replaces SBERT's training objective with a ranking-based loss and is reported to converge faster and perform better. It also supports BGE (BAAI General Embedding) models, which are pre-trained and fine-tuned with contrastive learning.
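To make the idea of a ranking-based loss more concrete, here is a minimal pure-Python sketch of the formulation usually given for CoSENT: for every two sentence pairs where the first is labeled more similar than the second, the loss penalizes the second pair's cosine similarity exceeding the first's. The function name `cosent_loss` and the default `scale` of 20 are assumptions made for this example; this is not code from the repository, and a real training loop would compute the same quantity on tensors with automatic differentiation.

```python
import math


def cosent_loss(cos_sims, labels, scale=20.0):
    """Sketch of a CoSENT-style loss over a batch of sentence pairs.

    cos_sims[i] is the predicted cosine similarity of pair i and labels[i]
    is its gold similarity. The result is
        log(1 + sum(exp(scale * (cos_sims[j] - cos_sims[i]))))
    summed over all (i, j) with labels[i] > labels[j].
    """
    terms = [0.0]  # exp(0) = 1 supplies the "1 +" inside the log
    for i, label_i in enumerate(labels):
        for j, label_j in enumerate(labels):
            if label_i > label_j:
                terms.append(scale * (cos_sims[j] - cos_sims[i]))
    # Log-sum-exp with the maximum factored out for numerical stability.
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


# Pair 0 is labeled similar (1) and pair 1 dissimilar (0).
well_ordered = cosent_loss([0.9, 0.2], [1, 0])  # prediction agrees with labels
mis_ordered = cosent_loss([0.2, 0.9], [1, 0])   # prediction contradicts labels
print(well_ordered < mis_ordered)
```

Because only the relative order of similarities matters, the loss never needs a fixed similarity threshold, which is the property the ranking formulation is meant to provide.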

Quick Start & Requirements

Highlighted Details

  • Supports multiple embedding models: Word2Vec, SBERT, CoSENT, and BGE.
  • Offers pre-trained models for Chinese and multilingual text, with benchmark results provided.
  • Includes training scripts for fine-tuning models on custom datasets and supports multi-GPU training/inference.
  • Provides a CLI tool for batch text vectorization and deployment options via Jina (gRPC) or FastAPI (HTTP).

Maintenance & Community

  • Recent updates include multi-GPU inference, CLI tools, and new Chinese/multilingual matching models.
  • Contact: Email xuming624@qq.com or WeChat (xuming624).

Licensing & Compatibility

  • Licensed under the Apache License 2.0, which permits commercial use with attribution.

Limitations & Caveats

  • The project's code quality is described as "rough", and contributions, including unit tests, are invited.
  • Multi-GPU training and inference are supported, but configuration and performance may vary by setup.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 114 stars in the last 90 days

Explore Similar Projects

sentence-transformers by UKPLab

Framework for text embeddings, retrieval, and reranking

created 6 years ago
updated 3 days ago
17k stars

Top 0.2% on sourcepulse