text2vec by shibing624

Text embeddings tool for vectorizing text

Created 5 years ago
4,850 stars

Top 10.3% on SourcePulse

Project Summary

This repository provides a toolkit for converting text into vector representations, targeting developers and researchers working with natural language processing tasks like semantic similarity and text matching. It offers implementations of various text embedding models, including Word2Vec, Sentence-BERT, and CoSENT, enabling users to efficiently represent and compare textual data.
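Comparing two embedded texts usually comes down to a vector similarity score. The following is a minimal sketch of that comparison step in plain Python, not the repository's own code: the helper names `cosine_similarity` and `most_similar` and the toy three-dimensional vectors are assumptions made for illustration, and real embeddings from any of the models above would have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def most_similar(query_vec, candidates):
    """Return (label, score) for the candidate closest to query_vec.

    `candidates` maps a label to its embedding vector (hypothetical layout).
    """
    scored = ((label, cosine_similarity(query_vec, vec))
              for label, vec in candidates.items())
    return max(scored, key=lambda pair: pair[1])


# Toy vectors standing in for sentence embeddings (illustrative values only).
query = [1.0, 0.0, 1.0]
candidates = {
    "close": [0.9, 0.1, 1.1],
    "far": [-1.0, 1.0, 0.0],
}
best_label, best_score = most_similar(query, candidates)
```

Here `best_label` is `"close"`, because that vector points in nearly the same direction as the query while `"far"` points away from it.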

How It Works

The library implements several text embedding strategies. Word2Vec produces word-level embeddings, which are averaged to obtain a sentence vector. Sentence-BERT (SBERT) produces sentence embeddings through supervised training. CoSENT replaces the SBERT training objective with a ranking-based loss function, which is reported to converge faster and perform better. The library also supports BGE (BAAI General Embedding) models, which are pre-trained and fine-tuned using contrastive learning.
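Two of the ideas above are small enough to sketch directly. The block below is an illustration in plain Python, not the library's implementation: `mean_pool` averages word vectors into a sentence vector, and `cosent_loss` follows the ranking loss as it is commonly formulated for CoSENT, penalizing any pair whose cosine score is ranked below that of a pair with a lower label. The function names, the dictionary-based vocabulary, and the `scale` value of 20 are assumptions chosen for the example.

```python
import math


def mean_pool(tokens, word_vectors):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        raise ValueError("no token has a word vector")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosent_loss(cos_scores, labels, scale=20.0):
    """Ranking loss over sentence pairs (assumed CoSENT-style formulation).

    cos_scores[k] is the predicted cosine similarity of pair k and
    labels[k] its gold similarity. For every (i, j) with
    labels[i] > labels[j], the term exp(scale * (cos_j - cos_i)) is added,
    so the loss grows when a less similar pair is scored higher.
    """
    total = 0.0
    for i, label_i in enumerate(labels):
        for j, label_j in enumerate(labels):
            if label_i > label_j:
                total += math.exp(scale * (cos_scores[j] - cos_scores[i]))
    return math.log1p(total)  # log(1 + sum of penalty terms)


# Hypothetical two-dimensional word vectors, for illustration only.
toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
sentence_vec = mean_pool(["a", "b", "unknown"], toy_vectors)

# A correct ranking (similar pair scored higher) versus a reversed one.
loss_good = cosent_loss([0.9, 0.1], [1, 0])
loss_bad = cosent_loss([0.1, 0.9], [1, 0])
```

Because the loss depends only on the relative order of the cosine scores, `loss_good` is close to zero while `loss_bad` is large. That ordering-based behavior is what "ranking-based" refers to.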

Quick Start & Requirements

Highlighted Details

  • Supports multiple embedding models: Word2Vec, SBERT, CoSENT, and BGE.
  • Offers pre-trained models for Chinese and multilingual text, with benchmark results provided.
  • Includes training scripts for fine-tuning models on custom datasets and supports multi-GPU training/inference.
  • Provides a CLI tool for batch text vectorization and deployment options via Jina (gRPC) or FastAPI (HTTP).

Maintenance & Community

  • Recent updates include multi-GPU inference, CLI tools, and new Chinese/multilingual matching models.
  • Contact: Email xuming624@qq.com or WeChat (xuming624).

Licensing & Compatibility

  • Licensed under the Apache License 2.0, which permits commercial use with attribution.

Limitations & Caveats

  • The code quality is described as "rough", and contributions, including unit tests, are requested.
  • Multi-GPU training and inference are supported, but the required configuration and the resulting performance may vary by setup.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 1

Star History

  • 19 stars in the last 30 days
