text2vec by shibing624

Text embeddings tool for vectorizing text

Created 6 years ago

4,929 stars

Top 10.0% on SourcePulse

Project Summary

This repository provides a toolkit for converting text into vector representations, targeting developers and researchers working with natural language processing tasks like semantic similarity and text matching. It offers implementations of various text embedding models, including Word2Vec, Sentence-BERT, and CoSENT, enabling users to efficiently represent and compare textual data.
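To make "represent and compare" concrete, here is a minimal sketch of how two embedding vectors are typically compared with cosine similarity. It does not call text2vec: the toy 3-dimensional vectors, the labels, and the helper names `cosine_similarity` and `rank_candidates` are assumptions made for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen in this sketch for zero vectors
    return dot / (norm_a * norm_b)


def rank_candidates(query_vec, candidates):
    """Return (label, score) pairs sorted from most to least similar."""
    scored = [
        (label, cosine_similarity(query_vec, vec))
        for label, vec in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real sentence embeddings (hypothetical values).
query = [0.9, 0.1, 0.0]
candidates = {
    "close paraphrase": [0.8, 0.2, 0.1],
    "unrelated topic": [0.0, 0.1, 0.9],
}
ranking = rank_candidates(query, candidates)
```

In practice the vectors would come from one of the embedding models listed here, and the ranking step stays the same regardless of which model produced them.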

How It Works

The library implements several text embedding strategies: Word2Vec for word-level embeddings (averaged for sentences), Sentence-BERT (SBERT) for sentence embeddings using supervised training, and CoSENT, which improves upon SBERT with a ranking-based loss function for faster convergence and better performance. It also supports BGE (BAAI General Embedding) models, pre-trained and fine-tuned using contrastive learning.
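The ranking-based loss mentioned for CoSENT can be sketched as follows. This is an illustration of the commonly cited CoSENT formulation, log(1 + sum of exp(scale * (cos_low - cos_high))) over all pairs where one sentence pair has a higher similarity label than another; it is not taken from the repository's code. The function name `cosent_loss` and the default `scale=20.0` are assumptions.

```python
import math


def cosent_loss(cos_scores, labels, scale=20.0):
    """CoSENT-style ranking loss over one batch of sentence pairs.

    cos_scores[i] is the predicted cosine similarity of sentence pair i;
    labels[i] is its gold similarity label (higher means more similar).
    Every pair with a higher label should score above every pair with a
    lower label; violations are penalised exponentially.
    """
    if len(cos_scores) != len(labels):
        raise ValueError("cos_scores and labels must have the same length")
    terms = [0.0]  # the constant 1 inside log(1 + ...) is exp(0)
    for s_high, y_high in zip(cos_scores, labels):
        for s_low, y_low in zip(cos_scores, labels):
            if y_high > y_low:
                terms.append(scale * (s_low - s_high))
    # Numerically stable log-sum-exp over all collected terms.
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# A batch whose scores respect the labels yields a loss close to zero;
# a batch with the order reversed yields a large loss.
well_ordered = cosent_loss([0.9, 0.1], [1, 0])
mis_ordered = cosent_loss([0.1, 0.9], [1, 0])
```

Because the loss depends only on the relative order of the cosine scores, it optimizes the same quantity that is used at inference time to compare sentences.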

Quick Start & Requirements

Install: pip install -U text2vec. To install from source instead, clone the repository, then run pip install torch, pip install -r requirements.txt, and pip install --no-deps . from the repository root.
Prerequisites: PyTorch. GPU with CUDA is recommended for training and faster inference.
Demos: official demo at https://www.mulanai.com/product/short_text_sim/ and Hugging Face demo at https://huggingface.co/spaces/shibing624/text2vec
Docs: https://github.com/shibing624/text2vec#readme
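After installation, encoding sentences might look like the sketch below. The `SentenceModel` class, its `encode` method, and the model name `shibing624/text2vec-base-chinese` are given as recalled from the project's README and should be checked against the Docs link above. The wrapper `embed_sentences` is a hypothetical helper, not part of the library.

```python
# Assumed default; verify the model name against the project documentation.
DEFAULT_MODEL = "shibing624/text2vec-base-chinese"


def embed_sentences(sentences, model_name=DEFAULT_MODEL):
    """Encode a list of sentences into embedding vectors with text2vec.

    The import is deferred so this module can be loaded even where
    text2vec is not installed; the first call may download model weights.
    """
    from text2vec import SentenceModel  # requires: pip install -U text2vec

    model = SentenceModel(model_name)
    return model.encode(sentences)


# Example call (not executed here because it needs the package and weights):
# vectors = embed_sentences(["如何更换花呗绑定银行卡", "花呗更改绑定银行卡"])
```

The returned vectors can then be compared with any similarity measure, such as cosine similarity.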

Highlighted Details

Supports multiple embedding models: Word2Vec, SBERT, CoSENT, and BGE.
Offers pre-trained models for Chinese and multilingual text, with benchmark results provided.
Includes training scripts for fine-tuning models on custom datasets and supports multi-GPU training/inference.
Provides a CLI tool for batch text vectorization and deployment options via Jina (gRPC) or FastAPI (HTTP).

Maintenance & Community

Recent updates include multi-GPU inference, CLI tools, and new Chinese/multilingual matching models.
Contact: Email xuming624@qq.com or WeChat (xuming624).

Licensing & Compatibility

Licensed under the Apache License 2.0, which permits commercial use provided the license text and copyright notices are retained.

Limitations & Caveats

The code quality is described as "rough", and the project calls for contributions, including unit tests.
Multi-GPU training and inference are supported, but the required configuration and the resulting performance depend on the hardware setup.

Health Check

Last Commit: 1 month ago
Responsiveness: 1 day
Pull Requests (30d): 0
Issues (30d): 6

Star History

23 stars in the last 30 days

Explore Similar Projects

Luotuo-Text-Embedding by LC1332

Text embedding model distilled from OpenAI API

Created 2 years ago

Updated 2 years ago

Starred by Andrew Kane (author of pgvector).

text2text by artitw

Text2Text toolkit for language modeling tasks

Created 5 years ago

Updated 1 year ago

fancy-nlp by boat-group

NLP toolkit for rapid prototyping and deployment

Created 6 years ago

Updated 3 years ago

NLPGNN by kyzhouhzau

NLP/GNN toolbox for TensorFlow 2.0 implementing various models

Created 5 years ago

Updated 1 year ago

nlp-tutorial by shibing624

NLP tutorial with examples for various tasks, good for learning NLP and PyTorch

Created 4 years ago

Updated 3 years ago

Starred by Yaowei Zheng (author of LLaMA-Factory).

uniem by wangyuxinwhy

Unified embedding model for Chinese text

Created 2 years ago

Updated 2 years ago

Starred by Omar Sanseviero (DevRel at Google DeepMind).

pandallm by dandelionsllm

Open-source LLM project for Chinese language exploration

Created 2 years ago

Updated 2 years ago

nlp_notes by YangBin1729

NLP notes for ML/DL principles, examples, and model deployment

Created 6 years ago

Updated 5 years ago

text_similarity by adsieg

Resources for text similarity methods

Created 6 years ago

Updated 5 years ago

lightNLP by smilelight

NLP deep learning framework using PyTorch and Torchtext

Created 7 years ago

Updated 5 years ago

Chinese-XLNet by ymcui

Chinese XLNet pre-trained models for NLP tasks

Created 6 years ago

Updated 6 months ago

zero_nlp by yuanzhoulvpi2017

NLP solution for Chinese language models, data, training, and inference

Created 2 years ago

Updated 5 months ago
