Luotuo-Text-Embedding by LC1332

Text embedding model distilled from OpenAI API

Created 2 years ago

267 stars

Top 96.1% on SourcePulse

Project Summary

Luotuo Embedding (骆驼嵌入) is a text embedding model developed by LC1332 that aims to provide a competitive alternative to OpenAI's embedding API. It is a generative text embedding model distilled from that API, intended for researchers and developers working on downstream NLP tasks such as text visualization, retrieval, clustering, and content moderation.

How It Works

The model is distilled from OpenAI's API using a combination of three loss terms: Mean Squared Error (MSE) against OpenAI's features, Cross-Entropy (CSE) on similarity matrices with a diagonal ground truth, and KL divergence between OpenAI's relevance matrix and the model's own. Both BERT and GLM architectures are adapted. BERT models are augmented with a fully connected layer to reach 1536 dimensions. GLM models take the final-layer hidden vectors and pass them through a fully connected layer and then a BERT.
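The three loss terms described above can be sketched as follows. This is an illustrative NumPy forward pass only, not the project's training code. The function name `distillation_losses`, the `temperature` value, and the choice to build the cross-entropy similarity matrix from student rows against teacher rows are all assumptions; the summary does not say which vectors form that matrix or how the terms are weighted.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def distillation_losses(student, teacher, temperature=0.05):
    """Return (mse, cross_entropy, kl) for two (batch, dim) embedding arrays.

    `student` stands in for the distilled model's output and `teacher`
    for the OpenAI embeddings of the same texts (hypothetical names).
    """
    # 1) MSE between student and teacher features.
    mse = np.mean((student - teacher) ** 2)

    s = l2_normalize(student)
    t = l2_normalize(teacher)

    # 2) Cross-entropy on a similarity matrix whose ground truth is the
    #    diagonal: row i should match column i (assumed pairing).
    log_p = log_softmax(s @ t.T / temperature)
    cross_entropy = -np.mean(np.diag(log_p))

    # 3) KL divergence from the teacher's relevance distribution to the
    #    student's, computed row by row and averaged over the batch.
    log_p_teacher = log_softmax(t @ t.T / temperature)
    log_p_student = log_softmax(s @ s.T / temperature)
    kl = np.mean(
        np.sum(np.exp(log_p_teacher) * (log_p_teacher - log_p_student), axis=1)
    )
    return mse, cross_entropy, kl

# Toy usage with random vectors in place of real embeddings.
rng = np.random.default_rng(0)
teacher_vecs = rng.normal(size=(4, 8))
student_vecs = rng.normal(size=(4, 8))
mse, cross_entropy, kl = distillation_losses(student_vecs, teacher_vecs)
```

When the student reproduces the teacher exactly, the MSE and KL terms are both zero, while the cross-entropy term stays positive because off-diagonal similarities still receive some probability mass.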

Quick Start & Requirements

Install/Run: Colab notebooks are provided for quick experimentation.
Prerequisites: Python, Hugging Face Transformers, PyTorch. Dependencies may vary between model versions.
Resources: Model weights are hosted on Hugging Face.

Highlighted Details

Offers multiple model sizes: small (BERT 110M), medium (BERT 352M), and large (GLM-Encoder).
Demonstrates competitive performance against OpenAI's API, particularly in text relevance and fuzzy search tasks.
Includes visualization tools for text data, showcasing clustering and relevance patterns.
Training data includes CNewSum (234.5K news articles) processed with OpenAI's embedding API.
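For the relevance and fuzzy search use case mentioned above, a common way to use any text embedding is to rank candidates by cosine similarity to a query vector. The sketch below illustrates that general technique with NumPy; it does not load Luotuo weights, and `rank_by_cosine` and the toy 2-dimensional vectors are hypothetical stand-ins for real 1536-dimensional embeddings.

```python
import numpy as np

def rank_by_cosine(query_vec, doc_vecs, top_k=3):
    """Return up to top_k (index, score) pairs, most similar document first."""
    # Normalize so that the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    # Sort by descending similarity and keep the best top_k.
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]

# Toy vectors standing in for document and query embeddings.
doc_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query_vec = np.array([1.0, 0.1])
ranking = rank_by_cosine(query_vec, doc_vecs)
```

Here the query points almost along the first axis, so document 0 ranks first, document 2 second, and document 1 last.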

Maintenance & Community

The project is part of the larger Luotuo (骆驼) project. The README lists key contributors and acknowledges A100 compute donated by Dongwu Securities. Users are encouraged to support the project by starring the main repo.

Licensing & Compatibility

The README does not explicitly state a license, so the licensing terms should be confirmed before commercial use or closed-source linking.

Limitations & Caveats

The model is primarily trained on news data, which may limit performance on significantly different domains without further fine-tuning. The README indicates that some features and models (large model, certain training code) are still under development or planned for future release.
