Luotuo-Text-Embedding  by LC1332

Text embedding model distilled from OpenAI API

created 2 years ago
266 stars

Top 96.9% on sourcepulse

Project Summary

Luotuo Embedding (骆驼嵌入) is a generative text embedding model developed by LC1332 and distilled from OpenAI's embedding API, intended as a competitive alternative to that API. It targets researchers and developers working on downstream NLP tasks such as text visualization, retrieval, clustering, and content moderation.

How It Works

The model is distilled from OpenAI's embedding API using a combination of three loss functions: Mean Squared Error (MSE) against OpenAI's feature vectors, cross-entropy (CSE) on similarity matrices whose ground truth is the diagonal, and KL divergence between OpenAI's relevance matrix and the model's own. Both BERT and GLM architectures are adapted. BERT models add a fully connected layer to reach 1536 dimensions. GLM models pass their final-layer hidden vectors through a fully connected layer and then through a BERT.
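
The three loss terms described above can be sketched in plain Python. This is a minimal illustration, not the project's training code: the function names, the temperature of 0.05, the equal loss weights, and the choice of which similarity matrices feed the cross-entropy and KL terms are all assumptions made for the example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def similarity_matrix(rows, cols):
    """Cosine similarity of every vector in `rows` with every vector in `cols`."""
    rows = [normalize(r) for r in rows]
    cols = [normalize(c) for c in cols]
    return [[dot(r, c) for c in cols] for r in rows]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mse_loss(student, teacher):
    """Mean squared error between student and teacher feature vectors."""
    squared = [(s - t) ** 2 for sv, tv in zip(student, teacher) for s, t in zip(sv, tv)]
    return sum(squared) / len(squared)


def diagonal_cross_entropy(sim, temperature=0.05):
    """Cross-entropy where the correct class for row i is column i (the diagonal)."""
    losses = []
    for i, row in enumerate(sim):
        probs = softmax([s / temperature for s in row])
        losses.append(-math.log(probs[i]))
    return sum(losses) / len(losses)


def kl_divergence(teacher_sim, student_sim, temperature=0.05):
    """Mean row-wise KL(teacher || student) over softmax-normalized similarity rows."""
    total = 0.0
    for t_row, s_row in zip(teacher_sim, student_sim):
        p = softmax([t / temperature for t in t_row])
        q = softmax([s / temperature for s in s_row])
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return total / len(teacher_sim)


def distillation_loss(student, teacher, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms; the weights are an assumption."""
    w_mse, w_ce, w_kl = weights
    mse = mse_loss(student, teacher)
    # Assumption: student embedding i should match teacher embedding i.
    ce = diagonal_cross_entropy(similarity_matrix(student, teacher))
    # Assumption: the "relevance matrix" is the in-batch similarity matrix.
    kl = kl_divergence(similarity_matrix(teacher, teacher),
                       similarity_matrix(student, student))
    return w_mse * mse + w_ce * ce + w_kl * kl


# Toy 3-dimensional vectors standing in for 1536-dimensional embeddings.
teacher = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
student = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.0], [0.0, 0.3, 0.7]]
print(distillation_loss(student, teacher))
```

In a real training loop the same terms would be computed on batched tensors with automatic differentiation; the list-based version here only shows how the three objectives relate to one another.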

Quick Start & Requirements

  • Install/Run: Colab notebooks are provided for quick experimentation.
  • Prerequisites: Python, Hugging Face Transformers, PyTorch. Specific model versions may have varying dependencies.
  • Resources: Model weights are hosted on Hugging Face; the Colab notebooks allow immediate testing.
  • Links: Colab, Details

Highlighted Details

  • Offers multiple model sizes: small (BERT 110M), medium (BERT 352M), and large (GLM-Encoder).
  • Demonstrates competitive performance against OpenAI's API, particularly in text relevance and fuzzy search tasks.
  • Includes visualization tools for text data, showcasing clustering and relevance patterns.
  • Training data includes CNewSum (234.5K news articles) processed with OpenAI's embedding API.
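
The text relevance and fuzzy search use cases mentioned above typically come down to ranking documents by cosine similarity between embedding vectors. The sketch below assumes the embeddings have already been computed; the document IDs and the 3-dimensional vectors are made up for illustration, and `rank_by_relevance` is a hypothetical helper, not part of the project.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_relevance(query_vec, corpus):
    """Return (doc_id, score) pairs sorted from most to least similar to the query."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up embeddings; real ones would come from the embedding model.
corpus = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.7, 0.7, 0.0],
}
query = [0.9, 0.1, 0.0]

for doc_id, score in rank_by_relevance(query, corpus):
    print(f"{doc_id}: {score:.3f}")
```

For a large corpus, the pairwise loop would normally be replaced by a matrix product or an approximate nearest-neighbor index, but the ranking criterion stays the same.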

Maintenance & Community

The project is part of the larger Luotuo (骆驼) project. Key contributors are listed, with acknowledgments for donated A100 compute from Dongwu Securities. Community engagement is encouraged via starring the main repo.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. This requires further investigation for commercial use or closed-source linking.

Limitations & Caveats

The model is primarily trained on news data, which may limit performance on significantly different domains without further fine-tuning. The README indicates that some features and models (large model, certain training code) are still under development or planned for future release.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 1 star in the last 90 days
