Text embedding model distilled from OpenAI API
Luotuo Embedding (骆驼嵌入) is a generative text embedding model developed by LC1332, distilled from OpenAI's embedding API to serve as a competitive alternative to it. It targets researchers and developers working on downstream NLP tasks such as text visualization, retrieval, clustering, and content moderation.
How It Works
The model is distilled from OpenAI's embedding API using a combination of loss functions: Mean Squared Error (MSE) against OpenAI's feature vectors, a cross-entropy (CSE) loss on the in-batch similarity matrix with the diagonal as ground truth, and a KL divergence between OpenAI's relevance matrix and the model's own. Both BERT and GLM architectures are adapted: BERT models are augmented with a fully connected layer to reach 1536 dimensions, while GLM models pass their final-layer hidden vectors through a fully connected layer and then a BERT encoder.
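To make the objective concrete, the sketch below combines the three losses described above in PyTorch. The loss weights, temperature, and the exact construction of the similarity and relevance matrices are assumptions for illustration, not the authors' published values.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_emb, teacher_emb, temperature=0.05,
                      w_mse=1.0, w_ce=1.0, w_kl=1.0):
    """Sketch of the combined distillation objective.

    student_emb: (B, 1536) embeddings from the BERT/GLM student
    teacher_emb: (B, 1536) embeddings returned by the OpenAI API
    Weights and temperature are illustrative assumptions.
    """
    # 1) MSE between student features and the OpenAI (teacher) features.
    mse = F.mse_loss(student_emb, teacher_emb)

    # 2) Cross-entropy over the student-teacher similarity matrix,
    #    with the diagonal (matching pairs) as ground truth.
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    logits = s @ t.T / temperature                        # (B, B)
    labels = torch.arange(logits.size(0), device=logits.device)
    ce = F.cross_entropy(logits, labels)

    # 3) KL divergence between the teacher's relevance matrix and the
    #    student's own similarity distribution.
    teacher_rel = F.softmax(t @ t.T / temperature, dim=-1)
    student_rel = F.log_softmax(s @ s.T / temperature, dim=-1)
    kl = F.kl_div(student_rel, teacher_rel, reduction="batchmean")

    return w_mse * mse + w_ce * ce + w_kl * kl
```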
Quick Start & Requirements
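The summary does not reproduce the project's quick-start snippet; as an illustration only, the sketch below shows the generic pattern of computing sentence embeddings with Hugging Face transformers. The model id is a stock Chinese BERT placeholder, not the official Luotuo checkpoint, and the released weights may require trust_remote_code and custom loading arguments; consult the upstream README and Colab notebooks for the authoritative snippet.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder checkpoint: swap in the released Luotuo embedding weights
# (see the project README / Colab notebooks for the official model id).
MODEL_ID = "bert-base-chinese"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

texts = ["今天天气不错", "今天天气很好"]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state            # (B, T, H)

# Mean-pool over non-padding tokens to get one vector per sentence.
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between the two sentences.
sim = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(sim.item())
```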
Highlighted Details
Maintenance & Community
The project is part of the larger Luotuo (骆驼) project. Key contributors are listed, with acknowledgments for A100 compute donated by Dongwu Securities. Community engagement is encouraged by starring the main repo.
Licensing & Compatibility
The repository does not explicitly state a license in the provided README; this requires further investigation before commercial use or closed-source linking.
Limitations & Caveats
The model is primarily trained on news data, which may limit performance on significantly different domains without further fine-tuning. The README indicates that some features and models (the large model and parts of the training code) are still under development or planned for future release.