# Unified embedding model for Chinese text
This project provides tools and models for creating unified Chinese text embeddings, targeting researchers and developers working with Chinese NLP tasks. It offers training, fine-tuning, and evaluation code, with models and datasets available on HuggingFace, aiming to deliver the best general-purpose Chinese text embedding models.
## How It Works
The project leverages the `sentence-transformers` library for model compatibility and ease of use. It introduces the M3E (Moka Massive Mixed Embedding) model series, trained and fine-tuned for Chinese text. The `uniem` library includes a `FineTuner` class that simplifies adapting pre-trained models to custom datasets using fine-tuning techniques such as Prefix Tuning.
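As an illustration of the fine-tuning workflow, a minimal sketch might look like the following; the dataset choice, checkpoint name, and `run` arguments are assumptions made for this sketch rather than a verbatim excerpt from the project's documentation.

```python
from datasets import load_dataset
from uniem.finetuner import FineTuner

# A Chinese sentence-pair dataset (hypothetical choice for this sketch).
dataset = load_dataset('shibing624/nli_zh', 'STS-B')

# Adapt a pre-trained M3E checkpoint to the new data.
finetuner = FineTuner.from_pretrained('moka-ai/m3e-small', dataset=dataset)
finetuner.run(epochs=3)
```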
## Quick Start & Requirements

Install `sentence-transformers` and encode text directly:

```bash
pip install sentence-transformers
```

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("moka-ai/m3e-base")
embeddings = model.encode(['Hello World!', '你好,世界!'])
```
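As a quick check of the result, the two embeddings can be compared with cosine similarity; this snippet is an illustrative addition using the `util` helpers bundled with `sentence_transformers`, not part of the project's quick start.

```python
from sentence_transformers import util

# The two sentences are translations of each other, so similarity should be high.
score = util.cos_sim(embeddings[0], embeddings[1])
print(f"cosine similarity: {score.item():.4f}")
```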
To fine-tune with `uniem` itself, create an environment and install the package:

```bash
conda create -n uniem python=3.10
conda activate uniem
pip install uniem
```
## Highlighted Details

- M3E models outperform OpenAI's `text-embedding-ada-002` on Chinese text classification and retrieval tasks.
- `uniem` is compatible with `sentence_transformers` and `text2vec`, and can train GPT models using SGPT or Prefix Tuning (see the sketch below).
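For context on what Prefix Tuning does, the sketch below shows the general technique with the `peft` library; it is a stand-in illustration under assumed model names and hyperparameters, not `uniem`'s internal implementation.

```python
from transformers import AutoModel
from peft import PrefixTuningConfig, TaskType, get_peft_model

# Freeze the base GPT model and learn only a small set of virtual prefix tokens.
base_model = AutoModel.from_pretrained("gpt2")  # placeholder GPT checkpoint
config = PrefixTuningConfig(
    task_type=TaskType.FEATURE_EXTRACTION,  # embedding-style output
    num_virtual_tokens=20,                  # placeholder hyperparameter
)
peft_model = get_peft_model(base_model, config)
peft_model.print_trainable_parameters()  # only prefix parameters are trainable
```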
## Maintenance & Community

The project is actively developed, with recent updates in July 2023 introducing enhanced fine-tuning capabilities and the MTEB-zh benchmark. Contributions for adding datasets or models to MTEB-zh are welcomed via issues or pull requests.
## Licensing & Compatibility
Licensed under Apache-2.0, allowing for commercial use and integration into closed-source projects.
## Limitations & Caveats
The README notes that for the retrieval-ranking benchmark, only the top 10,000 articles from the T2Ranking dataset were used, due to the cost and time constraints of evaluating with OpenAI's API. The `FineTuner` API has minor breaking changes starting from version 0.2.0.