uniem by wangyuxinwhy

Unified embedding model for Chinese text

created 2 years ago
865 stars

Top 41.4% on SourcePulse

Project Summary

This project provides tools and models for creating unified Chinese text embeddings, targeting researchers and developers working with Chinese NLP tasks. It offers training, fine-tuning, and evaluation code, with models and datasets available on HuggingFace, aiming to deliver the best general-purpose Chinese text embedding models.

How It Works

The project leverages the sentence-transformers library for model compatibility and ease of use. It introduces the M3E (Moka Massive Mixed Embedding) model series, which are trained and fine-tuned for Chinese text. The uniem library includes a FineTuner class that simplifies adapting pre-trained models to custom datasets using various fine-tuning techniques like Prefix Tuning.
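Embedding models of this kind are typically trained with a contrastive objective over in-batch negatives. The minimal pure-Python sketch below illustrates that objective only; the function name and temperature value are illustrative assumptions, not uniem's actual API. Each row of sim holds one anchor's similarity to every passage in the batch, with the matching positive on the diagonal:

```python
import math

def in_batch_contrastive_loss(sim, temperature=0.05):
    """Mean softmax cross-entropy where sim[i][i] is each anchor's positive."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]  # -log softmax probability of the positive
    return total / len(sim)

# Well-separated positives on the diagonal give a near-zero loss;
# swapped (wrong) positives give a large one.
good = [[1.0, 0.1], [0.1, 1.0]]
bad = [[0.1, 1.0], [1.0, 0.1]]
print(in_batch_contrastive_loss(good) < in_batch_contrastive_loss(bad))  # True
```

Lowering the temperature sharpens the softmax, which is why small values (here 0.05) are common in embedding training.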

Quick Start & Requirements

  • Install via pip: pip install sentence-transformers
  • To use M3E models:
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("moka-ai/m3e-base")
    embeddings = model.encode(['Hello World!', '你好,世界!'])
    
  • For local fine-tuning, create a conda environment (conda create -n uniem python=3.10) and install the library with pip install uniem.
  • Official quick-start and fine-tuning tutorials are linked within the README.
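Once texts are encoded, embeddings are usually compared by cosine similarity. This helper is not part of either library; it is a minimal sketch using toy vectors in place of real model.encode output:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for the output of model.encode(...).
emb_a = [0.1, 0.3, 0.5]
emb_b = [0.1, 0.29, 0.52]
print(cosine_similarity(emb_a, emb_b) > 0.99)  # True: near-duplicate vectors
```

In practice sentence-transformers returns NumPy arrays, so the same computation is usually done with vectorized dot products rather than Python loops.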

Highlighted Details

  • M3E models outperform OpenAI's text-embedding-ada-002 on Chinese text classification and retrieval tasks.
  • Introduces MTEB-zh, a standardized evaluation benchmark for Chinese embedding models, covering 6 model types and 9 datasets across text classification and retrieval.
  • FineTuner supports adapting models like M3E, sentence_transformers, and text2vec, and can train GPT models using SGPT or Prefix Tuning.
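Fine-tuning data for embedding models is commonly supplied as anchor/positive pairs. The record keys below ("text", "text_pos") and the validation helper are illustrative assumptions for this sketch, not a confirmed uniem schema:

```python
# Hypothetical anchor/positive pair records for embedding fine-tuning.
records = [
    {"text": "如何更换花呗绑定银行卡",  # "How do I change the bank card linked to Huabei?"
     "text_pos": "花呗更改绑定银行卡"},  # "Huabei: change linked bank card"
    {"text": "What is M3E?",
     "text_pos": "M3E is a Chinese text embedding model series."},
]

def is_pair_record(record):
    """Check that a record carries exactly the anchor and positive fields."""
    return set(record) == {"text", "text_pos"}

print(all(is_pair_record(r) for r in records))  # True
```

Triplet-style records (anchor, positive, hard negative) are the other common layout; a fine-tuner that accepts pairs typically falls back to in-batch negatives, as sketched earlier.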

Maintenance & Community

The project is actively developed, with recent updates in July 2023 introducing enhanced fine-tuning capabilities and the MTEB-zh benchmark. Contributions for adding datasets or models to MTEB-zh are welcomed via issues or pull requests.

Licensing & Compatibility

Licensed under Apache-2.0, allowing for commercial use and integration into closed-source projects.

Limitations & Caveats

The README notes that for retrieval ranking benchmarks, only the top 10,000 articles from the T2Ranking dataset were used, due to the cost and time of calling OpenAI's API. The FineTuner API also has minor breaking changes as of version 0.2.0.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 30 days
Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Didier Lopes (founder of OpenBB), and 17 more.

Explore Similar Projects

  • sentence-transformers by UKPLab — framework for text embeddings, retrieval, and reranking. 17k stars (top 0.3%); created 6 years ago, updated 1 week ago.