uniem by wangyuxinwhy

Unified embedding model for Chinese text

created 2 years ago
865 stars

Top 41.4% on SourcePulse

Project Summary

This project provides tools and models for creating unified Chinese text embeddings, targeting researchers and developers working with Chinese NLP tasks. It offers training, fine-tuning, and evaluation code, with models and datasets available on HuggingFace, aiming to deliver the best general-purpose Chinese text embedding models.

How It Works

The project leverages the sentence-transformers library for model compatibility and ease of use. It introduces the M3E (Moka Massive Mixed Embedding) model series, which are trained and fine-tuned for Chinese text. The uniem library includes a FineTuner class that simplifies adapting pre-trained models to custom datasets using various fine-tuning techniques like Prefix Tuning.
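Embedding models of this kind are typically trained with a contrastive objective over in-batch negatives. The minimal pure-Python sketch below illustrates that objective only; the function name and temperature value are illustrative assumptions, not uniem's actual API. Each row of sim holds one anchor's similarity to every passage in the batch, with the matching positive on the diagonal:

```python
import math

def in_batch_contrastive_loss(sim, temperature=0.05):
    """Mean softmax cross-entropy where sim[i][i] is each anchor's positive."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]  # -log softmax probability of the positive
    return total / len(sim)

# Well-separated positives on the diagonal give a near-zero loss;
# swapped (wrong) positives give a large one.
good = [[1.0, 0.1], [0.1, 1.0]]
bad = [[0.1, 1.0], [1.0, 0.1]]
print(in_batch_contrastive_loss(good) < in_batch_contrastive_loss(bad))  # True
```

Lowering the temperature sharpens the softmax, which is why small values (here 0.05) are common in embedding training.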

Quick Start & Requirements

  • Install via pip: pip install sentence-transformers
  • To use M3E models:
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("moka-ai/m3e-base")
    embeddings = model.encode(['Hello World!', '你好,世界!'])
    
  • For local fine-tuning, create a conda environment (conda create -n uniem python=3.10) and install the library with pip install uniem.
  • Official quick-start and fine-tuning tutorials are linked within the README.
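Once texts are encoded, embeddings are usually compared by cosine similarity. This helper is not part of either library; it is a minimal sketch using toy vectors in place of real model.encode output:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for the output of model.encode(...).
emb_a = [0.1, 0.3, 0.5]
emb_b = [0.1, 0.29, 0.52]
print(cosine_similarity(emb_a, emb_b) > 0.99)  # True: near-duplicate vectors
```

In practice sentence-transformers returns NumPy arrays, so the same computation is usually done with vectorized dot products rather than Python loops.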

Highlighted Details

  • M3E models outperform OpenAI's text-embedding-ada-002 on Chinese text classification and retrieval tasks.
  • Introduces MTEB-zh, a standardized evaluation benchmark for Chinese embedding models, covering 6 model types and 9 datasets across text classification and retrieval.
  • FineTuner supports adapting models like M3E, sentence_transformers, and text2vec, and can train GPT models using SGPT or Prefix Tuning.
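Fine-tuning data for embedding models is commonly supplied as anchor/positive pairs. The record keys below ("text", "text_pos") and the validation helper are illustrative assumptions for this sketch, not a confirmed uniem schema:

```python
# Hypothetical anchor/positive pair records for embedding fine-tuning.
records = [
    {"text": "如何更换花呗绑定银行卡",  # "How do I change the bank card linked to Huabei?"
     "text_pos": "花呗更改绑定银行卡"},  # "Huabei: change linked bank card"
    {"text": "What is M3E?",
     "text_pos": "M3E is a Chinese text embedding model series."},
]

def is_pair_record(record):
    """Check that a record carries exactly the anchor and positive fields."""
    return set(record) == {"text", "text_pos"}

print(all(is_pair_record(r) for r in records))  # True
```

Triplet-style records (anchor, positive, hard negative) are the other common layout; a fine-tuner that accepts pairs typically falls back to in-batch negatives, as sketched earlier.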

Maintenance & Community

The project is actively developed, with recent updates in July 2023 introducing enhanced fine-tuning capabilities and the MTEB-zh benchmark. Contributions for adding datasets or models to MTEB-zh are welcomed via issues or pull requests.

Licensing & Compatibility

Licensed under Apache-2.0, allowing for commercial use and integration into closed-source projects.

Limitations & Caveats

The README notes that for retrieval ranking benchmarks, only the top 10,000 articles from the T2Ranking dataset were used, due to the cost and time of calling OpenAI's API. The FineTuner API also has minor breaking changes as of version 0.2.0.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 30 days
Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Didier Lopes (founder of OpenBB), and 17 more.

Explore Similar Projects

  • sentence-transformers by UKPLab — framework for text embeddings, retrieval, and reranking. 17k stars (top 0.3%); created 6 years ago, updated 1 week ago.