BERT-flow  by bohanli

TensorFlow code for sentence embeddings research paper

created 4 years ago
532 stars

Project Summary

This repository provides a TensorFlow implementation of the EMNLP 2020 paper "On the Sentence Embeddings from Pre-trained Language Models." It implements a method for improving sentence embeddings derived from pre-trained language models such as BERT, and is aimed at NLP researchers and practitioners who need better semantic representations of sentences. The key benefit is performance on sentence similarity tasks that the paper reports as state of the art at the time of publication.

How It Works

The project implements a flow-based generative model to refine sentence embeddings. The core idea is to learn an invertible transformation (the "flow") that maps BERT's raw sentence representations to a more semantically meaningful latent space, improving performance on tasks like semantic textual similarity (STS). The flow itself is learned without supervision; BERT can optionally be fine-tuned with Natural Language Inference (NLI) supervision before the flow is trained.
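The idea of learning a transformation of embeddings by maximum likelihood can be illustrated with a toy example. The sketch below is not the repository's model: it swaps in the simplest possible flow, an elementwise affine map with a standard Gaussian latent prior, and every function name is hypothetical. It shows the change-of-variables formula log p(u) = log p_z(f(u)) + log|det df/du| that flow-based models optimize, assuming embeddings are plain lists of floats.

```python
import math


def standard_normal_logpdf(z):
    """Log-density of a standard Gaussian, summed over dimensions."""
    return sum(-0.5 * (x * x + math.log(2.0 * math.pi)) for x in z)


def affine_flow_forward(u, shift, log_scale):
    """Map an embedding u to a latent z = (u - shift) * exp(-log_scale).

    Returns z and log|det dz/du|. For an elementwise map the Jacobian is
    diagonal, so the log-determinant is -sum(log_scale).
    """
    z = [(x - b) * math.exp(-s) for x, b, s in zip(u, shift, log_scale)]
    return z, -sum(log_scale)


def affine_flow_inverse(z, shift, log_scale):
    """Invert the flow: recover u from z."""
    return [x * math.exp(s) + b for x, b, s in zip(z, shift, log_scale)]


def log_likelihood(u, shift, log_scale):
    """Change of variables: log p(u) = log p_z(f(u)) + log|det df/du|."""
    z, log_det = affine_flow_forward(u, shift, log_scale)
    return standard_normal_logpdf(z) + log_det


def fit_affine_flow(embeddings):
    """Maximum-likelihood fit for this toy flow (per-dimension mean and std).

    A real flow has no closed-form solution and is trained by gradient
    ascent on the same log-likelihood.
    """
    n, dim = len(embeddings), len(embeddings[0])
    shift = [sum(e[d] for e in embeddings) / n for d in range(dim)]
    log_scale = []
    for d in range(dim):
        var = sum((e[d] - shift[d]) ** 2 for e in embeddings) / n
        log_scale.append(0.5 * math.log(var + 1e-12))  # epsilon avoids log(0)
    return shift, log_scale


# Toy "sentence embeddings" whose dimensions have very different scales.
toy_embeddings = [[1.0, 10.0], [2.0, 14.0], [3.0, 18.0]]
shift, log_scale = fit_affine_flow(toy_embeddings)
latents = [affine_flow_forward(u, shift, log_scale)[0] for u in toy_embeddings]
```

After fitting, each latent dimension has zero mean and unit variance, so no single dimension dominates a similarity computed in the latent space. More expressive flows pursue the same goal with transformations that also mix dimensions.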

Quick Start & Requirements

  • Install: Clone the repository and set environment variables that point to the model and data directories.
  • Prerequisites: Python >= 3.6 and TensorFlow >= 1.14 (TensorFlow 1.x only). Pre-trained BERT models (base and large) and the GLUE benchmark datasets (specifically STS-B) must be downloaded manually.
  • Setup: Download time depends on network speed and on the size of the model and dataset files.
  • Links: BERT Models, GLUE Benchmark, SentEval.

Highlighted Details

  • Achieves Spearman's rho of 81.18 on STS-B using BERT-large-NLI-flow (trained on target data).
  • Supports fine-tuning BERT with NLI supervision for improved embeddings.
  • Enables unsupervised learning of flow-based generative models for sentence embeddings.
  • Provides scripts for both training and prediction/evaluation.
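The STS-B figure above is a Spearman rank correlation between model similarity scores and human similarity ratings. The sketch below shows one common way to compute such a score, assuming cosine similarity between the two sentence embeddings of each pair; the function names and toy data are hypothetical and are not taken from the repository's evaluation scripts.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: (embedding_a, embedding_b) pairs and made-up gold ratings.
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),
    ([1.0, 0.0], [0.5, 0.5]),
    ([1.0, 0.0], [0.0, 1.0]),
]
gold_scores = [4.8, 2.5, 0.2]
predicted = [cosine(a, b) for a, b in pairs]
rho = spearman(predicted, gold_scores)
```

Because Spearman's rho depends only on rank order, it rewards a model for ordering sentence pairs the way annotators do, regardless of the absolute scale of the similarity scores. Scores such as 81.18 are presumably the coefficient multiplied by 100.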

Maintenance & Community

The project is associated with authors from CMU, and the README provides contact information for questions. It mentions no community channels (such as Discord or Slack) and no roadmap.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. However, it acknowledges borrowing heavily from projects like google-research/bert, zihangdai/xlnet, and tensorflow/tensor2tensor, which have varying licenses. Compatibility for commercial use or closed-source linking would require clarification of the specific license applied to this codebase.

Limitations & Caveats

The implementation is specific to TensorFlow 1.x. The README does not detail support for newer TensorFlow versions or other frameworks like PyTorch. The setup involves manual downloading of large pre-trained models and datasets.

Health Check

  • Last commit: 4 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 0 stars in the last 90 days
