vec2text by vec2text

Utilities for decoding deep representations (sentence embeddings) back to text

created 2 years ago
913 stars

Top 40.7% on sourcepulse

Project Summary

This library provides utilities for text embedding inversion, enabling the reconstruction of text sequences from their vector representations. It's designed for researchers and practitioners working with text embeddings who need to recover or approximate the original text, offering capabilities for both using pre-trained models and training custom inversion architectures.

How It Works

The core approach trains Transformer-based sequence-to-sequence models to map embeddings back to text. Inversion proceeds in two stages: a "hypothesizer" model generates an initial text guess from the target embedding, then a "corrector" model iteratively refines that guess by re-embedding the intermediate text and adjusting it so its embedding moves closer to the target. This iterative refinement, optionally combined with sequence-level beam search, improves both accuracy and coherence.
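As a conceptual illustration only (not the library's actual code), the refinement loop can be sketched as below, with embed, hypothesize, and correct standing in for the trained embedder and seq2seq models:

```python
import numpy as np

def invert_embedding(target_emb, embed, hypothesize, correct,
                     num_steps=20, tol=1e-3):
    """Toy hypothesize-then-correct loop (stand-in for vec2text's internals).

    embed:       text -> vector, the frozen embedder being inverted
    hypothesize: vector -> text, stage 1: initial guess from the embedding
    correct:     (text, guess_emb, target_emb) -> text, stage 2: refinement
    """
    text = hypothesize(target_emb)        # stage 1: initial hypothesis
    for _ in range(num_steps):            # stage 2: iterative correction
        guess_emb = embed(text)           # re-embed the current guess
        if np.linalg.norm(guess_emb - target_emb) < tol:
            break                         # embedding already matches the target
        text = correct(text, guess_emb, target_emb)
    return text
```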

Quick Start & Requirements

  • Install via pip: pip install vec2text
  • Requires nltk for tokenization (nltk.download('punkt')).
  • Development requires pre-commit for code quality.
  • A GPU is recommended for inference and effectively required for training; CUDA support is implied for GPU use, though not listed as an explicit requirement.
  • A Colab demo is available (linked from the repository README); a minimal usage sketch follows this list.
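Putting the steps above together, a minimal round-trip looks roughly like the following. The sketch follows the API shown in the project README (load_pretrained_corrector, invert_strings); exact names and defaults may differ across versions, and inverting ada-002 strings calls the OpenAI API, so an API key must be configured:

```python
import nltk
import vec2text

nltk.download("punkt")  # one-time tokenizer-data download

# Load a pre-trained corrector for a supported embedding model.
# Embedding the input strings below uses the OpenAI API, so
# OPENAI_API_KEY must be set in the environment.
corrector = vec2text.load_pretrained_corrector("text-embedding-ada-002")

# Round-trip: embed the input strings, then reconstruct them from the
# embeddings; num_steps controls the corrector's refinement iterations.
recovered = vec2text.invert_strings(
    ["Jack Morris is a PhD student at Cornell Tech in New York City"],
    corrector=corrector,
    num_steps=20,
)
print(recovered)
```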

Highlighted Details

  • Supports inversion for OpenAI's text-embedding-ada-002 embeddings and for GTR (gtr-base) embeddings.
  • Enables training custom inversion models using datasets like MSMARCO and "one-million-instructions".
  • Includes functionality for embedding interpolation to explore semantic spaces (see the sketch after this list).
  • Provides utilities for evaluating model performance and uploading trained models to Hugging Face.
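To illustrate the interpolation idea, one can blend two embeddings linearly and invert intermediate points. This is a hand-rolled sketch, not the repository's own interpolation utility; e1 and e2 are placeholders for real GTR-base embeddings, and invert_embeddings follows the README's usage:

```python
import torch
import vec2text

corrector = vec2text.load_pretrained_corrector("gtr-base")

# Placeholders: in practice e1 and e2 would be real GTR-base embeddings
# of two sentences (shape: batch x embedding_dim, here 768 for gtr-base).
e1 = torch.randn(1, 768)
e2 = torch.randn(1, 768)

for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    blended = (1 - t) * e1 + t * e2   # linear interpolation in embedding space
    texts = vec2text.invert_embeddings(
        embeddings=blended,            # move to GPU with .cuda() if available
        corrector=corrector,
        num_steps=20,
    )
    print(f"t={t:.2f} -> {texts[0]}")
```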

Maintenance & Community

The project accompanies research by John X. Morris, Volodymyr Kuleshov, Vitaly Shmatikov, and Alexander M. Rush. Further community engagement details (e.g., Discord/Slack) are not explicitly mentioned in the README.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Training custom models, especially on large datasets like MSMARCO, can be computationally intensive and require significant disk space (e.g., 54 GB for full-precision ada-2 embeddings). The README notes that sequence-level beam search can consume substantial GPU memory if sequence_beam_width is set too high.
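The beam width is set per call, so memory can be kept bounded by choosing it conservatively. A hedged sketch following the README's invert_strings example, where sequence_beam_width and num_steps are the documented parameter names:

```python
import vec2text

corrector = vec2text.load_pretrained_corrector("text-embedding-ada-002")

# Wider beams improve reconstruction quality, but GPU memory grows with
# the beam width, so start small and raise with care.
recovered = vec2text.invert_strings(
    ["It was the best of times, it was the worst of times"],
    corrector=corrector,
    num_steps=20,
    sequence_beam_width=4,  # README example value
)
```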

Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 3
  • Issues (30d): 1
  • Star History: 119 stars in the last 90 days

