Utilities for decoding deep representations (sentence embeddings) back to text
This library provides utilities for text embedding inversion: reconstructing text sequences from their vector representations. It is aimed at researchers and practitioners who need to recover or approximate the original text behind an embedding, and it supports both off-the-shelf pre-trained inversion models and training custom inversion architectures.
How It Works
The core approach trains sequence-to-sequence models, typically Transformer-based, to map embeddings back to text. Inversion proceeds in two stages: a "hypothesizer" model generates an initial text guess from an embedding, and a "corrector" model then iteratively refines that guess by re-embedding the intermediate text and adjusting it to better match the target embedding. This iterative refinement, optionally combined with beam search, aims to improve accuracy and coherence.
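As a rough illustration of that loop, here is a minimal sketch. The embed, hypothesize, and correct functions below are placeholders standing in for the real embedder, hypothesizer model, and corrector model; they are not vec2text's API.

```python
# Conceptual sketch of the hypothesize-then-correct loop. All three helper
# functions are placeholders, not vec2text's actual classes.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder embedder: hashes characters into a fixed-size unit vector.
    vec = np.zeros(16)
    for i, ch in enumerate(text):
        vec[i % 16] += ord(ch)
    return vec / (np.linalg.norm(vec) + 1e-9)

def hypothesize(target_emb: np.ndarray) -> str:
    # Stage 1: a seq2seq model generates an initial guess from the embedding.
    return "initial guess"

def correct(guess: str, guess_emb: np.ndarray, target_emb: np.ndarray) -> str:
    # Stage 2: a corrector model conditions on the current guess, its
    # embedding, and the target embedding to produce a refined guess.
    return guess

def invert(target_emb: np.ndarray, num_steps: int = 5) -> str:
    text = hypothesize(target_emb)
    for _ in range(num_steps):
        # Re-embed the intermediate text and nudge it toward the target.
        text = correct(text, embed(text), target_emb)
    return text
```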
Quick Start & Requirements
pip install vec2text
Additional requirements: nltk for tokenization (run nltk.download('punkt')) and pre-commit for code quality.
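A minimal end-to-end call, mirroring the example in the upstream README (load_pretrained_corrector and invert_strings). Note that the text-embedding-ada-002 corrector embeds strings via the OpenAI API, so an API key is required:

```python
import vec2text

# Load a pre-trained corrector for OpenAI's text-embedding-ada-002.
# Requires OPENAI_API_KEY, since the input strings are embedded remotely.
corrector = vec2text.load_pretrained_corrector("text-embedding-ada-002")

# Embed the strings and invert them back to text in one call.
recovered = vec2text.invert_strings(
    ["Jack Morris is a PhD student at Cornell Tech in New York City"],
    corrector=corrector,
)
print(recovered)
```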
Highlighted Details
Supports inversion of text-embedding-ada-002 and gtr-base embeddings.
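For the ada-002 path, embeddings can be fetched with the standard OpenAI client; the helper below is a hypothetical convenience, not part of vec2text itself:

```python
# Hypothetical helper for producing ada-002 embeddings to invert.
# Uses the standard OpenAI v1 client; vec2text does not require this exact code.
import torch
from openai import OpenAI

def get_ada_embeddings(texts: list[str]) -> torch.Tensor:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.embeddings.create(
        input=texts,
        model="text-embedding-ada-002",
    )
    return torch.tensor([item.embedding for item in response.data])
```

The resulting tensor can be passed to vec2text.invert_embeddings along with the matching corrector, as sketched under Limitations & Caveats below.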
Maintenance & Community
The project is associated with research from John X. Morris, Volodymyr Kuleshov, Vitaly Shmatikov, and Alexander M. Rush. Community channels (e.g., Discord or Slack) are not explicitly mentioned in the README.
Licensing & Compatibility
The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
Training custom models, especially on large datasets like MSMARCO, can be computationally intensive and can require significant disk space (e.g., 54 GB for full-precision ada-2 embeddings). The README notes that sequence-level beam search can consume substantial GPU memory if sequence_beam_width is set too high.
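A conservative beam width keeps memory in check. The sketch below uses the invert_embeddings entry point with the num_steps and sequence_beam_width arguments shown in the README; the random embeddings are placeholders for real ada-002 vectors:

```python
import torch
import vec2text

corrector = vec2text.load_pretrained_corrector("text-embedding-ada-002")

# Placeholder embeddings; in practice, pass real ada-002 vectors (1536-dim),
# e.g., from the helper sketched above.
embeddings = torch.randn(2, 1536)

recovered = vec2text.invert_embeddings(
    embeddings=embeddings,
    corrector=corrector,
    num_steps=20,           # number of correction iterations
    sequence_beam_width=4,  # modest beam; GPU memory grows with this value
)
```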