vec2text by vec2text

Utilities for decoding deep representations (sentence embeddings) back to text

created 2 years ago
913 stars

Top 40.7% on sourcepulse

Project Summary

This library provides utilities for text embedding inversion, enabling the reconstruction of text sequences from their vector representations. It's designed for researchers and practitioners working with text embeddings who need to recover or approximate the original text, offering capabilities for both using pre-trained models and training custom inversion architectures.

How It Works

The core approach trains Transformer-based sequence-to-sequence models to map embeddings back to text. Inversion proceeds in two stages: a "hypothesizer" model generates an initial text guess from the target embedding, then a "corrector" model iteratively refines that guess by re-embedding the intermediate text and adjusting it so its embedding moves closer to the target. This iterative refinement, optionally combined with sequence-level beam search, improves both accuracy and coherence.
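As a conceptual illustration only (not the library's actual code), the refinement loop can be sketched as below, with embed, hypothesize, and correct standing in for the trained embedder and seq2seq models:

```python
import numpy as np

def invert_embedding(target_emb, embed, hypothesize, correct,
                     num_steps=20, tol=1e-3):
    """Toy hypothesize-then-correct loop (stand-in for vec2text's internals).

    embed:       text -> vector, the frozen embedder being inverted
    hypothesize: vector -> text, stage 1: initial guess from the embedding
    correct:     (text, guess_emb, target_emb) -> text, stage 2: refinement
    """
    text = hypothesize(target_emb)        # stage 1: initial hypothesis
    for _ in range(num_steps):            # stage 2: iterative correction
        guess_emb = embed(text)           # re-embed the current guess
        if np.linalg.norm(guess_emb - target_emb) < tol:
            break                         # embedding already matches the target
        text = correct(text, guess_emb, target_emb)
    return text
```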

Quick Start & Requirements

  • Install via pip: pip install vec2text
  • Requires nltk for tokenization (nltk.download('punkt')).
  • Development requires pre-commit for code quality.
  • A GPU is recommended for inference and effectively required for training; CUDA support is implied for GPU use, though not listed as an explicit requirement.
  • A Colab demo is available (linked from the repository README); a minimal usage sketch follows this list.
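Putting the steps above together, a minimal round-trip looks roughly like the following. The sketch follows the API shown in the project README (load_pretrained_corrector, invert_strings); exact names and defaults may differ across versions, and inverting ada-002 strings calls the OpenAI API, so an API key must be configured:

```python
import nltk
import vec2text

nltk.download("punkt")  # one-time tokenizer-data download

# Load a pre-trained corrector for a supported embedding model.
# Embedding the input strings below uses the OpenAI API, so
# OPENAI_API_KEY must be set in the environment.
corrector = vec2text.load_pretrained_corrector("text-embedding-ada-002")

# Round-trip: embed the input strings, then reconstruct them from the
# embeddings; num_steps controls the corrector's refinement iterations.
recovered = vec2text.invert_strings(
    ["Jack Morris is a PhD student at Cornell Tech in New York City"],
    corrector=corrector,
    num_steps=20,
)
print(recovered)
```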

Highlighted Details

  • Supports inversion for OpenAI's text-embedding-ada-002 embeddings and for GTR (gtr-base) embeddings.
  • Enables training custom inversion models using datasets like MSMARCO and "one-million-instructions".
  • Includes functionality for embedding interpolation to explore semantic spaces (see the sketch after this list).
  • Provides utilities for evaluating model performance and uploading trained models to Hugging Face.
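To illustrate the interpolation idea, one can blend two embeddings linearly and invert intermediate points. This is a hand-rolled sketch, not the repository's own interpolation utility; e1 and e2 are placeholders for real GTR-base embeddings, and invert_embeddings follows the README's usage:

```python
import torch
import vec2text

corrector = vec2text.load_pretrained_corrector("gtr-base")

# Placeholders: in practice e1 and e2 would be real GTR-base embeddings
# of two sentences (shape: batch x embedding_dim, here 768 for gtr-base).
e1 = torch.randn(1, 768)
e2 = torch.randn(1, 768)

for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    blended = (1 - t) * e1 + t * e2   # linear interpolation in embedding space
    texts = vec2text.invert_embeddings(
        embeddings=blended,            # move to GPU with .cuda() if available
        corrector=corrector,
        num_steps=20,
    )
    print(f"t={t:.2f} -> {texts[0]}")
```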

Maintenance & Community

The project accompanies research by John X. Morris, Volodymyr Kuleshov, Vitaly Shmatikov, and Alexander M. Rush. Further community engagement details (e.g., Discord/Slack) are not explicitly mentioned in the README.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Training custom models, especially on large datasets like MSMARCO, can be computationally intensive and require significant disk space (e.g., 54 GB for full-precision ada-2 embeddings). The README notes that sequence-level beam search can consume substantial GPU memory if sequence_beam_width is set too high.
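The beam width is set per call, so memory can be kept bounded by choosing it conservatively. A hedged sketch following the README's invert_strings example, where sequence_beam_width and num_steps are the documented parameter names:

```python
import vec2text

corrector = vec2text.load_pretrained_corrector("text-embedding-ada-002")

# Wider beams improve reconstruction quality, but GPU memory grows with
# the beam width, so start small and raise with care.
recovered = vec2text.invert_strings(
    ["It was the best of times, it was the worst of times"],
    corrector=corrector,
    num_steps=20,
    sequence_beam_width=4,  # README example value
)
```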

Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 3
  • Issues (30d): 1
  • Star History: 119 stars in the last 90 days

