gpt-2-keyword-generation by minimaxir

Text encoder for GPT-2 keyword-based text generation

created 6 years ago
260 stars

Top 98.2% on sourcepulse

Project Summary

This repository provides a method for encoding text documents so that GPT-2 can generate text conditioned on specific keywords. It is aimed at researchers and developers working with large language models who want to steer generation with keyword input. The primary benefit is more relevant and specific generated text.

How It Works

The core approach involves an unsupervised keyword extraction process using spaCy, focusing on nouns, verbs, adjectives, and adverbs. Keywords are then normalized and shuffled to prevent GPT-2 from learning positional cues. Data augmentation is achieved by creating multiple random keyword combinations per document, with a subset of keywords randomly selected and shuffled for each training instance. This process aims to enhance GPT-2's ability to generate contextually relevant text based on the provided keywords.
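The selection, normalization, and shuffling steps described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: it assumes the tokens have already been tagged (for example by spaCy, whose tokens expose `lemma_` and `pos_`), and the function names `extract_keywords` and `augment`, along with the `repeat` and `max_keywords` parameters, are hypothetical.

```python
import random

# POS tags treated as keyword-bearing (Universal POS names, as reported by spaCy's token.pos_).
KEYWORD_POS = {"NOUN", "VERB", "ADJ", "ADV"}


def extract_keywords(tagged_tokens):
    """Keep content words, normalize them to lowercase, and drop duplicates in order."""
    keywords = []
    for lemma, pos in tagged_tokens:
        norm = lemma.lower()
        if pos in KEYWORD_POS and norm not in keywords:
            keywords.append(norm)
    return keywords


def augment(keywords, repeat=3, max_keywords=3, seed=None):
    """Build `repeat` keyword sets, each a random subset in random order."""
    if not keywords:
        return []
    rng = random.Random(seed)
    variants = []
    for _ in range(repeat):
        k = rng.randint(1, min(max_keywords, len(keywords)))
        # random.sample picks k distinct items and returns them in random order,
        # so the subset is effectively shuffled as well.
        variants.append(rng.sample(keywords, k))
    return variants


# Hypothetical pre-tagged input: (lemma, POS) pairs for one short sentence.
tagged = [("the", "DET"), ("quick", "ADJ"), ("dog", "NOUN"),
          ("quickly", "ADV"), ("chase", "VERB"), ("a", "DET"), ("cat", "NOUN")]
keywords = extract_keywords(tagged)
variants = augment(keywords, repeat=3, max_keywords=3, seed=0)
```

Each entry in `variants` would be paired with the same document to form a separate training instance, so the model sees the same text under several keyword sets and orderings.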

Quick Start & Requirements

  • Install via pip install -r requirements.txt.
  • Requires Python 3.6+ and spaCy.
  • Example usage is available in the example/ folder.

Highlighted Details

  • Leverages spaCy for robust, POS-aware keyword tokenization.
  • Utilizes ray for parallelized encoding, offering significant speedups on large datasets.
  • Supports hierarchical conditioning with category, keywords, title, and body scopes.
  • Keywords are not guaranteed to appear in the generated text 100% of the time.
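To make the hierarchical conditioning concrete, here is a small sketch of how category, keywords, title, and body could be packed into one training string. The section markers, the start/end tokens, and the `encode_document` function are placeholders invented for illustration; the repository's actual delimiters and API may differ.

```python
# Hypothetical section markers; the repository's actual delimiters may differ.
MARKERS = {
    "category": "<|cat|>",
    "keywords": "<|kw|>",
    "title": "<|title|>",
    "body": "<|body|>",
}
ORDER = ("category", "keywords", "title", "body")
# Assumed document boundary tokens, also for illustration only.
START, END = "<|startoftext|>", "<|endoftext|>"


def encode_document(category=None, keywords=None, title=None, body=None):
    """Concatenate the provided sections in a fixed order, skipping empty ones."""
    fields = {
        "category": category,
        "keywords": " ".join(keywords) if keywords else None,
        "title": title,
        "body": body,
    }
    parts = [MARKERS[name] + fields[name] for name in ORDER if fields[name]]
    return START + "".join(parts) + END


encoded = encode_document(category="news", keywords=["cat", "dog"], title="Hello")
```

Because every section is optional, a prompt at generation time can stop after any marker (for example after the keywords) and let the model continue with the remaining sections.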

Maintenance & Community

  • Maintained by Max Woolf (@minimaxir).
  • Project is supported via Patreon.

Licensing & Compatibility

  • MIT License.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

The input text plus its keywords must fit within GPT-2's 1023-token limit. The effectiveness of manually supplied keywords may vary, and a balanced category distribution is recommended to prevent sampling bias.
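A length check before training can catch documents that exceed the limit. The sketch below is only an approximation: `count_tokens` uses a whitespace split as a stand-in, whereas GPT-2's byte-pair encoding usually produces more tokens than words, so a real check should plug in the actual tokenizer through the `counter` argument. Both function names are hypothetical.

```python
MAX_TOKENS = 1023  # token limit quoted in the text


def count_tokens(text):
    """Crude stand-in for a tokenizer: counts whitespace-separated words.

    GPT-2's BPE tokenizer typically yields more tokens than this,
    so treat the result as a lower bound.
    """
    return len(text.split())


def fits_budget(encoded_text, limit=MAX_TOKENS, counter=count_tokens):
    """Return True if the encoded document stays within the token limit."""
    return counter(encoded_text) <= limit


short_ok = fits_budget("cat dog a short story about pets")
too_long = fits_budget(" ".join(["word"] * 1024))
```

Documents that fail the check can be truncated or split before encoding.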

Health Check

  • Last commit: 4 years ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 0 stars in the last 90 days
