gpt-2-keyword-generation by minimaxir

Text encoder for GPT-2 keyword-based text generation

Created 6 years ago

260 stars

Top 97.7% on SourcePulse

Project Summary

This repository provides a method for encoding text documents to enable GPT-2 to generate text conditioned on specific keywords. It is designed for researchers and developers working with large language models who want to control text generation based on semantic input. The primary benefit is improved relevance and specificity in generated text.

How It Works

The core approach involves an unsupervised keyword extraction process using spaCy, focusing on nouns, verbs, adjectives, and adverbs. Keywords are then normalized and shuffled to prevent GPT-2 from learning positional cues. Data augmentation is achieved by creating multiple random keyword combinations per document, with a subset of keywords randomly selected and shuffled for each training instance. This process aims to enhance GPT-2's ability to generate contextually relevant text based on the provided keywords.
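The subset-and-shuffle augmentation described above can be sketched as follows. This is a minimal illustration, not the repository's code: it assumes the keywords were already extracted upstream (for example by spaCy), and the function name `augment_keywords` and its parameters `repeat`, `max_keywords`, and `seed` are hypothetical.

```python
import random


def augment_keywords(keywords, repeat=3, max_keywords=3, seed=None):
    """Return `repeat` randomly chosen, shuffled keyword subsets for one document.

    Hypothetical helper: names and defaults are assumptions for illustration.
    """
    rng = random.Random(seed)
    # Normalize: trim, lowercase, and drop duplicates while keeping first-seen order.
    normalized = list(
        dict.fromkeys(k.strip().lower() for k in keywords if k.strip())
    )
    variants = []
    for _ in range(repeat):
        if not normalized:
            variants.append([])
            continue
        # Choose how many keywords this training instance gets, then sample them.
        size = rng.randint(1, min(max_keywords, len(normalized)))
        subset = rng.sample(normalized, size)
        # Explicit shuffle so keyword order carries no positional signal.
        rng.shuffle(subset)
        variants.append(subset)
    return variants


if __name__ == "__main__":
    for variant in augment_keywords(["Cat", "runs", "quickly", "garden"], seed=0):
        print(variant)
```

Each returned list would be paired with the same document text, so one document yields several training instances with different keyword prompts.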

Quick Start & Requirements

Install dependencies with: pip install -r requirements.txt
Requires Python 3.6+ and spaCy.
Example usage is available in the example/ folder.

Highlighted Details

Leverages spaCy for robust, POS-aware keyword tokenization.
Utilizes ray for parallelized encoding, offering significant speedups on large datasets.
Supports hierarchical conditioning with category, keywords, title, and body scopes.
Keywords are not guaranteed to appear in the generated text.
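To illustrate the hierarchical conditioning listed above, here is a rough sketch of how one training example might be assembled from its category, keywords, title, and body scopes. The section markers (`[CAT]`, `[KEY]`, `[TITLE]`, `[BODY]`), the start and end tokens, and the function name `encode_example` are all placeholders assumed for this example; the repository's actual delimiter tokens may differ.

```python
# Hypothetical section markers; the repository's real delimiters may differ.
MARKERS = {
    "category": "[CAT]",
    "keywords": "[KEY]",
    "title": "[TITLE]",
    "body": "[BODY]",
}


def encode_example(title, body=None, category=None, keywords=None,
                   start="<|startoftext|>", end="<|endoftext|>"):
    """Join the available scopes into one string, broadest scope first.

    The start/end tokens are assumed defaults, not confirmed by the source.
    """
    parts = [start]
    if category:
        parts.append(MARKERS["category"] + category)
    if keywords:
        parts.append(MARKERS["keywords"] + " ".join(keywords))
    parts.append(MARKERS["title"] + title)
    if body:
        parts.append(MARKERS["body"] + body)
    parts.append(end)
    return "".join(parts)


if __name__ == "__main__":
    print(encode_example("Hello", body="World", category="news",
                         keywords=["a", "b"]))
```

Optional scopes are simply omitted when absent, so the same format covers title-only data as well as fully annotated documents.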

Maintenance & Community

Maintained by Max Woolf (@minimaxir).
Project is supported via Patreon.

Licensing & Compatibility

MIT License.
Compatible with commercial use and closed-source linking.

Limitations & Caveats

The input text plus its keywords must fit within GPT-2's 1023-token limit. The effectiveness of manually supplied keywords may vary, and a balanced category distribution is recommended to prevent sampling bias.
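A length check along these lines could filter out examples that exceed the limit before training. This is a sketch under stated assumptions: `fits_budget` and `rough_token_count` are hypothetical names, and the whitespace-based counter is only a stand-in for a real GPT-2 BPE tokenizer, whose counts are typically higher for English text.

```python
def rough_token_count(text):
    """Stand-in counter: whitespace-separated words, NOT real GPT-2 BPE tokens."""
    return len(text.split())


def fits_budget(encoded_text, count_tokens=rough_token_count, max_tokens=1023):
    """Return True if the encoded example is within the token budget.

    Pass a real tokenizer-backed `count_tokens` callable for accurate results.
    """
    return count_tokens(encoded_text) <= max_tokens


def filter_examples(encoded_examples, count_tokens=rough_token_count,
                    max_tokens=1023):
    """Keep only the examples that fit within the budget."""
    return [e for e in encoded_examples
            if fits_budget(e, count_tokens, max_tokens)]


if __name__ == "__main__":
    examples = ["short example", "word " * 2000]
    print(len(filter_examples(examples)))
```

Passing the counter in as a callable keeps the check independent of any particular tokenizer library.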

Health Check

Last Commit

4 years ago

Responsiveness

Inactive

Star History

0 stars in the last 30 days