Text encoder for GPT-2 keyword-based text generation
This repository provides a method for encoding text documents to enable GPT-2 to generate text conditioned on specific keywords. It is designed for researchers and developers working with large language models who want to control text generation based on semantic input. The primary benefit is improved relevance and specificity in generated text.
How It Works
The core approach involves an unsupervised keyword extraction process using spaCy, focusing on nouns, verbs, adjectives, and adverbs. Keywords are then normalized and shuffled to prevent GPT-2 from learning positional cues. Data augmentation is achieved by creating multiple random keyword combinations per document, with a subset of keywords randomly selected and shuffled for each training instance. This process aims to enhance GPT-2's ability to generate contextually relevant text based on the provided keywords.
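The sketch below illustrates this pipeline in Python. It is a minimal reading of the description above, not the repository's actual code: the function names, the POS filter, and the augmentation parameters (n_combos, max_keywords) are illustrative assumptions.

```python
import random
import spacy

# Assumes the small English model is installed:
# python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Content-word parts of speech kept as keyword candidates
KEEP_POS = {"NOUN", "VERB", "ADJ", "ADV"}

def extract_keywords(text):
    """Unsupervised extraction: keep normalized (lowercased, lemmatized)
    content-word tokens, deduplicated."""
    doc = nlp(text)
    return list({tok.lemma_.lower() for tok in doc
                 if tok.pos_ in KEEP_POS and not tok.is_stop})

def augment(keywords, n_combos=3, max_keywords=5):
    """Create several training instances per document: each instance is a
    random subset of the keywords, shuffled so GPT-2 cannot learn
    positional cues."""
    instances = []
    for _ in range(n_combos):
        subset = random.sample(keywords, min(max_keywords, len(keywords)))
        random.shuffle(subset)
        instances.append(subset)
    return instances

keywords = extract_keywords("The quick brown fox jumps over the lazy dog.")
print(augment(keywords))
```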
Quick Start & Requirements
pip install -r requirements.txt
See the example/ folder for usage examples.
Highlighted Details
Supports ray for parallelized encoding, offering significant speedups on large datasets (a sketch follows this list).
Supports category, keywords, title, and body scopes.
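The following is a hedged sketch of the ray parallelization pattern, not the repository's exact code: the batching scheme and the encode_batch name are assumptions, and the per-text transform is a placeholder for the spaCy keyword-extraction step shown earlier.

```python
import ray

ray.init()

@ray.remote
def encode_batch(texts):
    # Placeholder transform; the real pipeline would run spaCy keyword
    # extraction on each text here.
    return [text.lower().split() for text in texts]

documents = ["First document.", "Second document.",
             "Third document.", "Fourth document."]
n_workers = 2
batches = [documents[i::n_workers] for i in range(n_workers)]  # round-robin split
futures = [encode_batch.remote(batch) for batch in batches]    # runs in parallel
encoded = [item for batch in ray.get(futures) for item in batch]
ray.shutdown()
print(encoded)
```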
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The combined length of the input text plus keywords must remain within GPT-2's 1023-token context limit. The effectiveness of manually supplied keywords may vary, and a balanced category distribution in the training data is recommended to prevent sampling bias.
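A minimal pre-check for the token limit, assuming the Hugging Face transformers GPT-2 tokenizer (the repository may tokenize differently), could look like this:

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
MAX_TOKENS = 1023  # GPT-2 context budget cited above

def fits_context(keywords, body):
    """Return True if the keyword prefix plus document body fits the limit."""
    combined = " ".join(keywords) + " " + body
    return len(tokenizer.encode(combined)) <= MAX_TOKENS
```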