Semantic search tool for YouTube playlists
This project provides a semantic search engine for YouTube playlists, enabling users to find specific moments within videos using natural language queries. It's designed for podcast listeners and content creators who want to improve content discovery and access.
How It Works
The system leverages OpenAI's text-embedding-ada-002 model to generate 1536-dimensional embeddings for chunks of YouTube video transcripts. These embeddings capture semantic meaning beyond keywords. A hosted Pinecone vector database is used for efficient k-NN searches across these embeddings, allowing for high-accuracy retrieval of relevant video segments. Transcripts are obtained via HTML scraping, with a TODO to integrate Whisper for improved accuracy.
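The chunk-then-search flow described above can be sketched in a few lines of TypeScript. This is a minimal illustration, not the project's actual code: the type names, `chunkTranscript`, `cosineSimilarity`, and `topK` are all hypothetical, the character-based chunking rule is an assumption, and the brute-force cosine search stands in for the hosted Pinecone index. In the real pipeline the embedding vectors would come from text-embedding-ada-002 (1536 dimensions).

```typescript
// Hypothetical data model; the project's real types may differ.
interface TranscriptSegment { start: number; text: string; }
interface Chunk { videoId: string; start: number; text: string; }
interface IndexedChunk extends Chunk { embedding: number[]; }

// Group consecutive transcript segments into chunks of at most maxChars
// characters (assumed rule), keeping the start time of the first segment
// so a search hit can link to a specific moment in the video.
function chunkTranscript(
  videoId: string,
  segments: TranscriptSegment[],
  maxChars = 200
): Chunk[] {
  const chunks: Chunk[] = [];
  let current: Chunk | null = null;
  for (const seg of segments) {
    // Close the current chunk if adding this segment would exceed the limit.
    if (current && current.text.length + 1 + seg.text.length > maxChars) {
      chunks.push(current);
      current = null;
    }
    if (current) {
      current.text += " " + seg.text;
    } else {
      current = { videoId, start: seg.start, text: seg.text };
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector length mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Exact k-NN by brute force: score every chunk, sort, keep the best k.
// A vector database does the same job at scale without a full scan.
function topK(
  query: number[],
  index: IndexedChunk[],
  k: number
): { chunk: IndexedChunk; score: number }[] {
  return index
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

Each result carries the chunk's `videoId` and `start` time, which is what lets a query resolve to a moment inside a video rather than to the video as a whole.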
Quick Start & Requirements
npm install
npx tsx src/bin/resolve-yt-playlist.ts
npx tsx src/bin/process-yt-playlist.ts
npx tsx src/bin/query.ts
npx tsx src/bin/generate-thumbnails.ts (approx. 2 hours)
npm run dev
Highlighted Details
Uses OpenAI's text-embedding-ada-002 for deep semantic understanding.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The project relies on HTML scraping for YouTube transcripts, which may be fragile and can miss episodes that lack automated captions. A TODO item suggests using Whisper for more robust transcription. Thumbnail generation is resource-intensive and time-consuming.