yt-semantic-search by transitive-bullshit

Semantic search tool for YouTube playlists

Created 3 years ago

538 stars

Top 59.0% on SourcePulse

4 Experts Love This Project

Mckay Wrigley

Founder of Takeoff AI

Jeff Hammerbacher

Cofounder of Cloudera

Project Summary

This project provides a semantic search engine for YouTube playlists, enabling users to find specific moments within videos using natural language queries. It's designed for podcast listeners and content creators who want to improve content discovery and access.

How It Works

The system uses OpenAI's text-embedding-ada-002 model to generate a 1536-dimensional embedding for each chunk of a YouTube video transcript. These embeddings capture semantic meaning, so a query can match a passage even when the two share no keywords. A hosted Pinecone vector database runs k-nearest-neighbor (k-NN) searches over the embeddings to retrieve the most relevant video segments. Transcripts are obtained via HTML scraping; integrating Whisper for more accurate transcripts is listed as a TODO.
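To make the retrieval step concrete, here is a minimal sketch of what a k-NN lookup over embeddings does, written as a brute-force in-memory search. It is not Pinecone's API and not the project's code: the EmbeddedChunk shape, the function names, and the choice of cosine similarity as the metric are all assumptions made for illustration.

```typescript
// Hypothetical record shape: one embedded transcript chunk.
interface EmbeddedChunk {
  videoId: string;
  startSeconds: number; // where the chunk begins in the video
  text: string;
  vector: number[]; // 1536 numbers for text-embedding-ada-002
}

interface ScoredChunk {
  chunk: EmbeddedChunk;
  score: number;
}

// Cosine similarity: dot product divided by the product of the vector norms.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("dimension mismatch");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    return 0; // treat a zero vector as similar to nothing
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force k-NN: score every chunk against the query, keep the k best.
function topK(query: number[], chunks: EmbeddedChunk[], k: number): ScoredChunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

A hosted vector database replaces this linear scan with an index so that queries stay fast as the number of chunks grows, but the input (a query vector) and the output (the k closest chunks with scores) have the same shape.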

Quick Start & Requirements

Prerequisites: Node.js, OpenAI API key, Pinecone API key.
Install dependencies: npm install
Download transcripts: npx tsx src/bin/resolve-yt-playlist.ts
Process transcripts and create embeddings: npx tsx src/bin/process-yt-playlist.ts
Query the index: npx tsx src/bin/query.ts
Optional thumbnail generation: npx tsx src/bin/generate-thumbnails.ts (takes approx. 2 hours)
Frontend development server: npm run dev
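The processing step embeds chunks of each transcript rather than whole transcripts. How the project actually chunks is not described here, so the following is only a sketch of one plausible approach: greedily packing consecutive caption lines into chunks under a character limit while keeping the start time of the first line. The TranscriptSegment and TranscriptChunk shapes, the chunkTranscript name, and the 500-character default are all hypothetical.

```typescript
// Hypothetical shape of one caption line from a scraped transcript.
interface TranscriptSegment {
  startSeconds: number;
  text: string;
}

// A chunk keeps the start time of its first segment so a search hit
// can point at a specific moment in the video.
interface TranscriptChunk {
  startSeconds: number;
  text: string;
}

// Greedily pack consecutive segments into chunks of at most maxChars
// characters. A single segment longer than maxChars becomes its own
// (oversized) chunk rather than being split mid-sentence.
function chunkTranscript(
  segments: TranscriptSegment[],
  maxChars: number = 500
): TranscriptChunk[] {
  const chunks: TranscriptChunk[] = [];
  let current: TranscriptChunk | null = null;

  for (const segment of segments) {
    const text = segment.text.trim();
    if (text.length === 0) {
      continue; // skip empty caption lines
    }
    if (current !== null && current.text.length + 1 + text.length <= maxChars) {
      // Fits: append to the chunk being built, separated by a space.
      current.text += " " + text;
    } else {
      // Does not fit (or no chunk yet): close the old chunk, start a new one.
      if (current !== null) {
        chunks.push(current);
      }
      current = { startSeconds: segment.startSeconds, text };
    }
  }
  if (current !== null) {
    chunks.push(current);
  }
  return chunks;
}
```

Each resulting chunk's text would then be sent to the embedding model, and its startSeconds stored alongside the vector as metadata.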

Highlighted Details

Uses OpenAI's text-embedding-ada-002 to embed transcript chunks as 1536-dimensional vectors.
Employs Pinecone for efficient vector similarity search.
Includes optional timestamped thumbnail generation using Puppeteer and Google Cloud Storage.
Frontend built with Next.js and deployed on Vercel.

Maintenance & Community

Developed by Travis Fischer.
Project is not affiliated with the All-In Podcast.
Feedback can be provided via GitHub or Twitter.

Licensing & Compatibility

MIT License.
Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The project obtains YouTube transcripts by HTML scraping, which can be fragile and can miss episodes that lack automated captions. The repository lists Whisper-based transcription as a TODO to make this more robust. Thumbnail generation is slow (approximately 2 hours) and resource-intensive.

Health Check

Last Commit

2 years ago

Responsiveness

Inactive

Star History

1 star in the last 30 days