yt-semantic-search  by transitive-bullshit

Semantic search tool for YouTube playlists

created 2 years ago
531 stars

Top 60.4% on sourcepulse

Project Summary

This project provides a semantic search engine for YouTube playlists, letting users find specific moments within videos using natural language queries. It is aimed at podcast listeners and content creators who want to make long-form video content easier to search and navigate.

How It Works

The system uses OpenAI's text-embedding-ada-002 model to generate a 1536-dimensional embedding for each chunk of a YouTube video transcript. These embeddings capture semantic meaning rather than exact keywords. A hosted Pinecone vector database runs k-nearest-neighbor (k-NN) searches over the embeddings to retrieve the transcript segments closest in meaning to a query. Transcripts are obtained by scraping YouTube's HTML; integrating Whisper for more accurate transcription is listed as a TODO.
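The pipeline described above can be sketched in miniature. The following is a minimal illustration, not the project's actual code: the `Caption`, `Chunk`, and `IndexedChunk` types and the `chunkCaptions`, `cosineSimilarity`, and `topK` functions are hypothetical names, the character-based chunking rule is an assumption, and a brute-force cosine-similarity scan stands in for the hosted Pinecone index. Toy 3-dimensional vectors replace the real 1536-dimensional embeddings so that the example runs without API keys.

```typescript
// Hypothetical types for illustration; the project's real types may differ.
interface Caption { text: string; start: number } // start time in seconds
interface Chunk { text: string; start: number }
interface IndexedChunk extends Chunk { vector: number[] }

// Group consecutive captions into chunks of at most `maxChars` characters,
// keeping the start time of the first caption in each chunk.
// (Assumed chunking rule; a single caption longer than maxChars is kept whole.)
function chunkCaptions(captions: Caption[], maxChars: number): Chunk[] {
  const chunks: Chunk[] = []
  let current: Chunk | null = null
  for (const c of captions) {
    if (current !== null && current.text.length + 1 + c.text.length <= maxChars) {
      current.text += ' ' + c.text
    } else {
      if (current !== null) chunks.push(current)
      current = { text: c.text, start: c.start }
    }
  }
  if (current !== null) chunks.push(current)
  return chunks
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('dimension mismatch')
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  if (normA === 0 || normB === 0) return 0
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Exact k-NN by scanning every chunk; a vector database replaces this scan
// with an index so that search stays fast on large collections.
function topK(query: number[], items: IndexedChunk[], k: number) {
  return items
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
}

// Toy index: in practice each vector would come from the embedding model.
const demoIndex: IndexedChunk[] = [
  { text: 'interest rates and inflation', start: 0, vector: [0.9, 0.1, 0] },
  { text: 'weekend poker game', start: 42, vector: [0, 1, 0] },
  { text: 'markets and poker analogies', start: 95, vector: [0.7, 0.7, 0] }
]
const demoResults = topK([1, 0, 0], demoIndex, 2)
// Closest first: the chunks starting at 0s and 95s.
console.log(demoResults.map((r) => `${r.chunk.start}s: ${r.chunk.text}`))
```

Because each chunk keeps the start time of its first caption, a search hit can be mapped back to a moment in the video. Cosine similarity is used here as one common choice of metric; which metric the project's Pinecone index uses is not stated in this summary.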

Quick Start & Requirements

  • Prerequisites: Node.js, OpenAI API key, Pinecone API key.
  • Install dependencies: npm install
  • Download transcripts: npx tsx src/bin/resolve-yt-playlist.ts
  • Process transcripts and create embeddings: npx tsx src/bin/process-yt-playlist.ts
  • Query the index: npx tsx src/bin/query.ts
  • Optional thumbnail generation (takes approx. 2 hours): npx tsx src/bin/generate-thumbnails.ts
  • Frontend development server: npm run dev
  • More details: Screenshots

Highlighted Details

  • Uses OpenAI's text-embedding-ada-002 to embed transcript chunks for semantic matching.
  • Employs Pinecone for efficient vector similarity search.
  • Includes optional timestamped thumbnail generation using Puppeteer and Google Cloud Storage.
  • Frontend built with Next.js and deployed on Vercel.

Maintenance & Community

  • Developed by Travis Fischer.
  • Project is not affiliated with the All-In Podcast.
  • Feedback can be provided via GitHub or Twitter.

Licensing & Compatibility

  • MIT License.
  • Permissive license: commercial use and integration into closed-source projects are allowed, provided the copyright and license notice are retained.

Limitations & Caveats

The project relies on HTML scraping for YouTube transcripts, which is fragile and skips episodes that lack automated captions. A TODO item proposes using Whisper for more robust transcription. Thumbnail generation is resource-intensive and time-consuming.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 1 star in the last 90 days
