YouTube Semantic Search Project
Intro
The YouTube Semantic Search project is a tool for searching and discovering content within YouTube playlists, particularly podcasts. It uses OpenAI embeddings to build a semantic search index, so users can find specific moments from their favorite podcasts by meaning rather than by exact keywords. The demo focuses on the "All-In Podcast," but the tool can be adapted for any YouTube channel or playlist. The project addresses a common difficulty: searching inside long-form podcast content.
How to Get Started
To utilize this project, users need to follow these steps:
- Clone the repository to their local computer.
- Open the terminal and navigate to the root directory of the repository.
- Install necessary dependencies using the command
  npm install
- Download English transcripts of the targeted playlist with
  npx tsx src/bin/resolve-yt-playlist.ts
- Pre-process the transcripts and obtain embeddings from OpenAI for insertion into a Pinecone search index with the command
  npx tsx src/bin/process-yt-playlist.ts
- To query the Pinecone search index, run
  npx tsx src/bin/query.ts
- Optionally, users can generate thumbnails for each video by running
  npx tsx src/bin/generate-thumbnails.ts
  This step is more time-intensive, around two hours, and requires a stable internet connection.
- Start the development server for the frontend of the project built with Next.js by running
  npm run dev
Example Queries
Some practical and engaging queries might include:
- "sweater karen"
- "best advice for founders"
- "poker story from last night"
- "crypto scam ponzi scheme"
These examples demonstrate the varied types of content users can discover using the semantic search.
How It Works
The technology behind the YouTube Semantic Search includes several key components:
- OpenAI: The text-embedding-ada-002 model converts text into embeddings, allowing the search to match on meaning rather than on simple keywords.
- Pinecone: A hosted vector search service that performs efficient k-NN searches across the text embeddings.
- Vercel: Handles hosting and API functions.
- Next.js: Provides the React-based web application framework for the frontend.
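To make the k-NN search over text embeddings more concrete, here is a minimal brute-force sketch. It is not the project's code and not how Pinecone works internally: a hosted vector service uses its own indexing, and the source does not state which similarity metric the index uses, so cosine similarity is an assumption here. The `IndexedChunk` and `ScoredChunk` types and the `nearestChunks` function are hypothetical names chosen for illustration.

```typescript
// Hypothetical shape of an indexed transcript chunk (not the project's actual type).
interface IndexedChunk {
  id: string
  text: string
  embedding: number[]
}

// Hypothetical shape of a search hit.
interface ScoredChunk {
  id: string
  text: string
  score: number
}

// Cosine similarity between two vectors of equal length.
// Assumption: the source does not say which metric the Pinecone index uses.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`dimension mismatch: ${a.length} vs ${b.length}`)
  }
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  // A zero vector has no direction; treat its similarity as 0.
  if (normA === 0 || normB === 0) return 0
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Brute-force k-NN: score every chunk against the query embedding,
// sort by descending similarity, and keep the top k.
function nearestChunks(
  query: number[],
  chunks: IndexedChunk[],
  k: number
): ScoredChunk[] {
  return chunks
    .map((c) => ({
      id: c.id,
      text: c.text,
      score: cosineSimilarity(query, c.embedding)
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
}
```

In the real pipeline the query string would first be embedded with the same model as the transcripts, and the resulting vector would be sent to the vector index rather than compared in a loop like this.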
The process begins by fetching the videos in a playlist through the YouTube API. The demo playlist, the "All-In Podcast," had 108 videos at the time of writing. English transcripts are captured through HTML scraping, because the YouTube API does not support non-OAuth access for captions. Each transcript is divided into 100-token chunks, and an embedding is fetched from OpenAI for each chunk, resulting in about 200 embeddings per episode. These are stored in a Pinecone index with dimensionality set to 1536, for a total of approximately 17,575 embeddings across the entire playlist.
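The chunking step can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: `roughTokenize` and `chunkTranscript` are hypothetical names, and whitespace splitting stands in for the subword tokenizer that OpenAI models actually use to count tokens.

```typescript
// Stand-in tokenizer: splits on whitespace.
// Assumption for illustration only; OpenAI models count subword tokens, not words.
function roughTokenize(text: string): string[] {
  return text.split(/\s+/).filter((t) => t.length > 0)
}

// Groups a transcript into consecutive chunks of at most `chunkSize` tokens.
// The final chunk may be shorter than `chunkSize`.
function chunkTranscript(text: string, chunkSize = 100): string[] {
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new Error('chunkSize must be a positive integer')
  }
  const tokens = roughTokenize(text)
  const chunks: string[] = []
  for (let i = 0; i < tokens.length; i += chunkSize) {
    chunks.push(tokens.slice(i, i + chunkSize).join(' '))
  }
  return chunks
}
```

Each returned chunk would then be sent to the embedding model, yielding one 1536-dimensional vector per chunk. With 100-token chunks, about 200 embeddings per episode corresponds to roughly 20,000 tokens of transcript per episode.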
Screenshots
The web application supports both light and dark modes, and screenshots of each are provided as a preview.
TODO
Future enhancements for the project include:
- Implementing Whisper to improve transcript accuracy.
- Adding functionality to sort query results by recency in addition to relevancy.
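As a rough illustration of the second item, sorting by recency could be a post-processing step on the query results. The sketch below is hypothetical: the `SearchResult` shape, the `publishedAt` field, and the `sortResults` function are assumptions for illustration, and the source does not describe how the project represents results or publication dates.

```typescript
// Hypothetical result shape; the project's actual result type is not given in the source.
interface SearchResult {
  videoId: string
  score: number // similarity score, higher is more relevant
  publishedAt: string // ISO 8601 timestamp of the video (assumed field)
}

type SortMode = 'relevancy' | 'recency'

// Returns a sorted copy of the results without mutating the input.
// In 'recency' mode, newer videos come first and ties fall back to score.
function sortResults(results: SearchResult[], mode: SortMode): SearchResult[] {
  const sorted = [...results]
  if (mode === 'recency') {
    sorted.sort(
      (a, b) =>
        Date.parse(b.publishedAt) - Date.parse(a.publishedAt) ||
        b.score - a.score
    )
  } else {
    sorted.sort((a, b) => b.score - a.score)
  }
  return sorted
}
```

Sorting after retrieval keeps the vector query unchanged; the trade-off is that only the top matches already returned by the index are reordered.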
Feedback
The project creator welcomes feedback and ideas for improvement. Users can submit suggestions or report issues via the project's GitHub page or Twitter account.
Credit
The project draws inspiration from Riley Tomasek's work on enabling search for the Huberman YouTube Channel. It operates independently and is not affiliated with the All-In Podcast, although it processes data from their YouTube channel using AI.
License
This project, authored by Travis Fischer, is available under the MIT License. Users who find the project useful or interesting are encouraged to support it through GitHub sponsorship.