Search over YouTube subtitles
Couldn't figure out if this is already done or not.
We could download a YouTube video's subtitles, store them in the DB, and allow Memex to search over them.
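A minimal sketch of the store-and-search idea, assuming captions arrive as timed text cues. `CaptionStore` is a hypothetical in-memory stand-in for the real DB, and the sample cue text is invented for illustration; nothing here reflects Memex's actual storage layer.

```typescript
// One caption line: which video it belongs to, when it appears, what is said.
interface CaptionCue {
  videoId: string;
  start: number; // seconds from the start of the video
  text: string;
}

// Hypothetical in-memory stand-in for the extension's database.
class CaptionStore {
  private cues: CaptionCue[] = [];

  // Store every cue of a video, tagged with the video ID.
  add(videoId: string, cues: { start: number; text: string }[]): void {
    for (const cue of cues) {
      this.cues.push({ videoId, start: cue.start, text: cue.text });
    }
  }

  // Case-insensitive search: a cue matches when it contains every query term.
  search(query: string): CaptionCue[] {
    const terms = query.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
    if (terms.length === 0) return [];
    return this.cues.filter((cue) => {
      const haystack = cue.text.toLowerCase();
      return terms.every((term) => haystack.includes(term));
    });
  }
}

// Invented sample cues, for illustration only.
const store = new CaptionStore();
store.add("NFeRlR9dOwY", [
  { start: 0, text: "Welcome to the talk" },
  { start: 12.5, text: "Full-text search over subtitles" },
]);
const hits = store.search("search subtitles");
```

Because each hit keeps its `start` time, a search result could link straight to the moment in the video where the phrase is spoken.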
See also: https://github.com/open-source-ideas/open-source-ideas/issues/88
Hey @dufferzafar
That's a fantastic idea; I wonder how it could be done easily. Ideally those subtitles would already be available in the DOM when the subtitles section loads. Otherwise we would need to build a YouTube integration, which makes things a bit more difficult. It would also be cool to abstract this away nicely so that other video services could be attached more easily. I see from your thread that Algolia is working on indexing those subtitles, so integrating their search might be another option. Then we would need a way to weed out results that you visited. How is it with the YouTube API? Can you query it without being a user? If so, that would make things a lot easier, because we don't have to handle the Google account keys.
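One way the "abstract it away" idea could look, as a rough sketch: each video service sits behind a small interface, so adding a service means adding one object. The `VideoService` interface, the `resolveVideo` helper, and the URL patterns are all assumptions made for illustration, not existing Memex code.

```typescript
// Hypothetical interface: each video service recognises its own URLs.
interface VideoService {
  name: string;
  // Returns the video ID if the URL belongs to this service, otherwise null.
  extractVideoId(url: string): string | null;
}

const youtube: VideoService = {
  name: "youtube",
  extractVideoId(url: string): string | null {
    // Assumed pattern: standard watch URLs with a `v` query parameter.
    const match = /^https?:\/\/(?:www\.)?youtube\.com\/watch\?(?:[^#]*&)?v=([\w-]+)/.exec(url);
    return match?.[1] ?? null;
  },
};

const vimeo: VideoService = {
  name: "vimeo",
  extractVideoId(url: string): string | null {
    // Assumed pattern: numeric ID directly after the host.
    const match = /^https?:\/\/(?:www\.)?vimeo\.com\/(\d+)/.exec(url);
    return match?.[1] ?? null;
  },
};

// Ask each registered service in turn; the first one that recognises the URL wins.
function resolveVideo(
  url: string,
  services: VideoService[],
): { service: string; videoId: string } | null {
  for (const service of services) {
    const videoId = service.extractVideoId(url);
    if (videoId !== null) return { service: service.name, videoId };
  }
  return null;
}

const resolved = resolveVideo("https://www.youtube.com/watch?v=NFeRlR9dOwY", [youtube, vimeo]);
```

A caption-fetching method could be added to the same interface later; it is left out here because how captions are obtained differs per service.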
Just to note: Currently we don't have the manpower to build this unfortunately, as there are a couple of things we need to get done before we can work on integrations.
I have never created a browser extension, so I don't really know how this would work. But if I were building all of worldbrain/memex as a native application (where the browser extension does as little as possible and leaves the heavy work to a core), I would be free to use amazing tools like youtube-dl, which would handle the actual downloading of subs.
How is it with the youtube API? Can you query it without being a user? If so that would make things a lot easier, because we don't have to handle the google account keys.
You need to request a YouTube credentials file in order to use the API. It comes with pretty generous rate limits that you would never actually reach as an individual. Once those thresholds are hit (I don't remember the exact numbers), you have to pay to go above them, or wait until the next day for your quota to reset.
BUT, captions are not available through the API... For that, I had to use another weird, undocumented endpoint: http://www.youtube.com/get_video_info?video_id=NFeRlR9dOwY. The response contains a mix of JSON and XML, and buried in all that there is a link to the XML file containing the captions. So I guess that technically, you might not need an API key to get the captions if you already know the videoId.
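Once the caption XML has been located, turning it into searchable cues could look roughly like this. The sketch assumes the file is a flat list of `<text start="…" dur="…">` elements; the thread does not show the actual format, so treat that shape, the sample string, and the `parseCaptionXml` name as assumptions.

```typescript
interface Cue {
  start: number; // seconds
  dur: number;   // seconds
  text: string;
}

// Decode the handful of entities that commonly appear in caption text.
// `&amp;` is handled last so that already-escaped sequences are not decoded twice.
function decodeEntities(s: string): string {
  return s
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, "&");
}

// Regex-based parse; good enough for the flat structure assumed above,
// but not a general XML parser.
function parseCaptionXml(xml: string): Cue[] {
  const cues: Cue[] = [];
  const pattern = /<text start="([\d.]+)" dur="([\d.]+)"[^>]*>([\s\S]*?)<\/text>/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(xml)) !== null) {
    cues.push({
      start: parseFloat(match[1] ?? "0"),
      dur: parseFloat(match[2] ?? "0"),
      text: decodeEntities(match[3] ?? "").trim(),
    });
  }
  return cues;
}

// Invented sample in the assumed format.
const sampleXml =
  '<transcript><text start="0.5" dur="2.1">Hello &amp; welcome</text>' +
  '<text start="2.6" dur="3">to the demo</text></transcript>';
const parsedCues = parseCaptionXml(sampleXml);
```

Since the endpoint is undocumented, the format could change without notice, so a real integration would want to fail gracefully when no cues are found.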
I think the APIs and the libraries built around them have improved significantly since this feature request was opened. For example, this package can be used: https://www.npmjs.com/package/youtube-captions-scraper
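A sketch of how that package's output might be prepared for indexing. It assumes, based on the package's README, that `getSubtitles({ videoID, lang })` resolves to an array of `{ start, dur, text }` objects; whether the timing fields arrive as numbers or strings is not certain, so both are accepted. `toIndexedCues` and `IndexedCue` are hypothetical names.

```typescript
// Assumed caption shape; numeric fields may arrive as strings.
interface RawCaption {
  start: number | string;
  dur: number | string;
  text: string;
}

interface IndexedCue {
  videoId: string;
  startSeconds: number;
  endSeconds: number;
  text: string;
}

// Normalise raw captions into numeric, trimmed cues and drop unusable entries.
function toIndexedCues(videoId: string, captions: RawCaption[]): IndexedCue[] {
  return captions
    .map((c) => {
      const start = Number(c.start);
      const dur = Number(c.dur);
      return { videoId, startSeconds: start, endSeconds: start + dur, text: c.text.trim() };
    })
    .filter((cue) => cue.text.length > 0 && Number.isFinite(cue.startSeconds));
}

// Intended use (not executed here; needs the package and network access):
//   import { getSubtitles } from "youtube-captions-scraper";
//   const captions = await getSubtitles({ videoID: "NFeRlR9dOwY", lang: "en" });
//   const cues = toIndexedCues("NFeRlR9dOwY", captions);

// Invented sample input, for illustration only.
const normalized = toIndexedCues("NFeRlR9dOwY", [
  { start: "1.5", dur: "2", text: " hi there " },
  { start: 3.5, dur: 1, text: "" },
]);
```

Keeping the normalisation separate from the fetch means the same function would work if the captions came from a different source later.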
Super important feature in my view, as more and more media moves from written to spoken/visual