Data source: TED-talk transcripts
Since the content of TED talks is highly inspiring, I suggest using the transcripts TED provides for Open Assistant.
https://www.ted.com/search?cat=pages&q=TED+talks+with+transcript%28transcript%29
The transcripts are covered by this license: https://creativecommons.org/licenses/by-nc-nd/3.0/ I don't know whether it permits use for an LM.
TED Talks are monologues, so I imagine it would be hard to convert them to the question-answer format required to train a chat model.
E.g. "Hey Open Assistant, please write a funny lecture/presentation about procrastination."
Do you want to try converting them to dialog? Maybe come up with creative ways to parse the text and turn it into Q/A style?
I am very sorry to say that my coding abilities are not sufficient for that. I could try to think of ideas for the parsing, but not for its implementation. I could also try to do it manually, at least for some presentations. But may I ask: wouldn't it be sufficient to just use each presentation as the answer to a prompt asking for such a presentation? Additionally, an expert might first have to check whether using the transcripts as training data for an LM is covered by the license.
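The simpler idea raised above, using each talk as the answer to a prompt asking for such a talk, could be sketched roughly as follows. This is only an illustration under assumptions: the `Talk` record, its field names, the prompt wording, and the placeholder text are all hypothetical, and nothing here fetches or parses real TED data.

```python
from dataclasses import dataclass


@dataclass
class Talk:
    # Hypothetical record; these field names are assumptions,
    # not a format that TED actually provides.
    title: str
    transcript: str


def to_prompt_answer(talk: Talk) -> dict:
    """Turn one monologue into a single prompt/answer pair."""
    # The prompt template is an assumption made for illustration.
    prompt = f'Please write a TED-style talk titled "{talk.title}".'
    # Collapse line breaks and repeated spaces into single spaces.
    answer = " ".join(talk.transcript.split())
    return {"prompt": prompt, "answer": answer}


# Placeholder text, not a real transcript.
example = Talk("On procrastination", "First line of the talk.\n  Second line.")
pair = to_prompt_answer(example)
```

Each talk yields exactly one training pair this way, so no dialog parsing is needed. The trade-off is that every answer is a full monologue rather than a conversational reply.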
Here is the license: https://www.ted.com/about/our-organization/our-policies-terms/ted-talks-usage-policy
And here is the relevant part about transcripts:
"Transcripts and subtitles may be used under the same Creative Commons license in conjunction with the TED Talk video. Copyright on the transcripts is owned by TED and any edits, alternate usage rights or changes to these documents are not permitted without permission. Therefore, if you wanted to publish a TED Talk in a book, test, play, or any other publication, permission is required."
It seems quite closed and corporate.
I'll make a request.
Hi all, what is the status on this? Are we continuing with it? If not, I can close the issue. Thank you for looking into this :)
@ontocord I made a request through the TED inquiry form, but an answer is still pending.
Thank you!
Unfortunately, they want to charge for the use of the transcripts. Sorry that I did not say so earlier.