Alpaca 52K
Hey everyone,
This dataset might be useful for training OA: https://github.com/tatsu-lab/stanford_alpaca
However, it is legally a bit dubious, as the dataset was created using text-davinci-003.
OpenAI's terms state that their services may not be used to develop models that compete with OpenAI.
That said, right now it is publicly shared data that anyone can obtain without using OpenAI's API.
We could also generate a dataset using open-source LLMs, and then use that instead.
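For anyone curious what that pipeline looks like, here is a minimal sketch of the self-instruct-style prompting Alpaca used, just the prompt-construction step, with a placeholder template and seed tasks (the real seed set and template are in the Stanford Alpaca repo; everything below is illustrative). The resulting string is what you would feed to whatever open-source LLM you pick:

```python
# Illustrative seed instructions -- not the actual Alpaca seed tasks.
SEED_TASKS = [
    "Give three tips for staying healthy.",
    "Explain why the sky is blue.",
]

def build_generation_prompt(seed_tasks):
    """Format seed instructions into a prompt asking an LLM to continue
    the numbered list with new, diverse task instructions."""
    lines = ["Come up with a series of diverse task instructions.", ""]
    for i, task in enumerate(seed_tasks, start=1):
        lines.append(f"{i}. {task}")
    # Leave the next numbered slot open so the model continues the list.
    lines.append(f"{len(seed_tasks) + 1}.")
    return "\n".join(lines)

prompt = build_generation_prompt(SEED_TASKS)
print(prompt)
```

You would then sample completions from an open model (e.g. via a local LLaMA checkpoint), parse the generated instructions, and query the model again for each instruction to get responses.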
Well, I believe there currently aren't any publicly available models fine-tuned the way InstructGPT was, are there?
Though it could be nice to have Galactica or LLaMA integrated into the Open Assistant UI, so users who answer prompts only have to check whether the generated response is correct.
If possible this would be great. For example, has someone already tried what LLaMA 65B can generate with the right prompts?
There is a much improved, cleaned and curated version of the alpaca dataset here: https://github.com/gururise/AlpacaDataCleaned
It's funny, because I'm pretty sure most people on the OpenAssistant platform use ChatGPT to write the Assistant replies anyway.