Alpaca 52K
Hey everyone,
This dataset might be useful for training OA: https://github.com/tatsu-lab/stanford_alpaca
However, it is legally a bit dubious, as the dataset was created using text-davinci-003.
OpenAI's terms state that their services may not be used to develop models that compete with OpenAI.
That said, right now it is publicly shared data that anyone can obtain without using OpenAI's API.
We could also generate a dataset using open-source LLMs, and then use that instead.
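For anyone curious what that pipeline looks like, here is a minimal sketch of the self-instruct-style prompting Alpaca used, just the prompt-construction step, with a placeholder template and seed tasks (the real seed set and template are in the Stanford Alpaca repo; everything below is illustrative). The resulting string is what you would feed to whatever open-source LLM you pick:

```python
# Illustrative seed instructions -- not the actual Alpaca seed tasks.
SEED_TASKS = [
    "Give three tips for staying healthy.",
    "Explain why the sky is blue.",
]

def build_generation_prompt(seed_tasks):
    """Format seed instructions into a prompt asking an LLM to continue
    the numbered list with new, diverse task instructions."""
    lines = ["Come up with a series of diverse task instructions.", ""]
    for i, task in enumerate(seed_tasks, start=1):
        lines.append(f"{i}. {task}")
    # Leave the next numbered slot open so the model continues the list.
    lines.append(f"{len(seed_tasks) + 1}.")
    return "\n".join(lines)

prompt = build_generation_prompt(SEED_TASKS)
print(prompt)
```

You would then sample completions from an open model (e.g. via a local LLaMA checkpoint), parse the generated instructions, and query the model again for each instruction to get responses.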
Well, I believe there currently aren't any publicly available models fine-tuned the way InstructGPT was, are there?
Though it could be nice to have Galactica or LLaMA integrated into the Open Assistant UI, so users who answer prompts only have to check whether the generated response is correct.
If possible this would be great. For example, has someone already tried what LLaMA 65B can generate with the right prompts?
There is a much improved, cleaned and curated version of the alpaca dataset here: https://github.com/gururise/AlpacaDataCleaned
It's funny, because I'm pretty sure most people on the OpenAssistant platform use ChatGPT to write the Assistant replies anyway.