Alpaca 52K

Open coen22 opened this issue 2 years ago • 5 comments

Hey everyone,

The Alpaca dataset (https://github.com/tatsu-lab/stanford_alpaca) might be useful for training OA. However, it is legally a bit dubious, because the dataset was created using text-davinci-003, and OpenAI states that its software may not be used to create competing products. That said, it is now publicly shared data that one can obtain without using OpenAI's API.
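For anyone who wants to look at the data first: as far as I can tell, the release is a single JSON list (`alpaca_data.json`) of records with `instruction`, `input`, and `output` fields. Below is a minimal sketch, assuming that layout, of turning such records into prompt/response pairs. The function name `alpaca_to_pairs`, the way the prompt is joined, and the sample records are all made up for illustration; this is not OA's actual ingestion code.

```python
import json


def alpaca_to_pairs(records):
    """Turn Alpaca-style records into (prompt, response) pairs.

    Assumes each record is a dict with "instruction", "input", "output".
    Records with an empty instruction or output are skipped.
    """
    pairs = []
    for rec in records:
        instruction = rec.get("instruction", "").strip()
        context = rec.get("input", "").strip()
        response = rec.get("output", "").strip()
        if not instruction or not response:
            continue  # incomplete record, nothing to train on
        # Hypothetical prompt layout: instruction, blank line, optional input.
        prompt = f"{instruction}\n\n{context}" if context else instruction
        pairs.append((prompt, response))
    return pairs


# Made-up sample in the assumed shape (NOT taken from the real dataset).
sample = json.loads("""
[
  {"instruction": "Name three primary colors.", "input": "",
   "output": "Red, blue, and yellow."},
  {"instruction": "Translate to French.", "input": "Good morning",
   "output": "Bonjour"},
  {"instruction": "Empty output example.", "input": "", "output": ""}
]
""")

pairs = alpaca_to_pairs(sample)
for prompt, response in pairs:
    print(repr(prompt), "->", repr(response))
```

To run it on the real file, replace the inline sample with `json.load(open("alpaca_data.json"))`, assuming the file has been downloaded from the repository above.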

coen22 avatar Mar 21 '23 14:03 coen22

We can also generate a dataset using open-source LLMs. Then we can use it.

satani99 avatar Mar 22 '23 04:03 satani99

Well, I believe there currently aren’t any publicly available models that are fine-tuned using the InstructGPT approach, are there?

Though it could be nice to have Galactica or LLaMA integrated into the Open Assistant UI, so that users who answer prompts only have to check whether the generated response is correct.

coen22 avatar Mar 22 '23 08:03 coen22

> We can also generate a dataset using open-source LLMs. Then we can use it.

If possible, this would be great. Has anyone already tried what LLaMA 65B can generate with the right prompts?

andreaskoepf avatar Mar 22 '23 09:03 andreaskoepf

There is a much improved, cleaned, and curated version of the Alpaca dataset here: https://github.com/gururise/AlpacaDataCleaned

gururise avatar Mar 23 '23 06:03 gururise

It's funny, because I'm pretty sure most people on the Open Assistant platform use ChatGPT to write the Assistant replies.

IllusionDX avatar Apr 30 '23 01:04 IllusionDX