Open-Assistant Create recipes dataset

Food brings us together, so I'm excited for us to explore pulling together an amazing recipe dataset.

Ingredient lists and basic directions are not subject to copyright (see page 2 of https://www.copyright.gov/circs/circ33.pdf).

Let's evaluate existing recipe data sets (checking for quality, diversity and license): https://huggingface.co/datasets?search=recipe

Depending on the quality of the Hugging Face datasets, we can also explore online recipes for ingredient lists and methods. Be aware of overlap, as many online recipes are likely to be in the Hugging Face datasets already.

We can look at creative ways to transform these into conversations, e.g.:

Split up ingredients from method with one or more questions in between.
Add follow-up questions like 'What temperature should the oven be?', 'How many grams of flour do I need?' (are we allowed to use a simple OSS model to generate these responses?)
Index non-stopwords from poems: https://huggingface.co/datasets/matthh/gutenberg-poetry-corpus, and mix in a bit of a poem at the end or beginning that might overlap with non-stopwords in the recipe (wonderful idea from @ontocord)
More fun ideas please!
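
A minimal sketch of the poem idea above, assuming a plain word-overlap match. The stopword list, the poem lines, and all function names here are made up for illustration; a real run would index lines from the gutenberg-poetry-corpus and use a fuller stopword list.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "with", "for", "on", "or", "is", "it"}

def content_words(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def build_index(poem_lines):
    """Map each non-stopword to the indices of the poem lines containing it."""
    index = {}
    for i, line in enumerate(poem_lines):
        for word in content_words(line):
            index.setdefault(word, set()).add(i)
    return index

def best_poem_line(recipe_text, poem_lines, index):
    """Return the poem line sharing the most non-stopwords with the recipe, or None."""
    counts = {}
    for word in content_words(recipe_text):
        for i in index.get(word, ()):
            counts[i] = counts.get(i, 0) + 1
    if not counts:
        return None
    # Ties go to the earliest line so the result is deterministic.
    best = min(counts, key=lambda i: (-counts[i], i))
    return poem_lines[best]

# Made-up poem lines standing in for the real corpus.
poem_lines = [
    "The salt sea wind blew cold across the bay",
    "Sweet honey and brown sugar on the bread",
    "A garden full of roses in the rain",
]
index = build_index(poem_lines)
recipe = "Mix the barbecue sauce, brown sugar, garlic powder and salt."
print(best_poem_line(recipe, poem_lines, index))
```

The chosen line could then be placed before or after the recipe text in the Assistant reply.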

Format is currently bullet point char for list, numbers for method. Example:


Chip: Ingredients:
 • 2 racks of baby back pork ribs (about 4 pounds each)
 • 1 bottle of your favorite barbecue sauce
 • 1/2 cup brown sugar
 • 1 tablespoon garlic powder
 • 1 teaspoon onion powder
 • 1 teaspoon paprika
 • 1 teaspoon cumin
 • Salt and pepper to taste

Instructions:
1. Preheat the oven to 350 degrees Fahrenheit.
2. Place the ribs on a rimmed baking sheet, meat side up.
3. In a bowl, mix together the barbecue sauce, brown sugar, garlic powder, onion powder, paprika, cumin, salt, and pepper.
4. Spoon about 1/4 cup of the sauce mixture over the top of each rack of ribs.
5. Bake in preheated oven for 30 minutes or until the ribs are tender.
6. Serve with remaining sauce on the side. Enjoy!
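
A small sketch of how the format above could be produced from structured data, assuming each recipe is available as a list of ingredient strings and a list of step strings. The function name `format_recipe` is hypothetical.

```python
def format_recipe(ingredients, steps):
    """Render a recipe with bullet characters for ingredients and numbers for the method."""
    lines = ["Ingredients:"]
    lines += [f" • {item}" for item in ingredients]
    lines += ["", "Instructions:"]
    lines += [f"{n}. {step}" for n, step in enumerate(steps, start=1)]
    return "\n".join(lines)

text = format_recipe(
    ["1/2 cup brown sugar", "Salt and pepper to taste"],
    ["Preheat the oven to 350 degrees Fahrenheit.", "Serve."],
)
print(text)
```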

Jan 31 '23 07:01 dctanner

Let me know if you would like to add a cocktail dataset. I could create a new issue to work out the format of the data I currently have. It was parsed from 12 classic cocktail books in 2015 for an abandoned project.

Feb 06 '23 23:02 BrianArbuckle

I should add that there are a handful of "made famous by" items. Might be a nice item to add.

Feb 07 '23 00:02 BrianArbuckle

Yes. @BrianArbuckle please create a GitHub issue and I will assign it to you. Sounds like a lot of fun. Cheers!

Feb 07 '23 04:02 huu4ontocord

Hi @ontocord, thanks! It is up as #1286.

Feb 07 '23 05:02 BrianArbuckle

Looking forward to the cocktails @BrianArbuckle!

I've settled on https://huggingface.co/datasets/recipe_nlg as the best dataset, in particular the items labelled 'Gathered', which are higher quality (fewer mistakes in measurement units).

I've made a handful of User prompt templates and plan to randomly assign one to each recipe. In ONE_STEP_TEMPLATES the Assistant will reply with the full ingredients and method. In TWO_STEP_TEMPLATES_1 and TWO_STEP_TEMPLATES_2 the Assistant will reply first with the ingredients list, and then the follow-up prompt returns the method.
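
A rough sketch of how the template assignment could look. The three list names come from the description above, but the template strings, the 50/50 split between one-step and two-step dialogs, and the `build_dialog` helper are assumptions for illustration only.

```python
import random

# Placeholder templates; the real lists are not shown in this thread.
ONE_STEP_TEMPLATES = ["How do I make {title}?", "Can you give me a recipe for {title}?"]
TWO_STEP_TEMPLATES_1 = ["What ingredients do I need for {title}?", "What goes into {title}?"]
TWO_STEP_TEMPLATES_2 = ["Great, how do I make it?", "What are the steps?"]

def build_dialog(title, ingredients_text, method_text, rng):
    """Return a list of (speaker, text) turns for one recipe."""
    if rng.random() < 0.5:  # assumed split between one-step and two-step dialogs
        prompt = rng.choice(ONE_STEP_TEMPLATES).format(title=title)
        return [("User", prompt), ("Assistant", ingredients_text + "\n\n" + method_text)]
    first = rng.choice(TWO_STEP_TEMPLATES_1).format(title=title)
    second = rng.choice(TWO_STEP_TEMPLATES_2).format(title=title)
    return [
        ("User", first),
        ("Assistant", ingredients_text),
        ("User", second),
        ("Assistant", method_text),
    ]

rng = random.Random(0)  # seeded so the assignment is reproducible
for speaker, text in build_dialog("barbecue ribs", "Ingredients: ...", "Instructions: ...", rng):
    print(f"{speaker}: {text}")
```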

Any suggestions for improving this are welcome!

Feb 08 '23 16:02 dctanner

Hi @dctanner looks great!

I have grappled with the question of what to do when several different recipes share the same title. Unique ids solve the question, and that may be enough. I have toyed with a slug concept like: old-fashioned-02 for the third Old Fashioned recipe.
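
One way the slug concept could be sketched, assuming a zero-based, two-digit counter per title (which is how old-fashioned-02 ends up being the third recipe). The helper names `slugify` and `assign_slugs` are hypothetical.

```python
import re
from collections import defaultdict

def slugify(title):
    """Lowercase the title and join its alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def assign_slugs(titles):
    """Give each title a slug with a zero-based counter for repeated titles."""
    seen = defaultdict(int)
    slugs = []
    for title in titles:
        base = slugify(title)
        slugs.append(f"{base}-{seen[base]:02d}")
        seen[base] += 1
    return slugs

print(assign_slugs(["Old Fashioned", "French 75", "Old Fashioned", "Old Fashioned"]))
```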

Also, I have tidbits for a small percentage of the cocktails, such as "The French 75 was made famous by Humphrey Bogart in Casablanca.", "This version is the official IBA recipe." or "The IBA category is The Unforgettables." There is also recommended glassware. These are all columns in my database table, yet 75%+ are empty.

From an ML perspective, I would prefer not to have a bunch of NaNs, as I will most likely be dropping those features. Yet for the nuance of an LLM, I think they would be nice to have. So the question would be:

Should I add a nested dictionary? I never like dealing with those in a dataset myself.
Create the extra keys, which will remain largely empty?
Have a single "notes" key that would contain all the anecdotal information as a list of strings?

Feb 08 '23 17:02 BrianArbuckle

@BrianArbuckle simplest is usually best :) Maybe just keep these extra things as columns in your db (even though most will be blank). As I understand it, when preparing the data for training, we are just creating text that is an example dialog between the User and Assistant. So you can simply include the extra bits of information in the Assistant reply when they exist.
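
A minimal sketch of that suggestion, assuming each database row is a dict whose optional columns may be empty or missing. The column names `glassware` and `note` and the function `assistant_reply` are made up for illustration.

```python
def assistant_reply(recipe_text, row):
    """Append whichever optional details the row actually has to the recipe text."""
    extras = []
    if row.get("glassware"):
        extras.append(f"Recommended glassware: {row['glassware']}.")
    if row.get("note"):
        extras.append(row["note"])
    if not extras:
        return recipe_text  # blank columns leave the reply unchanged
    return recipe_text + "\n\n" + " ".join(extras)

row = {"glassware": "champagne flute", "note": None}
print(assistant_reply("Ingredients: ...", row))
```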

Feb 09 '23 11:02 dctanner