Open-Assistant Poetry instructions from dataset

Transform these two datasets into instructions: https://huggingface.co/datasets/merve/poetry https://huggingface.co/datasets/matthh/gutenberg-poetry-corpus

Write me a poem about [keywords] in the style of [author] -> poem.

For a poem with these lines, complete the rest [lines] -> [rest of poem]

Write a [sentiment analyzed] poem about [keywords] -> poem

Write me a [sentiment] poem that is entitled [paraphrase of title] -> poem

Write me a poem about [run t5 summary on poem] -> poem

Also consider how you might do dialog steps.

Write me a poem ... -> [poem with some keywords removed]

Ok, but can you add more details about [keyword] -> write the poem with the keyword intact
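As a rough illustration of the templates and the dialog step above, here is a minimal sketch that turns one poem record into instruction pairs. The function names, the argument names, and the choice to drop whole lines that mention the keyword (rather than individual words) are assumptions made for the example, not something fixed by this issue.

```python
def make_examples(poem, author, keywords, n_prompt_lines=2):
    """Build (instruction, response) pairs from a single poem (hypothetical helper)."""
    lines = [ln for ln in poem.splitlines() if ln.strip()]
    topic = ", ".join(keywords)
    examples = [
        (f"Write me a poem about {topic} in the style of {author}.", poem),
    ]
    # Completion task: show the first few lines, ask for the remainder.
    if len(lines) > n_prompt_lines:
        head = "\n".join(lines[:n_prompt_lines])
        rest = "\n".join(lines[n_prompt_lines:])
        examples.append(
            (f"For a poem with these lines, complete the rest:\n{head}", rest)
        )
    return examples


def make_dialog(poem, keyword, topic):
    """Two-turn dialog: a reduced poem first, then the full poem on request."""
    # Assumption: "removing a keyword" is approximated by dropping every
    # line that mentions it (case-insensitive substring match).
    reduced = "\n".join(
        ln for ln in poem.splitlines() if keyword.lower() not in ln.lower()
    )
    return [
        ("user", f"Write me a poem about {topic}."),
        ("assistant", reduced),
        ("user", f"Ok, but can you add more details about {keyword}?"),
        ("assistant", poem),
    ]
```

The keywords and topic would come from whatever keyphrase extractor is used; they are passed in here so the sketch stays independent of any particular model.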

Jan 15 '23 05:01 huu4ontocord

I'd be happy to take this Sunday if no one else has gotten around to it by then. For the dialog example step you provided, we could try masking out and iteratively interpolating new words for the masked-out keyword tokens. Is that what you had in mind? Other instruction ideas:

Poem with whole line/stanza removed -> "Add a line about x" -> whole Poem
Paraphrased Poem -> "Make it in the style of x" -> original Poem

Jan 21 '23 07:01 IsaacRe

@IsaacRe it's yours!

Jan 22 '23 03:01 huu4ontocord

So I think masking would work well (it would produce semantically matched words). You could also find words that have similar ending rhymes (ending with "ion", "ent", etc.). And yes to all your suggestions.
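A minimal sketch of the ending-rhyme idea, assuming a plain spelling-based match on the last few letters. This is only a crude stand-in for real rhyme detection (which would need pronunciations), and the function name and `suffix_len` parameter are made up for the example.

```python
def rhyme_candidates(target, vocabulary, suffix_len=3):
    """Return words from `vocabulary` that share the last `suffix_len` letters with `target`."""
    suffix = target.lower()[-suffix_len:]
    return sorted(
        {
            word
            for word in vocabulary
            if word.lower() != target.lower() and word.lower().endswith(suffix)
        }
    )
```

For instance, `rhyme_candidates("nation", ["station", "motion", "bent"])` would keep "station" and "motion" because both end in "ion".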

The idea is a poetry helper.

You can also try "Write me a poem about {NER} in the genre of {genre}"

https://www.kaggle.com/datasets/jatindersehdev/poetry-analysis-data

If you really want to try something challenging, try adding poetry explanation (scraped from a website?). So, the poem would come first, and you could do all the above ^ and then you could ask the bot to explain the poem.

Jan 22 '23 03:01 huu4ontocord

@IsaacRe checking in on this.

Jan 27 '23 18:01 huu4ontocord

I experimented with a few models for keyphrase extraction and summarization/paraphrasing of the poems in the linked datasets. I found some success using this model for keyphrase extraction and have set up a few pipelines for dialogue tasks involving keyphrase extraction in my PR (I'll attach a few of the example outputs for reference).

I haven't been able to get good results for summarization yet with any of the models I've tried: they tend to repeat the input verbatim, and attempts at forcing output diversity with a degeneration penalty lead to gibberish. I could try directly lowering output log-likelihoods (LLs) for words present in the input (the degeneration penalty works at the word-embedding level, but that's not really what we want here).
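To make the "lower the output LLs for input words" idea concrete, here is a library-free sketch that subtracts a fixed penalty from the score of every vocabulary id seen in the input. The function name and the `penalty` value are assumptions, and how this would be hooked into a specific model's decoding loop is deliberately left out.

```python
def penalize_input_tokens(logits, input_token_ids, penalty=2.0):
    """Return a copy of `logits` with `penalty` subtracted for ids present in the input.

    `logits` is a list of per-vocabulary-id scores for the next token;
    `input_token_ids` are the ids of the tokens in the source poem.
    """
    seen = set(input_token_ids)
    return [
        score - penalty if token_id in seen else score
        for token_id, score in enumerate(logits)
    ]
```

Unlike a penalty computed on embedding similarity, this only touches tokens that literally occur in the input, so paraphrases with different words are left alone.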

As you mentioned though, I think we will need some data explaining the poems in plain English, since these models seem pretty confused by this data overall. I found this poetry analysis site earlier today, and it seems to have exactly what we'd want. I can go ahead and start scraping it this week.

validation.jsonl.txt train.jsonl.txt test.jsonl.txt

Jan 30 '23 08:01 IsaacRe

Thank you for your excellent work! Can you push the dataset to hf and tell @Vechtomov where it's located? Looking forward to other work on poetry! @IsaacRe !!

Feb 04 '23 22:02 huu4ontocord

@Vechtomov I've generated train, test and val splits and pushed them to https://huggingface.co/datasets/isaacrehg/poetry-instructions. I finished crawling the above-mentioned site and am working to aggregate the data collected. Will push it up to hf soon.

Feb 08 '23 19:02 IsaacRe

Hi, thanks. We don't need separate train, test and validation splits. Can you combine everything into one file?

Feb 09 '23 08:02 Vechtomov

Yup, updated

Feb 09 '23 23:02 IsaacRe

My circumstances have changed this month and I unfortunately won't have time to continue contributing, so updating with what I have in case someone else wants to pick this up.

I've organized the crawled poem analyses into two datasets: full-poem summarizations and per-stanza analyses.

The first sentence of each summarization tends to be something like "[title] by [author] is [description]". For example:

‘’Twas the old — road — through pain—’ by Emily Dickinson is a poem about the path one walks throughout life and toward death. ...

We could extract the object phrase of such first sentences to get a prompt like "Write me

I think the detailed-analysis data could be used similarly to get prompts for adding lines or stanzas to an existing poem, i.e. "add a stanza about ".
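A rough sketch of pulling the title, author and topic out of a summary's first sentence, assuming the sentence follows the quoted-title "by ... is a poem about ..." shape of the Emily Dickinson example. The regex, the function name, and the exact prompt wording (modeled on the "Write me a poem about [keywords]" template from the top of this issue) are assumptions; summaries that don't fit the pattern simply return `None`.

```python
import re

# Opening and closing quote characters a title may be wrapped in.
OPEN_Q = "‘“'\""
CLOSE_Q = "’”'\""

FIRST_SENTENCE = re.compile(
    rf"^[{OPEN_Q}](?P<title>.+?)[{CLOSE_Q}] by (?P<author>.+?) is (?P<description>[^.]+)\."
)


def prompt_from_summary(summary):
    """Return title, author and a prompt built from the summary's first sentence."""
    match = FIRST_SENTENCE.match(summary.strip())
    if match is None:
        return None
    description = match.group("description").strip()
    marker = "a poem about "
    if not description.startswith(marker):
        return None
    topic = description[len(marker):]
    return {
        "title": match.group("title"),
        "author": match.group("author"),
        "prompt": f"Write me a poem about {topic}.",
    }
```

Descriptions that use a different lead ("a sonnet that...", "an elegy for...") would need extra markers or a proper parser; this only covers the simplest case.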

The analysis content doesn't provide the full text of the poems in question, so there's still work to be done linking the poem text to each analysis. Joining with the merve or matthh datasets on poem/author name would, I think, be the first thing to try.
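One way the join could look, as a sketch only: normalize title and author on both sides and match on the pair. The field names (`title`, `author`, `content`) are hypothetical and would have to be mapped to whatever columns the linked datasets actually expose.

```python
import re
import unicodedata


def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch)).lower()
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    return " ".join(text.split())


def link_analyses(analyses, poems):
    """Attach poem text to each analysis by normalized (title, author)."""
    index = {
        (normalize(p["title"]), normalize(p["author"])): p["content"] for p in poems
    }
    linked, unmatched = [], []
    for analysis in analyses:
        key = (normalize(analysis["title"]), normalize(analysis["author"]))
        if key in index:
            linked.append({**analysis, "content": index[key]})
        else:
            unmatched.append(analysis)
    return linked, unmatched
```

Returning the unmatched analyses separately makes it easy to see how many titles need fuzzier matching afterwards.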

Feb 22 '23 05:02 IsaacRe