Open-Assistant icon indicating copy to clipboard operation
Open-Assistant copied to clipboard

Poetry instructions from dataset

Open huu4ontocord opened this issue 3 years ago • 10 comments

Transform these two datasets into instructions: https://huggingface.co/datasets/merve/poetry https://huggingface.co/datasets/matthh/gutenberg-poetry-corpus

Write me a poem about [keywords] in the style of [author] -> poem.

For a poem with these lines, complete the rest [lines] -> [rest of poem]

Write a [sentiment analyzed poem[ about [keywords] -> poem

Write me a [sentiment] poem that is entitled [paraphrase of title] -> poem

Write me a poem about [run t5 summary on poem] - poem

Also consider how you might do dialog steps.

Write me a poem ... -> [poem with some keywords removed]

Ok, but can you add more details about [keyword] -> write the poem with the keyword in tact

huu4ontocord avatar Jan 15 '23 05:01 huu4ontocord

I'd be happy to take this Sunday if no one else has gotten around to it by then. For the dialog example step you provided, we could try masking out and iteratively interpolating new words for the masked-out keyword tokens. Is that what you had in mind? Other instruction ideas:

  • Poem with whole line/stanza removed -> "Add a line about x" -> whole Poem
  • Paraphrased Poem -> "Make it in the style of x" -> original Poem

IsaacRe avatar Jan 21 '23 07:01 IsaacRe

@IsaacRe it's yours!

huu4ontocord avatar Jan 22 '23 03:01 huu4ontocord

So i think masking would work well (would produce semantically matched words). You could also find words that have similar ending rhymes (ends with ion, ent, etc. ) And yes, to all your suggestion.

The idea is a poetry helper.

You can also try "Write me a poem about {NER} in the genre of {genre}"

https://www.kaggle.com/datasets/jatindersehdev/poetry-analysis-data

If you really want to try something challenging, try adding poetry explanation (scraped from a website?). So, the poem would come first, and you could do all the above ^ and then you could ask the bot to explain the poem.

huu4ontocord avatar Jan 22 '23 03:01 huu4ontocord

@IsaacRe checking in on this.

huu4ontocord avatar Jan 27 '23 18:01 huu4ontocord

I experimented with a few models for keyphrase extraction and summarization/paraphrasing of the poems in the linked datasets. I found some success using this model for keyphrase extraction and have set up a few pipelines for dialogue tasks involving keyphrase extraction in my PR (I'll attach a few of the example outputs for reference).

I havent been able to get good results for summarization yet with any of the models I've tried--they tend to repeat the input verbatim and attempts at forcing output diversity using degeneration penalty leads to gibberish. I could try directly lowering output LLs for words present in the input (degeneration penalty works at the word embedding level but that's not really what we want here).

As you mentioned though, I think we will need some data explaining the poems in plain english since these models seem pretty confused by this data, overall. I found this poetry analysis site earlier today--it seems to have exactly what we'd want. I can go ahead and start scraping it this week.

validation.jsonl.txt train.jsonl.txt test.jsonl.txt

IsaacRe avatar Jan 30 '23 08:01 IsaacRe

Thank you for your excellent work! Can you push the dataset to hf and tell @Vechtomov where it's located? Looking forward to other work on poetry! @IsaacRe !!

huu4ontocord avatar Feb 04 '23 22:02 huu4ontocord

@Vechtomov I've generated train, test and val splits and pushed to https://huggingface.co/datasets/isaacrehg/poetry-instructions I finished crawling the above mentioned site and am working to aggregate the data collected. Will push up to hf soon.

IsaacRe avatar Feb 08 '23 19:02 IsaacRe

Hi, thanks. We don't need separation on train, test and validation. Can you combine all in one file?

Vechtomov avatar Feb 09 '23 08:02 Vechtomov

Yup, updated

IsaacRe avatar Feb 09 '23 23:02 IsaacRe

My circumstances have changed this month and I unfortunately won't have time to continue contributing, so updating with what I have in case someone else wants to pick this up.

I've organized the crawled poem analyses into two datasets: full-poem summarizations and per-stanza analyses

The first sentence of the summarizations tend to be something like " by is __ ". For example:

‘’Twas the old — road — through pain—’ by Emily Dickinson is a poem about the path one walks throughout life and toward death. ...

We could extract the object phrase of such first sentences to get a prompt like "Write me ". (Searching for "<poem/author> is " and grabbing content after should be enough for most records)

I think the detailed-analysis could be used similarly to get prompts for adding lines or stanzas to an existing poem, ie. "add a stanza about ".

The analysis content doesnt provide the full content of the poems in question, so there's still work to be done linking the poem content for each analysis. Joining with the merve or matth datasets on poem/author name would I think be the first thing to try.

IsaacRe avatar Feb 22 '23 05:02 IsaacRe