Poetry instructions from dataset
Transform these two datasets into instructions: https://huggingface.co/datasets/merve/poetry https://huggingface.co/datasets/matthh/gutenberg-poetry-corpus
Write me a poem about [keywords] in the style of [author] -> poem.
For a poem with these lines, complete the rest [lines] -> [rest of poem]
Write a [sentiment analyzed poem[ about [keywords] -> poem
Write me a [sentiment] poem that is entitled [paraphrase of title] -> poem
Write me a poem about [run t5 summary on poem] - poem
Also consider how you might do dialog steps.
Write me a poem ... -> [poem with some keywords removed]
Ok, but can you add more details about [keyword] -> write the poem with the keyword in tact
I'd be happy to take this Sunday if no one else has gotten around to it by then. For the dialog example step you provided, we could try masking out and iteratively interpolating new words for the masked-out keyword tokens. Is that what you had in mind? Other instruction ideas:
- Poem with whole line/stanza removed -> "Add a line about x" -> whole Poem
- Paraphrased Poem -> "Make it in the style of x" -> original Poem
@IsaacRe it's yours!
So i think masking would work well (would produce semantically matched words). You could also find words that have similar ending rhymes (ends with ion, ent, etc. ) And yes, to all your suggestion.
The idea is a poetry helper.
You can also try "Write me a poem about {NER} in the genre of {genre}"
https://www.kaggle.com/datasets/jatindersehdev/poetry-analysis-data
If you really want to try something challenging, try adding poetry explanation (scraped from a website?). So, the poem would come first, and you could do all the above ^ and then you could ask the bot to explain the poem.
@IsaacRe checking in on this.
I experimented with a few models for keyphrase extraction and summarization/paraphrasing of the poems in the linked datasets. I found some success using this model for keyphrase extraction and have set up a few pipelines for dialogue tasks involving keyphrase extraction in my PR (I'll attach a few of the example outputs for reference).
I havent been able to get good results for summarization yet with any of the models I've tried--they tend to repeat the input verbatim and attempts at forcing output diversity using degeneration penalty leads to gibberish. I could try directly lowering output LLs for words present in the input (degeneration penalty works at the word embedding level but that's not really what we want here).
As you mentioned though, I think we will need some data explaining the poems in plain english since these models seem pretty confused by this data, overall. I found this poetry analysis site earlier today--it seems to have exactly what we'd want. I can go ahead and start scraping it this week.
Thank you for your excellent work! Can you push the dataset to hf and tell @Vechtomov where it's located? Looking forward to other work on poetry! @IsaacRe !!
@Vechtomov I've generated train, test and val splits and pushed to https://huggingface.co/datasets/isaacrehg/poetry-instructions I finished crawling the above mentioned site and am working to aggregate the data collected. Will push up to hf soon.
Hi, thanks. We don't need separation on train, test and validation. Can you combine all in one file?
Yup, updated
My circumstances have changed this month and I unfortunately won't have time to continue contributing, so updating with what I have in case someone else wants to pick this up.
I've organized the crawled poem analyses into two datasets: full-poem summarizations and per-stanza analyses
The first sentence of the summarizations tend to be something like "
‘’Twas the old — road — through pain—’ by Emily Dickinson is a poem about the path one walks throughout life and toward death. ...
We could extract the object phrase of such first sentences to get a prompt like "Write me
I think the detailed-analysis could be used similarly to get prompts for adding lines or stanzas to an existing poem, ie. "add a stanza about
The analysis content doesnt provide the full content of the poems in question, so there's still work to be done linking the poem content for each analysis. Joining with the merve or matth datasets on poem/author name would I think be the first thing to try.