Many generated sentences contain unbalanced punctuation/markdown

Open Deimos opened this issue 10 years ago • 0 comments

By default, markovify throws out any sentence containing quotes, parentheses, or square brackets, because they tend to end up unbalanced in the generated sentences. I overrode that behavior because it was removing a huge number of sentences from the training data, such as almost every title in /r/relationships and most comments from /r/scenesfromahat. But by doing that I've ended up with exactly the result it was trying to avoid: a lot of unmatched quotes and brackets in the output.

Main things to try to fix with this:

  • Quotes - both double-quotes and single-quotes (need to distinguish from apostrophes)
  • Parentheses
  • Square brackets (especially as markdown link text)
  • Asterisks being used for bold and italic markdown
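
One possible direction, sketched here only as an illustration: keep the full training data but reject generated sentences whose punctuation does not balance. The function name `is_balanced` and the apostrophe heuristic below are assumptions made for this sketch, not part of markovify or SubredditSimulator.

```python
import re
from collections import Counter

# Closing bracket -> matching opening bracket.
PAIRS = {")": "(", "]": "["}
OPENERS = set(PAIRS.values())

# Heuristic (an assumption of this sketch): a single quote sitting between
# two word characters is an apostrophe ("don't"), not a quotation mark.
# Trailing possessives ("dogs' toys") are NOT caught and count as quotes.
APOSTROPHE = re.compile(r"(?<=\w)'(?=\w)")


def is_balanced(sentence: str) -> bool:
    """Return True if quotes, brackets and markdown asterisks look balanced."""
    text = APOSTROPHE.sub("", sentence)

    # Parentheses and square brackets must nest correctly.
    stack = []
    for ch in text:
        if ch in OPENERS:
            stack.append(ch)
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    if stack:
        return False

    # Double quotes and remaining single quotes must come in pairs.
    if text.count('"') % 2 or text.count("'") % 2:
        return False

    # Asterisk runs ("*", "**", "***") must each occur an even number of
    # times, so "**bold*" or a lone "**bold" is rejected.
    runs = Counter(len(run) for run in re.findall(r"\*+", text))
    return all(count % 2 == 0 for count in runs.values())
```

A check like this could be applied to each generated sentence, retrying generation when it returns `False`. It is deliberately conservative and will reject some valid text, such as sentences with trailing possessive apostrophes.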

Deimos, Aug 19 '15 14:08