Naked Fear, Loathing, Pride, Prejudice, and Brunch at Tiffany's (in Las Vegas).

Open hornc opened this issue 5 years ago • 0 comments

This is going to be a continuation of my ideas from last year in https://github.com/NaNoGenMo/2019/issues/65

Treating vocabularies as numbering systems, and works composed from them as large numbers, to be manipulated.
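To make that concrete, here is a minimal sketch of the idea, not the actual project code. It assumes plain whitespace tokenisation and assigns digit values in order of first appearance; the helper names (`build_vocab`, `text_to_int`, `int_to_text`) are made up for illustration. The vocabulary size is the base, each token is one digit, and the whole text becomes a single Python integer.

```python
def build_vocab(tokens):
    """Assign each distinct token a digit value, in order of first appearance."""
    vocab = []   # digit value -> token
    index = {}   # token -> digit value
    for tok in tokens:
        if tok not in index:
            index[tok] = len(vocab)
            vocab.append(tok)
    return vocab, index


def text_to_int(tokens, index):
    """Read the token sequence as digits of a number in base len(index)."""
    base = len(index)
    n = 0
    for tok in tokens:
        n = n * base + index[tok]
    return n


def int_to_text(n, vocab, length):
    """Write n back out as `length` digits in base len(vocab).

    The length is needed because leading zero digits (tokens with
    value 0 at the start of the text) are not recorded in the integer.
    """
    base = len(vocab)
    out = []
    for _ in range(length):
        n, digit = divmod(n, base)
        out.append(vocab[digit])
    return out[::-1]


# Assumption: whitespace splitting stands in for a real tokeniser.
tokens = "the cat sat on the mat".split()
vocab, index = build_vocab(tokens)
number = text_to_int(tokens, index)

# Any arithmetic on `number` now yields a new "text" in the same vocabulary.
halved = int_to_text(number // 2, vocab, len(tokens))
```

Once the text is an integer, "manipulating" it is ordinary arithmetic: halve it, add two works together, and so on, then decode the result with the same vocabulary.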

Following some very good advice last year, I switched focus towards the end of the month to making sure I actually had 50k words in some kind of readable format, rather than bug-free code that was pure and true to a half-baked concept only I was judging. It was a good exercise in project management: focus on the results that matter.

I was happy enough with the results last year. Some of the bugs / issues with the tokenisation of the source material seemed to make the output more interesting, and my attempts to fix them resulted in (if I remember correctly) less interesting output, so I embraced the glitches and accomplished the goal of producing a generated novel using a simple arithmetic operation on a text.

This round I want to:

  • Generalise the tokenisation to be robust against many kinds of input (I'll be using a mix of properly edited text and some OCR'd source content)
  • Work on formalising the tokenisation algorithm so it is repeatable / comprehensible
  • Overcome the challenge of converting a > 100K word text like Pride and Prejudice into an integer. With the current code this requires more than 4 GB of RAM.
  • Work on a shared vocab across more than one source work (4) and do some more interesting averaging or combinations.
  • Figure out if there is a conceptually pure way to make the text output interesting, or whether the output will really be as interesting as reading a large integer.
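As a rough sketch of the shared-vocab and averaging goals above (again an illustration under assumptions, not the planned implementation): build one vocabulary from the union of all source works so they share a base, encode each work as an integer, and take the integer mean. The function names, the sorted-vocabulary ordering, and the choice to decode to the longest work's length are all hypothetical. `estimated_bits` is a back-of-envelope size for the final integer only; it says nothing about where the current code spends its memory.

```python
import math


def shared_vocab(works):
    """Sorted union of tokens across all works, so every work uses the same base."""
    vocab = sorted({tok for work in works for tok in work})
    return vocab, {tok: i for i, tok in enumerate(vocab)}


def encode(tokens, index):
    """Token sequence -> integer in base len(index)."""
    base = len(index)
    n = 0
    for tok in tokens:
        n = n * base + index[tok]
    return n


def decode(n, vocab, length):
    """Integer -> `length` tokens in base len(vocab)."""
    base = len(vocab)
    out = []
    for _ in range(length):
        n, digit = divmod(n, base)
        out.append(vocab[digit])
    return out[::-1]


def average_works(works):
    """Integer mean of several works encoded over one shared vocabulary."""
    vocab, index = shared_vocab(works)
    numbers = [encode(work, index) for work in works]
    mean = sum(numbers) // len(numbers)
    # The mean never exceeds the largest input, so the longest work's
    # length is always enough digits to hold it.
    length = max(len(work) for work in works)
    return decode(mean, vocab, length)


def estimated_bits(token_count, vocab_size):
    """Approximate size of the final integer: one digit per token."""
    return math.ceil(token_count * math.log2(vocab_size))


# Toy input; real works would come from a tokeniser.
works = [["a", "b", "c"], ["c", "b", "a"]]
blended = average_works(works)
```

Whether a blend like this reads as anything more than a large integer is exactly the open question in the last bullet. The ordering of the shared vocabulary decides which tokens count as "close", so it is probably the main knob to experiment with.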

hornc, Nov 04 '20 02:11