
Memory usage on large inputs

Open melisgl opened this issue 2 years ago • 2 comments

I'm using the per-block implementation in parse-doc, but it's still fairly easy to run out of memory on large %blocks, for example with something like this:

CL-USER> (time
          (let ((input (with-output-to-string (out)
                         (loop repeat 100000
                               do (format out "- ~A ~A ~A ~A~%"
                                          (random 1000000) (random 1000000)
                                          (random 1000000) (random 1000000))))))
            (3bmd-grammar::parse-doc input)
            (length input)))
Evaluation took:
  12.364 seconds of real time
  12.371129 seconds of total run time (11.750773 user, 0.620356 system)
  [ Run times consist of 5.771 seconds GC time, and 6.601 seconds non-GC time. ]
  100.06% CPU
  37,030,481,562 processor cycles
  15,570,202,368 bytes consed
  
2955662
CL-USER> (/ 15570202368 2955662.0)
5267.924

This example uses a bulleted list because it is probably the worst offender, but a large paragraph behaves similarly.

According to time, consing scales linearly with the number of repeats, which is good. Perhaps 5267 bytes per character is too high, but I suspect the main problem is that the maximum size of the working set also scales linearly.

melisgl avatar Jun 09 '23 07:06 melisgl

Yeah, memory use can be a problem for packrat parsers, since they memoize a lot in return for linear runtime. From the packrat thesis:

The primary disadvantage of packrat parsing is its storage cost, which is a constant multiple of the total input size rather than being proportional to the nesting depth of the syntactic constructs appearing in the input.
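To make the trade-off concrete, here is a minimal Python sketch (not 3bmd/esrap code) of packrat parsing for a toy grammar `items <- item items / item; item <- digit`. Every (rule, position) result is cached, so the cache ends up with one entry per rule per input position, i.e. it grows with the total input size rather than with nesting depth:

```python
def parse_items(text):
    """Toy packrat parser: items <- item items / item ; item <- digit."""
    memo = {}  # (rule name, position) -> end position of the match, or None

    def item(pos):
        key = ("item", pos)
        if key not in memo:
            memo[key] = pos + 1 if pos < len(text) and text[pos].isdigit() else None
        return memo[key]

    def items(pos):
        key = ("items", pos)
        if key not in memo:
            nxt = item(pos)
            if nxt is None:
                memo[key] = None
            else:
                rest = items(nxt)           # try the longer alternative first
                memo[key] = rest if rest is not None else nxt
        return memo[key]

    # Returns whether the whole input matched, and how many memo entries
    # were retained: 2 rules x (len(text) + 1) positions for this grammar.
    return items(0) == len(text), len(memo)
```

For "1234" this caches 10 entries (both rules at positions 0 through 4); doubling the input roughly doubles the table, which is the linear storage cost the thesis quote describes.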

Best fix would probably be to replace it with a CommonMark parser, since that is specified so it can be parsed by lines. As mentioned in #53 though, that would require rewriting things and probably would only be partially compatible with existing 3bmd code at best.

You can see what it is doing with

(esrap:trace-rule '3bmd-grammar::doc :recursive t)

before running a parse (try it with small inputs first; one line of the list from your test produces about 1500 lines of trace output). Use (esrap:untrace-all-rules) to turn it off again.

If I remember how packrat works, each of the | or -> in the trace is memoized, which is why there is so much overhead.

And now that I look at the trace, I remember that list items (and probably a few other things) do recursive parses, so it is repeating some work but a lot of that consing is actually getting thrown away at the end of the recursive parse.

Possibly adding (:text t) to the raw-line rule would help with working set if you have some way to test that? (Doesn't seem to affect consing, but I think it would hold on to much less data that way without adding much processing overhead.)
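A rough Python illustration (not esrap) of why caching the matched text rather than a structured result can shrink the working set: a single string slice replaces the many small per-element objects a memo table would otherwise keep alive.

```python
import sys

line = "x" * 1000

# What a char-by-char rule result might retain: a list of per-character objects.
chars_as_list = [c for c in line]
# What a (:text t)-style rule retains: just the matched text.
chars_as_text = line

# Approximate retained sizes (the list's pointer array plus its elements,
# versus one string object).
list_size = sys.getsizeof(chars_as_list) + sum(sys.getsizeof(c) for c in chars_as_list)
text_size = sys.getsizeof(chars_as_text)
```

The exact numbers are implementation-specific, but the string form is much smaller, which is the "hold on to much less data" effect above.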

There might be some other similar places where we can reduce the size of the cache by throwing away things that are only there for the grammar and not needed in the output; I will have to look into that at some point. There might also be places where we can reorder or add rules to avoid testing things that could be proven impossible. For example, it seems to be doing extra work to determine there isn't a block at eof, which affects consing but gets thrown away at the end of the recursive parse, so it doesn't affect working set much.

(defrule %block (and (! eof) (* blank-line)
                     #.(cons 'or %block-rules%))
  (:destructure (eof blank block)
    (declare (ignore eof blank))
    block))
;; define-extension-block needs to be updated to match as well, I think?

would avoid that if you want to try it, but I need to think/test a bit more to be sure it doesn't have any side effects. Maybe something similar for inlines?
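The fail-fast idea in that rule change can be sketched in Python (a generic ordered-choice shape, not esrap internals): check the cheap negative condition, here end-of-input, before trying any of the expensive alternatives, so nothing is parsed or cached at eof.

```python
def parse_block(text, pos, alternatives):
    """Ordered choice with a cheap guard, like (and (! eof) ... (or ...))."""
    if pos >= len(text):          # the (! eof) guard: bail out immediately at eof
        return None
    for alt in alternatives:      # try each block rule in order
        result = alt(text, pos)
        if result is not None:    # first successful alternative wins
            return result
    return None
```

At end of input this returns before touching the alternatives, which is exactly the wasted work the defrule above avoids.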

3b avatar Jun 11 '23 23:06 3b