binjs-ref icon indicating copy to clipboard operation
binjs-ref copied to clipboard

Assertion failure: assert next_code_length >= code_length when code_length overflows

Open arai-a opened this issue 6 years ago • 6 comments

Steps to reproduce:

  1. run the following Python script and redirect the output to test.js
    for i in range(1, 23):
        for j in range(0, int(1.7**i)):
            print '{};'.format(i)

(the result is 1.1MB of JS file)

  1. encode the JS file generated in step 1 with fbssdc

Actual result:

crash at https://github.com/binast/binjs-ref/blob/5cdddbcef1f1b185a4936ac996180aa2cce54571/fbssdc/model.py#L472

AssertionError: (2097151, 21), 20, 21

arai-a avatar Sep 10 '19 14:09 arai-a

Cc @dominiccooney

So, we have too many numbers in the table of numbers?

Yoric avatar Sep 10 '19 14:09 Yoric

it tries to assign 21-bit length code for 1 and 2 symbols.

we'll need to define the fallback algorithm for this kind of case. (IMO, this doesn't happen in usual wild case, so just falling back to assign same length code for all items should be fine)

arai-a avatar Sep 10 '19 14:09 arai-a

That algorithm is a non-normative illustration of assigning codes, don't take it as spec. We should add a note about the code length issue.

We can and should leave code length limiting algorithms as an implementation detail of the encoder, because there are slow, accurate algorithms and fast approximating algorithms and depending on the circumstances either could be preferable. We don't want to bake one into the spec and the decoder doesn't care.

If you need one to crib from you can dig up papers on length limited Huffman codes for the accurate algorithm or zstd has a fast approximation (probably many other Huffman compressors too.) Facebook does something like zstd in our Hack encoder.

On Tue, Sep 10, 2019, 11:35 PM arai-a [email protected] wrote:

it tries to assign 21-bit length code for 1 and 2 symbols.

we'll need to define the fallback algorithm for this kind of case. (IMO, this doesn't happen in usual wild case, so just falling back to assign same length code for all items should be fine)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/binast/binjs-ref/issues/440?email_source=notifications&email_token=AAANOUAT6TEVDHYZ25CBIBTQI6WDJA5CNFSM4IVIDCFKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD6LKB7A#issuecomment-529965308, or mute the thread https://github.com/notifications/unsubscribe-auth/AAANOUDW2CQMFD6N6VX6LFDQI6WDJANCNFSM4IVIDCFA .

dominiccooney avatar Sep 12 '19 21:09 dominiccooney

That algorithm is a non-normative illustration of assigning codes, don't take it as spec. We should add a note about the code length issue.

Ah, good to know. We're using it as spec all over SpiderMonkey for the time being :)

Yoric avatar Sep 13 '19 05:09 Yoric

Ok. A correct encoder should not output lengths > 20 bits and SpiderMonkey should reject files with 21+ bit length codes.

BTW 20 bits was an arbitrary choice and may be something to revisit. Some people may need a million strings... I hope not but I would not be surprised if there's a file out there that big.

FWIW Facebook does have files where the optimal code has lengths > 20 bits and hence we use that heuristic to limit code lengths.

On Fri, Sep 13, 2019, 2:27 PM David Teller [email protected] wrote:

That algorithm is a non-normative illustration of assigning codes, don't take it as spec. We should add a note about the code length issue.

Ah, good to know. We're using it as spec all over SpiderMonkey for the time being :)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/binast/binjs-ref/issues/440?email_source=notifications&email_token=AAANOUCQHMNQYCF5QFEFPATQJMQDVA5CNFSM4IVIDCFKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD6T732A#issuecomment-531103208, or mute the thread https://github.com/notifications/unsubscribe-auth/AAANOUDDAO53YJPQB3ZVDETQJMQDVANCNFSM4IVIDCFA .

dominiccooney avatar Sep 13 '19 07:09 dominiccooney

Oh, got it, 20 is part of the spec, but the algorithm isn't.

So, given that there are real-world files where 20 isn't sufficient for an optimal code, do we wish to keep 20 as a hard limit? Or do we expect to become suboptimal for huge files (and possibly issue a warning telling the developers that their file is simply too large)?

Yoric avatar Sep 13 '19 10:09 Yoric