
Tokens beginning with # cause a crash when using count files

Open mjwillson opened this issue 11 years ago • 0 comments

(Reporting this here as well as at https://code.google.com/p/mitlm/issues/detail?id=44, in case GitHub gets more attention these days.)

The crash only happens if the n-gram order is higher than 1, and only if the # occurs at the start of a token.

I'm guessing this is because a # at the beginning of a line in a text counts file is interpreted as the start of a comment, so the line is skipped. A unigram beginning with a # is then missing from the term dictionary when it's encountered in a later bigram.
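
To make the guess concrete, here is a minimal Python sketch of the suspected failure mode. It is not mitlm's code: `load_counts` and `missing_backoffs` are hypothetical helpers, the comment-skipping rule is the assumption under test, and exempting a lone `</s>` from the lookup is a simplification made only for this toy model. It loads the counts shown in the reproduction steps and lists every higher-order n-gram whose backoff (the n-gram with its first word removed) is absent.

```python
# Toy model of the suspected bug; none of this is taken from mitlm itself.
END = "</s>"  # assumption: a lone end-of-sentence marker never needs a counts entry

COUNTS = """\
<s>\t1
a\t1
#hashtag\t1
<s> a\t1
a #hashtag\t1
#hashtag </s>\t1
<s> a #hashtag\t1
a #hashtag </s>\t1
"""


def load_counts(lines, skip_hash_comments=True):
    """Parse 'w1 w2 ... <whitespace> count' lines into {ngram tuple: count}."""
    counts = {}
    for line in lines:
        if not line.strip():
            continue
        # Assumed behaviour: a line starting with '#' is treated as a comment.
        if skip_hash_comments and line.startswith("#"):
            continue
        ngram_text, count_text = line.rsplit(None, 1)
        counts[tuple(ngram_text.split())] = int(count_text)
    return counts


def missing_backoffs(counts):
    """Return n-grams (order > 1) whose suffix n-gram was never loaded."""
    missing = []
    for ngram in counts:
        suffix = ngram[1:]
        if len(ngram) > 1 and suffix != (END,) and suffix not in counts:
            missing.append(ngram)
    return sorted(missing)


if __name__ == "__main__":
    lines = COUNTS.splitlines()
    print(missing_backoffs(load_counts(lines, skip_hash_comments=True)))
    print(missing_backoffs(load_counts(lines, skip_hash_comments=False)))
```

With comment skipping enabled, the toy loader drops `#hashtag` and `#hashtag </s>`, so `a #hashtag` and `a #hashtag </s>` are left without a backoff. With it disabled, nothing is missing. That matches the observation that only orders above 1 are affected, since unigrams need no backoff lookup.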

What steps will reproduce the problem?

$ estimate-ngram -wc counts -text <(echo 'a #hashtag')
0.001   Loading corpus /dev/fd/63...
0.002   Smoothing[1] = ModKN
0.002   Smoothing[2] = ModKN
0.002   Smoothing[3] = ModKN
0.002   Set smoothing algorithms...
0.002   Saving counts to counts...

$ cat counts
<s>     1
a       1
#hashtag        1
<s> a   1
a #hashtag      1
#hashtag </s>   1
<s> a #hashtag  1
a #hashtag </s> 1

$ estimate-ngram -counts counts -wl lm.arpa
0.001   Loading counts counts...
estimate-ngram: src/NgramModel.cpp:800: void mitlm::NgramModel::_ComputeBackoffs(): Assertion `allTrue(backoffs != NgramVector::Invalid)' failed.
Aborted (core dumped)

What version of the product are you using? On what operating system?

Built from the latest master on GitHub. Ubuntu 14.04.1.

mjwillson, Feb 11 '15 12:02