PreprocessingMetadata enhancement

Open hlibbabii opened this issue 5 years ago • 1 comments

Rename PreprocessingMetadata -> PreppedTokenMetadata
Represent the word_boundaries field as a list of the number of subtokens in each token, e.g. [1, 3, 1, 2] instead of [0, 1, 4, 5, 7]
Remove the non-processible tokens field. Return non-processible tokens as a separate object
Provide a method for returning the metadata for the last tokens:

>>> metadata.for_last_tokens(n: int)
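
A minimal sketch of what the proposed representation and method might look like. This is not the library's actual implementation: the field name `n_subtokens_per_token`, the helper `boundaries_to_counts`, and the reading of `n` as a number of full tokens (rather than subtokens) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


def boundaries_to_counts(word_boundaries: List[int]) -> List[int]:
    """Convert cumulative boundaries, e.g. [0, 1, 4, 5, 7], to per-token
    subtoken counts, e.g. [1, 3, 1, 2]. Hypothetical helper."""
    return [end - start for start, end in zip(word_boundaries, word_boundaries[1:])]


@dataclass
class PreppedTokenMetadata:
    # Assumed field name: the number of subtokens in each full token.
    n_subtokens_per_token: List[int]

    def for_last_tokens(self, n: int) -> "PreppedTokenMetadata":
        """Return metadata covering only the last n full tokens (assumed semantics)."""
        if n < 0:
            raise ValueError("n must be non-negative")
        # Guard n == 0 explicitly, because a [-0:] slice would return the whole list.
        counts = self.n_subtokens_per_token[-n:] if n > 0 else []
        return PreppedTokenMetadata(counts)


metadata = PreppedTokenMetadata(boundaries_to_counts([0, 1, 4, 5, 7]))
last_two = metadata.for_last_tokens(2)
```

With per-token counts, taking the metadata for the last tokens is a plain slice. With cumulative boundaries, the same operation would also need the offsets shifted so they start at zero again.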

Feb 27 '20 13:02 hlibbabii

This enhancement would make it easier to implement the calculation of context statistics in giganticode-langmodels

Feb 27 '20 13:02 hlibbabii