codeprep
PreprocessingMetadata enhancement
- Rename `PreprocessingMetadata` -> `PreppedTokenMetadata`.
- Represent the `word_boundaries` field as a list of the number of subtokens in each token, e.g. [1, 3, 1, 2] instead of [0, 1, 4, 5, 7].
- Remove the non-processible tokens field. Return non-processible tokens as a separate object.
- Provide a method for returning the metadata for the last tokens:
>>> metadata.for_last_tokens(n)  # n: int
This enhancement would make it easier to implement the calculation of context statistics in giganticode-langmodels.
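
To make the proposed representation concrete, here is a minimal sketch of how the two encodings relate and what `for_last_tokens` might look like. The class name `PreppedTokenMetadataSketch`, the field name `n_subtokens_per_token`, and both helper functions are hypothetical and used for illustration only; they are not the existing codeprep API.

```python
from dataclasses import dataclass
from itertools import accumulate
from typing import List


def boundaries_to_counts(word_boundaries: List[int]) -> List[int]:
    """Convert cumulative boundaries, e.g. [0, 1, 4, 5, 7], to per-token subtoken counts."""
    return [end - start for start, end in zip(word_boundaries, word_boundaries[1:])]


def counts_to_boundaries(counts: List[int]) -> List[int]:
    """Inverse conversion: per-token subtoken counts back to cumulative boundaries."""
    return [0] + list(accumulate(counts))


@dataclass
class PreppedTokenMetadataSketch:
    # Hypothetical field: number of subtokens in each full token.
    n_subtokens_per_token: List[int]

    def for_last_tokens(self, n: int) -> "PreppedTokenMetadataSketch":
        """Return metadata covering only the last n full tokens."""
        total = len(self.n_subtokens_per_token)
        if not 0 <= n <= total:
            raise ValueError(f"n must be between 0 and {total}, got {n}")
        # Slice from an explicit start index so that n == 0 yields an empty list.
        return PreppedTokenMetadataSketch(self.n_subtokens_per_token[total - n:])


counts = boundaries_to_counts([0, 1, 4, 5, 7])
metadata = PreppedTokenMetadataSketch(counts)
last_two = metadata.for_last_tokens(2)
```

With per-token counts, taking the metadata for the last tokens is a plain slice. With cumulative boundaries, the same operation would also require shifting every remaining offset so that the list starts at 0 again.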