PreprocessingMetadata enhancement

Open: hlibbabii opened this issue 5 years ago • 1 comment

  • Rename PreprocessingMetadata -> PreppedTokenMetadata
  • Represent the word_boundaries field as a list of the number of subtokens in each token, e.g. [1, 3, 1, 2] instead of [0, 1, 4, 5, 7]
  • Remove the non-processible tokens field. Return non-processible tokens as a separate object
  • Provide a method for returning the metadata for the last tokens:
>>> metadata.for_last_tokens(n: int)
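
To make the proposal concrete, here is a minimal sketch of the two representations and of what `for_last_tokens` might do. It is not codeprep's actual implementation: the class below is a stand-in holding only the proposed field, the name `n_subtokens_per_token` and both helper functions are hypothetical, and it assumes that `n` counts full tokens rather than subtokens.

```python
from dataclasses import dataclass
from typing import List


def boundaries_to_counts(word_boundaries: List[int]) -> List[int]:
    """Convert cumulative boundaries to the number of subtokens per token."""
    # Each count is the difference between two neighbouring boundaries.
    return [end - start for start, end in zip(word_boundaries, word_boundaries[1:])]


def counts_to_boundaries(counts: List[int]) -> List[int]:
    """Inverse conversion: rebuild cumulative boundaries from per-token counts."""
    boundaries = [0]
    for count in counts:
        boundaries.append(boundaries[-1] + count)
    return boundaries


@dataclass
class PreppedTokenMetadata:
    # Hypothetical stand-in for the renamed class; only the proposed field is modelled.
    n_subtokens_per_token: List[int]

    def for_last_tokens(self, n: int) -> "PreppedTokenMetadata":
        """Return metadata for the last n full tokens (assumed meaning of n)."""
        total = len(self.n_subtokens_per_token)
        if n < 0 or n > total:
            raise ValueError(f"n must be between 0 and {total}, got {n}")
        # Slice from an explicit start index so that n == 0 yields an empty list.
        return PreppedTokenMetadata(self.n_subtokens_per_token[total - n:])


metadata = PreppedTokenMetadata(boundaries_to_counts([0, 1, 4, 5, 7]))
last_two = metadata.for_last_tokens(2)
```

With per-token counts, taking the metadata for the last tokens is a plain slice; with cumulative boundaries, the slice would also have to be re-based so that it starts at 0 again.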

hlibbabii, Feb 27 '20 13:02

This enhancement would make it easier to implement the calculation of context statistics in giganticode-langmodels.

hlibbabii, Feb 27 '20 13:02