codeprep

A toolkit for pre-processing large source code corpora

Results 8 codeprep issues

Bumps [joblib](https://github.com/joblib/joblib) from 0.15.1 to 1.2.0. Changelog Sourced from joblib's changelog. Release 1.2.0 Fix a security issue where eval(pre_dispatch) could potentially run arbitrary code. Now only basic numerics are supported....

dependencies

Bumps [pygments](https://github.com/pygments/pygments) from 2.6.1 to 2.7.4. Release notes Sourced from pygments's releases. 2.7.4 Updated lexers: Apache configurations: Improve handling of malformed tags (#1656) CSS: Add support for variables (#1633, #1666)...

dependencies

It seems that it does not work when dealing with JavaScript. Also, is there any way to remove the trailing '\t' at the end of a token in the token sequence?

bug
question

https://github.com/giganticode/codeprep/blob/f5a35b68fab930e095a99dbd83e27f63c23552a4/codeprep/pipeline/to_repr.py#L60 I am working with this repository. On Windows, I find that when a path contains Unicode characters such as Chinese, for example "./文档/", to_repr.py is likely to encode this string to...

* Rename `PreprocessingMetadata` -> `PreppedTokenMetadata`
* Represent the `word_boundaries` field as a list of the number of subtokens in each token, e.g. [1, 3, 1, 2] instead of [0, 1, 4,...

enhancement
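
The change of representation described in this issue can be sketched as a pair of conversion helpers. This is only an illustration, assuming that the older `word_boundaries` format stores cumulative subtoken offsets starting at 0 (the listing truncates that example, so this is a guess). The function names `boundaries_to_counts` and `counts_to_boundaries` are hypothetical and not part of codeprep.

```python
from typing import List


def boundaries_to_counts(boundaries: List[int]) -> List[int]:
    """Turn cumulative subtoken offsets into per-token subtoken counts.

    Assumes (not confirmed by the source) that `boundaries` starts at 0 and
    that each later entry is the offset just past the end of a token.
    """
    return [end - start for start, end in zip(boundaries, boundaries[1:])]


def counts_to_boundaries(counts: List[int]) -> List[int]:
    """Inverse conversion: per-token counts back to cumulative offsets."""
    boundaries = [0]
    for n_subtokens in counts:
        boundaries.append(boundaries[-1] + n_subtokens)
    return boundaries


# Round trip on the counts example quoted in the issue.
counts = [1, 3, 1, 2]
assert boundaries_to_counts(counts_to_boundaries(counts)) == counts
```

One possible argument for the count-based form is that each entry describes a single token on its own, so slicing or concatenating sequences does not require re-basing offsets.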

Currently:

```python
>>> api.basic("getName")
['', 'get', 'Name', '']
```

To be done:

```python
>>> api.basic("getName")
['get', 'Name', '']
```

enhancement
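
To make the requested output shape concrete, here is a minimal sketch of camel-case splitting that appends only a trailing marker and no leading one. It does not reproduce how `api.basic` is implemented. The function `split_camel_case` and its `end_marker` parameter are hypothetical, and the marker defaults to an empty string purely to mirror the output shown in the issue.

```python
import re
from typing import List

# One lowercase/digit run with an optional leading capital, or a run of
# capitals that is not followed by a lowercase letter (an acronym).
_WORD = re.compile(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])")


def split_camel_case(identifier: str, end_marker: str = "") -> List[str]:
    """Split a camelCase identifier and append a single end-of-word marker.

    Hypothetical helper: only the trailing marker is emitted, matching the
    "to be done" output in the issue; no leading marker is added.
    """
    return _WORD.findall(identifier) + [end_marker]


print(split_camel_case("getName"))
```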

The tasks for the new `PreppedTokenSequence` class are to encapsulate getting full tokens from subtokens (which is currently done by `FullTokenIterator` class) and at the same time provide transparent access...

enhancement
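
A rough sketch of what such an encapsulating class might look like follows. It assumes that the sequence stores a flat list of subtokens together with the number of subtokens in each full token. The rest of the issue text is truncated in this listing, so the sketch covers only grouping subtokens into full tokens plus plain iteration over subtokens. The class name `TokenSequenceSketch` and all of its members are hypothetical and are not the codeprep API.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class TokenSequenceSketch:
    """Hypothetical stand-in for a sequence that knows its token boundaries."""

    subtokens: List[str]
    subtoken_counts: List[int]  # number of subtokens in each full token

    def __post_init__(self) -> None:
        # Guard against metadata that does not describe the subtoken list.
        if sum(self.subtoken_counts) != len(self.subtokens):
            raise ValueError("subtoken_counts must sum to len(subtokens)")

    def full_tokens(self) -> Iterator[List[str]]:
        """Yield the subtokens of each full token, one group at a time."""
        start = 0
        for n_subtokens in self.subtoken_counts:
            yield self.subtokens[start:start + n_subtokens]
            start += n_subtokens

    def __iter__(self) -> Iterator[str]:
        # Iterating the sequence directly walks the flat subtoken list.
        return iter(self.subtokens)

    def __len__(self) -> int:
        return len(self.subtokens)


seq = TokenSequenceSketch(["void", "get", "Name", "(", ")"], [1, 2, 1, 1])
for token in seq.full_tokens():
    print(token)
```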

* rename `SplitContainer` to `Identifier`
* make `Identifier` abstract and extend it with `SingleWordIdentifier`, `TwoWordIdentifier`, `ThreeWordIdentifier`, `FourOrMoreWordIdentifier`
* make other classes that have sub-classes abstract

enhancement
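
The proposed hierarchy could be laid out roughly as below, using Python's `abc` module. The four subclass names come from the issue. The `accepts` method, the `words` attribute, and the `make_identifier` factory are assumptions added for illustration and do not describe the actual codeprep design.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence


class Identifier(ABC):
    """Abstract base: cannot be instantiated directly."""

    def __init__(self, words: Sequence[str]) -> None:
        self.words: List[str] = list(words)
        if not self.accepts(len(self.words)):
            raise ValueError(f"{type(self).__name__} cannot hold {len(self.words)} words")

    @staticmethod
    @abstractmethod
    def accepts(n_words: int) -> bool:
        """Return True if this subclass represents identifiers of n_words words."""


class SingleWordIdentifier(Identifier):
    @staticmethod
    def accepts(n_words: int) -> bool:
        return n_words == 1


class TwoWordIdentifier(Identifier):
    @staticmethod
    def accepts(n_words: int) -> bool:
        return n_words == 2


class ThreeWordIdentifier(Identifier):
    @staticmethod
    def accepts(n_words: int) -> bool:
        return n_words == 3


class FourOrMoreWordIdentifier(Identifier):
    @staticmethod
    def accepts(n_words: int) -> bool:
        return n_words >= 4


def make_identifier(words: Sequence[str]) -> Identifier:
    """Hypothetical factory: pick the subclass that matches the word count."""
    for cls in (SingleWordIdentifier, TwoWordIdentifier,
                ThreeWordIdentifier, FourOrMoreWordIdentifier):
        if cls.accepts(len(words)):
            return cls(words)
    raise ValueError("an identifier needs at least one word")


print(type(make_identifier(["get", "Name"])).__name__)
```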