Bugfix/issue 37 fix with non ascii

Open ePaul opened this issue 10 years ago • 0 comments

This is a try to fix https://github.com/hjacobs/codevalidator/issues/37.

I guess it needs some more documentation and some tests, but before I do that, I wanted to show what way I did go. (So please don't merge this yet. I also need to add a better commit comment.)

The problem was that some fix functions had dst.write(fixed.decode()) and later fix_file had fd.write(fixed.encode()), which use the system's default encoding, which in my system seems to be ASCII. Also, this way we got possibly multiple decode steps (one for each problem), and if we have after one of the decode steps a Jalopy run or similar, it gets even more unclear what's happening.

Basically the idea of my fix is that for each file-type, there is one rule which indicates the encoding. It is named after one of the encodings supported by the codecs package. (In the default configuration, this is just utf-8 (for most file types) and ascii (for .properties), there should be some test-cases for latin-1/iso-8859-1, at least, too.) (The function get_encoding_rule(rules) simply iterates over the rule names and passes each one to codecs.lookup, until one of them is successful. For this reason I needed to reorder the rules for the *.properties entry in the default config.)

That encoding is then (other than for the check that one actually can decode the file using this encoding) used for those fix functions which actually work on the content of the file by themselves and don't just call external programs (like Jalopy, pythontidy, etc.). For those methods (for now just _fix_notabs, _fix_nocr and _fix_notrailingws) I invented a decorator @needs_unicode, which wraps the function call into codecs.EncodedFile.

I needed to store the used encoding in a global dict ENCODING_BY_FILE, analogously to the VALIDATION_ERRORS list – please propose a better way to do this, if this is not the way to go.

I noted that for the added functionality it would be useful to have some automated tests (i.e. a bundle of config file, command line arguments, input file and expected output/expected fixed file). Is there some standard python way of doing this?

Aug 02 '15 20:08 ePaul