maxlybbert

9 comments by maxlybbert

If you’re still interested in fixing the “troublesome files,” it sounds like an interesting problem. I’m not aware of any existing tool to autodetect the encoding of a part of...

I’m sorry I didn’t look at this over the weekend. Even so, you’ve made a lot of progress pretty quickly. Hopefully I’ll be able to do something helpful before you’ve...

I've played around with Perl's [`Encode::Guess`](http://perldoc.perl.org/Encode/Guess.html) module, and the early results are promising. I used the following script, and most of the non-UTF-8 portions are in the Windows version of...

Oops. `utf8::is_utf8` doesn’t do what I thought. The script should be:

~~~
#!/bin/env perl
# improved Unicode support starting with 5.14
use v5.14;
use warnings;
use constant codepages => qw{WinLatin1...
~~~

I re-read the documentation to be sure about whether `$enc->encode($line)` always returns UTF-8. It does, with the caveat that `$enc` can be either an object that can convert to UTF-8...

I checked what `$enc` has on error, and it does get an “or”-separated list of candidates. Which is nice, since `Encode::Guess` figures out the encoding only 127 times, compared to...

I recently discovered that ICU (http://icu-project.org) supports encoding detection, so I wrote a short C++ program that detects the encoding, line-by-line, and actually performs the encoding. Unfortunately, some...

I have some changes I want to make to my C++ program. I think I’m wrong about getting mojibake when I fall back to encoding by UTF-8. Instead, I think...

I’m currently only going line-by-line. I don’t think it would be hard to process just the de-spammed portion of each file, though.
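
For anyone following along, here is a rough sketch of the line-by-line idea. It is not the program described above. It does not use ICU or do any real detection. It only checks whether a line is already well-formed UTF-8 and, if not, assumes the line is Windows-1252 (the `WinLatin1` candidate) and transcodes it. The function names are made up for illustration, and the Windows-1252 fallback is an assumption, so a line that mixes valid UTF-8 with stray Windows-1252 bytes would still come out as mojibake.

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>

// True if `s` is well-formed UTF-8 (rejects overlong forms, surrogates,
// truncated sequences, and code points above U+10FFFF).
bool is_valid_utf8(const std::string& s) {
    static const std::uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::size_t i = 0, n = s.size();
    while (i < n) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80) { ++i; continue; }
        std::size_t len;
        std::uint32_t cp;
        if ((c & 0xE0) == 0xC0)      { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false;                       // stray continuation or bad lead byte
        if (i + len > n) return false;           // sequence cut off at end of line
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min_cp[len] || cp > 0x10FFFF) return false;
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;
        i += len;
    }
    return true;
}

// Appends a BMP code point to `out` as UTF-8 (enough for Windows-1252).
void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Transcodes Windows-1252 bytes to UTF-8. Bytes 0x80-0x9F use the table;
// the five unassigned slots map to the matching C1 control code points.
std::string cp1252_to_utf8(const std::string& s) {
    static const std::uint16_t high[32] = {
        0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
        0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178};
    std::string out;
    out.reserve(s.size() * 2);
    for (unsigned char c : s) {
        std::uint32_t cp = (c >= 0x80 && c <= 0x9F) ? high[c - 0x80] : c;
        append_utf8(out, cp);
    }
    return out;
}

// Leaves valid UTF-8 lines alone; treats everything else as Windows-1252.
std::string line_to_utf8(const std::string& line) {
    return is_valid_utf8(line) ? line : cp1252_to_utf8(line);
}

// Converts a stream line by line (hypothetical driver for the sketch).
void convert_stream(std::istream& in, std::ostream& out) {
    std::string line;
    while (std::getline(in, line)) out << line_to_utf8(line) << '\n';
}
```

Calling `convert_stream(std::cin, std::cout)` from `main` would turn this into a filter. Doing the check per line keeps one bad line from affecting how the rest of the file is treated, which is the main reason to go line-by-line. A real detector such as ICU's would be needed for anything other than this single fallback.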