maxlybbert comments

Results: 9 comments by maxlybbert

Status Checkup

If you’re still interested in fixing the “troublesome files,” it sounds like an interesting problem. I’m not aware of any existing tool to autodetect the encoding of a part of...

Status Checkup

I’m sorry I didn’t look at this over the weekend. Even so, you’ve made a lot of progress pretty quickly. Hopefully I’ll be able to do something helpful before you’ve...

I've played around with Perl's [`Encode::Guess`](http://perldoc.perl.org/Encode/Guess.html) module, and the early results are promising. I used the following script, and most of the non-UTF-8 portions are in the Windows version of...

Status Checkup

Oops. `utf8::is_utf8` doesn’t do what I thought. The script should be:

~~~
#!/bin/env perl
# improved Unicode support starting with 5.14
use v5.14;
use warnings;
use constant codepages => qw{WinLatin1...
~~~

Status Checkup

I re-read the documentation to be sure about whether `$enc->encode($line)` always returns utf-8. It does, with the caveat that `$enc` can be either an object that can convert to utf-8...

Status Checkup

I checked what `$enc` has on error, and it does get an “or”-separated list of candidates. Which is nice, since `Encode::Guess` figures out the encoding only 127 times, compared to...

Status Checkup

I recently discovered that ICU (http://icu-project.org) supports encoding detection, so I wrote a short C++ program that detects the encoding, line by line, and actually performs the encoding. Unfortunately, some...

Status Checkup

I have some changes I want to make to my C++ program. I think I’m wrong about getting mojibake when I fall back to encoding by UTF-8. Instead, I think...

Status Checkup

I’m currently only going line-by-line. I don’t think it would be hard to process just the de-spammed portion of each file, though.
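To make the line-by-line idea concrete, here is a minimal sketch of the first step only: deciding which lines are already well-formed UTF-8 and which need a closer look. It is not the ICU-based program described above and does no detection or conversion. It uses only the standard library, and the names `is_valid_utf8` and `non_utf8_lines` are made up for illustration.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: true if `s` is well-formed UTF-8.
// Rejects overlong forms, surrogates, and code points above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80) { ++i; continue; }  // plain ASCII byte

        std::size_t len = 0;       // continuation bytes expected
        unsigned long cp = 0;      // decoded code point
        unsigned long min_cp = 0;  // smallest value allowed for this length
        if ((c & 0xE0) == 0xC0)      { len = 1; cp = c & 0x1F; min_cp = 0x80; }
        else if ((c & 0xF0) == 0xE0) { len = 2; cp = c & 0x0F; min_cp = 0x800; }
        else if ((c & 0xF8) == 0xF0) { len = 3; cp = c & 0x07; min_cp = 0x10000; }
        else return false;         // stray continuation or invalid lead byte

        if (i + len >= n) return false;  // sequence is cut off
        for (std::size_t k = 1; k <= len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // out of range
        i += len + 1;
    }
    return true;
}

// Hypothetical helper: 1-based numbers of the lines in `text`
// that are not well-formed UTF-8.
std::vector<std::size_t> non_utf8_lines(const std::string& text) {
    std::vector<std::size_t> bad;
    std::istringstream in(text);
    std::string line;
    std::size_t lineno = 0;
    while (std::getline(in, line)) {
        ++lineno;
        if (!is_valid_utf8(line)) bad.push_back(lineno);
    }
    return bad;
}
```

Calling `non_utf8_lines` on a file's contents gives the line numbers that would then be handed to a real detector such as `Encode::Guess` or ICU. Restricting the input to the de-spammed portion of each file would only change what string gets passed in.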