
make_corpus fails with UnicodeDecodeError/TypeError

Open alischinsky opened this issue 9 years ago • 2 comments

make_corpus() fails when chunking UTF-8 files while parsing. There may be a "decode('utf-8')" missing somewhere.

This happens in both Python 2 (log) and Python 3 (log).
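To illustrate the class of error being described, here is a minimal Python 3 sketch. It is not taken from corpkit: `join_chunks` and the sample string are hypothetical, and the actual failure site in make_corpus() may well be different. It only shows how undecoded bytes can produce either exception, and how an explicit decode('utf-8') avoids both.

```python
# Illustrative only: none of these names come from corpkit.
# Bytes as they would be read from a UTF-8 file opened in binary mode.
raw = "na\u00efve caf\u00e9".encode("utf-8")


def join_chunks(prefix, chunk):
    """Concatenate a text prefix with a chunk, decoding bytes first."""
    if isinstance(chunk, bytes):
        chunk = chunk.decode("utf-8")
    return prefix + chunk


# Mixing str and undecoded bytes raises TypeError in Python 3.
try:
    "chunk: " + raw
except TypeError as err:
    mixing_error = type(err).__name__

# Decoding non-ASCII bytes with the wrong codec raises UnicodeDecodeError.
try:
    raw.decode("ascii")
except UnicodeDecodeError as err:
    ascii_error = type(err).__name__

# Decoding explicitly as UTF-8 before joining works.
joined = join_chunks("chunk: ", raw)
```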

alischinsky avatar Aug 12 '16 15:08 alischinsky

Texts are opened through saferead() in corpkit/process.py (line 861):


def saferead(path):
    """
    Read a file with detect encoding
    :returns: text and its encoding
    """
    import chardet
    import sys
    if sys.version_info.major == 3:
        enc = 'utf-8'
        with open(path, 'r', encoding=enc) as fo:
            data = fo.read()
        return data, enc
    else:
        with open(path, 'r') as fo:
            data = fo.read()
        try:
            enc = 'utf-8'
            data = data.decode(enc)
        except UnicodeDecodeError:
            enc = chardet.detect(data)['encoding']
            data = data.decode(enc, errors='ignore')
        return data, enc
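Note that the Python 3 branch above opens the file as UTF-8 with no fallback, so a file that is not valid UTF-8 would raise UnicodeDecodeError there. One possible way to handle both versions uniformly is to read bytes and decode explicitly. The following is only a sketch under assumptions, not the fix in corpkit: the name `read_text_with_fallback` and the candidate encoding list are made up for illustration, and it uses a fixed list of encodings from the standard library instead of chardet.

```python
import os
import tempfile


def read_text_with_fallback(path, encodings=("utf-8", "cp1252")):
    """
    Hypothetical helper: read bytes, then try each encoding in turn.
    :returns: decoded text and the encoding that worked
    """
    with open(path, "rb") as fo:
        raw = fo.read()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this cannot fail.
    return raw.decode("latin-1"), "latin-1"


# Small demonstration with two temporary files.
tmpdir = tempfile.mkdtemp()

utf8_path = os.path.join(tmpdir, "utf8.txt")
with open(utf8_path, "wb") as fo:
    fo.write("caf\u00e9".encode("utf-8"))

legacy_path = os.path.join(tmpdir, "legacy.txt")
with open(legacy_path, "wb") as fo:
    fo.write(b"caf\xe9")  # not valid UTF-8

utf8_text, utf8_enc = read_text_with_fallback(utf8_path)
legacy_text, legacy_enc = read_text_with_fallback(legacy_path)
```

Reading in binary mode keeps the behaviour the same on Python 2 and Python 3, at the cost of guessing from a fixed list instead of detecting the encoding.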

I'll be able to get around to these at some point hopefully, but feel free to submit a PR as well! :)

interrogator avatar Aug 12 '16 15:08 interrogator

I tried a quick fix in f36ac38, but it wasn't really much. Then I tried to reproduce the error, and couldn't. If your data isn't particularly sacred, could I get a copy and try it out? (Or, as I said, feel free to submit a PR yourself.)

interrogator avatar Aug 12 '16 16:08 interrogator