
ValueError: Invalid control character at: line 1120 column 21 (char 28474)

Open aliabbasjp opened this issue 9 years ago • 2 comments

I get the following error:

17:42:39: Parsing finished. Moving parsed files into place ...
Traceback (most recent call last):
  File "/home/d/anaconda2/lib/python2.7/site-packages/corpkit/env.py", line 2168, in interpreter
    out = run_command(tokens)  
  File "/home/d/anaconda2/lib/python2.7/site-packages/corpkit/env.py", line 1113, in run_command
    out = command(tokens[1:])
  File "/home/d/anaconda2/lib/python2.7/site-packages/corpkit/env.py", line 1437, in parse_corpus
    parsed = to_parse.parse(**kwargs)  
  File "/home/d/anaconda2/lib/python2.7/site-packages/corpkit/corpus.py", line 930, in parse
    **kwargs
  File "/home/d/anaconda2/lib/python2.7/site-packages/corpkit/make.py", line 356, in make_corpus
    coref=coref, metadata=metadata)
  File "/home/d/anaconda2/lib/python2.7/site-packages/corpkit/conll.py", line 1113, in convert_json_to_conll
    data = json.load(fo)
  File "/home/d/anaconda2/lib/python2.7/json/__init__.py", line 291, in load
    **kw)
  File "/home/d/anaconda2/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/home/d/anaconda2/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/d/anaconda2/lib/python2.7/json/decoder.py", line 380, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Invalid control character at: line 1120 column 21 (char 28474)

aliabbasjp avatar Oct 07 '16 12:10 aliabbasjp

Thanks for these reports. This is a weird one: the JSON output of the CoreNLP parser cannot be parsed by Python's json module. So the problem is not really on corpkit's side, but CoreNLP's.
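For anyone hitting the same traceback, here is a minimal sketch of what the error means and one possible local workaround. Python's json module rejects raw control characters (such as a literal tab) inside string values by default, and the `strict=False` argument relaxes that check. The sample string and the helper name `load_lenient` are made up for illustration; this is not what corpkit does internally.

```python
import json

# Hypothetical sample: the "\t" below becomes a raw tab character inside
# the JSON string value, which strict JSON parsing does not allow.
raw = '{"word": "foo\tbar"}'


def load_lenient(text):
    """Parse JSON, retrying with strict=False if strict parsing fails.

    strict=False tells the decoder to accept control characters inside
    strings instead of raising "Invalid control character at ...".
    """
    try:
        return json.loads(text)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError in Python 3.
        return json.loads(text, strict=False)


data = load_lenient(raw)
print(repr(data["word"]))
```

This only sidesteps the parsing error; the control character is still present in the decoded string, so downstream code may need to handle it.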

Similar bugs have been reported to CoreNLP: https://github.com/stanfordnlp/CoreNLP/issues/241

I'm guessing that it relates to the encoding in your text files. Would you be able to zip and upload the files in the unparsed/parsed versions of the corpus? This would help me diagnose the problem and make a fix.
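If uploading the files is not convenient, a rough way to inspect the problem locally is to look at the text around the character offset given in the error message (`char 28474` in the traceback above). The helper below is a hypothetical sketch, and the path to the CoreNLP JSON output is something you would need to fill in yourself.

```python
def context_around(text, pos, width=20):
    """Return a repr() of the text surrounding offset `pos`.

    repr() makes invisible characters (tabs, form feeds, and so on)
    show up as escape sequences such as '\\t' or '\\x0c'.
    """
    start = max(0, pos - width)
    return repr(text[start:pos + width + 1])


# Usage sketch (the filename is an assumption, not a real corpkit path):
# with open("some_parsed_file.json", encoding="utf-8") as fo:
#     print(context_around(fo.read(), 28474))
print(context_around("abc\tdef", 3, width=2))
```

Because the function prints a window rather than a single character, it stays useful even if the reported offset is off by one.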

interrogator avatar Oct 07 '16 12:10 interrogator

Also, I'd recommend encoding your text files as UTF-8, which should fix this problem in your case. Alternatively, as per the instructions in the issue linked above, update your CoreNLP installation to the GitHub version. If corpkit installed CoreNLP for you, it should be in your ~/corenlp directory.
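A small sketch of the re-encoding step, in case it helps. It rewrites a file as UTF-8 only when the file is not already valid UTF-8. The function name and the default `source_encoding` of Latin-1 are assumptions; you need to know (or guess) what encoding your files actually use, and a wrong guess will produce wrong characters without raising an error.

```python
from pathlib import Path


def reencode_to_utf8(path, source_encoding="latin-1"):
    """Rewrite `path` as UTF-8 if it is not valid UTF-8 already.

    Returns True if the file was rewritten, False if it was left alone.
    `source_encoding` is an assumption about the original encoding.
    """
    raw = Path(path).read_bytes()
    try:
        raw.decode("utf-8")
        return False  # already valid UTF-8, nothing to do
    except UnicodeDecodeError:
        text = raw.decode(source_encoding)
        Path(path).write_bytes(text.encode("utf-8"))
        return True
```

It would be sensible to run this on a copy of the unparsed corpus and then parse again, so the originals stay untouched.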

interrogator avatar Oct 07 '16 12:10 interrogator