UnicodeDecodeError

Open edoost opened this issue 7 years ago • 0 comments

Hi,

When I try to extract an article from varzesh3.com (for example, https://www.varzesh3.com/news/1554055/), I get this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'your_url' is not defined
>>> your_url = 'https://www.varzesh3.com/news/1554055/'
>>> extractor = Extractor(extractor='ArticleExtractor', url=your_url)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/boilerpipe/extract/__init__.py", line 46, in __init__
    self.data = str(self.data, encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

I solved this by replacing line 46 of boilerpipe/extract/__init__.py with: self.data = self.data.decode(encoding, "ignore")
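
A note on the byte in the error message: 0x8b at position 1 matches the second byte of the gzip magic number (0x1f 0x8b), so one possible cause, which is not confirmed in this issue, is that the server returned a gzip-compressed body. If that is the case, decoding with "ignore" would avoid the exception but leave compressed bytes in the text. The sketch below shows one way to handle it, assuming the downloaded body is available as bytes; `decode_body` is a hypothetical helper for illustration, not part of boilerpipe3.

```python
import gzip

# First two bytes of every gzip stream.
GZIP_MAGIC = b"\x1f\x8b"


def decode_body(data: bytes, encoding: str = "utf-8") -> str:
    """Hypothetical helper: decompress a gzip body if needed, then decode it.

    Assumes `data` is the raw response body as bytes.
    """
    if data[:2] == GZIP_MAGIC:
        # The body looks gzip-compressed, so decompress before decoding.
        data = gzip.decompress(data)
    # "replace" marks undecodable bytes instead of silently dropping them.
    return data.decode(encoding, errors="replace")


# Simulate a compressed response body and decode it.
raw = gzip.compress("ورزش سه".encode("utf-8"))
text = decode_body(raw)
```

Uncompressed input passes through unchanged, so the same call can be used whether or not the server compresses its response.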

edoost, Sep 15 '18 18:09