Fix UTF-8 encoding problem
I needed this fix urgently for my own project, and I hope it helps others too. See https://github.com/wasinger/htmlpagedom/issues/18
I've run into the same issue myself now, and this fix would be highly appreciated.
@wasinger Is it possible to get this merged?
In the end I gave up on this project and wrote my own HTML DOM parser, though I am new to starting an open source project.
I think this can help you:
curl_setopt($ch, CURLOPT_ENCODING, 'UTF-8');
@kukungkung Setting that curl option only asks the server for a UTF-8 response; you still have to decode the UTF-8 yourself, and the response may not follow the encoding you asked for. Besides, most sites already send UTF-8. The main problem here is that htmlpagedom's parser cannot handle UTF-8; it is not curl.
The problem is not that UTF-8 is not parsed, just that the result is encoded with HTML entities.
I solved this in my code:
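To illustrate the "decode it yourself" point, here is a minimal sketch, not part of htmlpagedom or curl. It assumes the mbstring extension is available and that the response declares its charset in the Content-Type header (which you could read with `curl_getinfo($ch, CURLINFO_CONTENT_TYPE)`). The helper name `toUtf8` is hypothetical.

```php
<?php
// Hypothetical helper: convert a response body to UTF-8 based on the
// charset declared in its Content-Type header. Assumes ext-mbstring.
function toUtf8(string $body, ?string $contentType): string
{
    // Assumption: treat the body as UTF-8 when no charset is declared.
    $charset = 'UTF-8';
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([\w.:-]+)"?/i', $contentType, $m)) {
        $charset = strtoupper($m[1]);
    }

    if ($charset === 'UTF-8') {
        return $body; // nothing to convert
    }

    // Note: mb_convert_encoding rejects charset names it does not know.
    return mb_convert_encoding($body, 'UTF-8', $charset);
}

// "Grüße" as ISO-8859-1 bytes, as a non-UTF-8 server might send it.
$latin1 = "Gr\xFC\xDFe";
echo toUtf8($latin1, 'text/html; charset=ISO-8859-1'), PHP_EOL;
```

This only covers the transport side; it does not change how htmlpagedom serializes the document afterwards.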
$html = html_entity_decode((string)$crawler, ENT_NOQUOTES, 'UTF-8');
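For anyone trying the workaround above, here is a small standalone sketch of what that call does. The `$serialized` string is a made-up stand-in for the output of `(string)$crawler`, so no htmlpagedom install is needed to run it.

```php
<?php
// Made-up stand-in for the entity-encoded markup that (string)$crawler returns.
$serialized = '<p>Gr&uuml;&szlig;e aus K&ouml;ln &amp; &quot;Umgebung&quot;</p>';

// Decode named entities back into UTF-8 characters.
// ENT_NOQUOTES leaves &quot; and &#039; untouched.
$decoded = html_entity_decode($serialized, ENT_NOQUOTES, 'UTF-8');

echo $decoded, PHP_EOL;
// prints: <p>Grüße aus Köln & &quot;Umgebung&quot;</p>
```

One caveat with this approach: `html_entity_decode` also decodes `&amp;`, `&lt;` and `&gt;`, so escaped markup inside text nodes becomes literal again. It is fine for display, but the result may no longer be valid HTML if the content contained such characters.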
The PRs seem broken because they were created from @shtse8's master branch, so the changes from https://github.com/wasinger/htmlpagedom/pull/19 and https://github.com/wasinger/htmlpagedom/pull/20 are mixed into both pull requests, and perhaps even changes unrelated to either PR.