Fix utf-8 encoding problem

Open shtse8 opened this issue 8 years ago • 6 comments

I needed this urgently for my project, and I hope it helps others too. See https://github.com/wasinger/htmlpagedom/issues/18

shtse8 avatar Feb 27 '17 19:02 shtse8

I've run into the same issue myself now, and this fix would be highly appreciated.

@wasinger Is it possible to get this merged?

ventrec avatar Oct 05 '17 13:10 ventrec

I finally gave up on this project and wrote my own HTML DOM parser. However, I am new to starting an open source project.

shtse8 avatar Oct 06 '17 19:10 shtse8

I think this can help you:

curl_setopt($ch, CURLOPT_ENCODING, 'UTF-8');

kukungkung avatar Oct 21 '17 16:10 kukungkung

@kukungkung Setting that curl option only tries to request the response with UTF-8 encoding; you still have to decode the UTF-8 yourself, and the response may not follow the encoding you asked for. Besides, most sites already send UTF-8. The main problem here is that the htmlpagedom parser does not support UTF-8; it is not curl.
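
To illustrate "decode it yourself": as far as I know, CURLOPT_ENCODING controls the Accept-Encoding request header (content codings such as gzip), not the character set, so any charset conversion has to happen after the transfer. Below is a minimal sketch, assuming the mbstring extension is available. The helper name `to_utf8` is hypothetical and is not part of curl or htmlpagedom, and the fallback to UTF-8 when no charset is declared is an assumption.

```php
<?php
// Hypothetical helper (not part of curl or htmlpagedom): convert a response
// body to UTF-8 using the charset named in a Content-Type header value.
function to_utf8(string $body, ?string $contentType): string
{
    // Assumption: treat the body as UTF-8 when no charset is declared.
    $charset = 'UTF-8';
    if ($contentType !== null
        && preg_match('/charset="?([\w-]+)/i', $contentType, $m)) {
        $charset = $m[1];
    }
    // An unknown charset name makes mb_convert_encoding fail, so real code
    // should validate $charset before calling it.
    return mb_convert_encoding($body, 'UTF-8', $charset);
}

// "caf\xE9" is the ISO-8859-1 byte sequence for "café".
$utf8 = to_utf8("caf\xE9", 'text/html; charset=ISO-8859-1');
```

With curl, the header value could be read after the request with `curl_getinfo($ch, CURLINFO_CONTENT_TYPE)`. Pages that declare their charset only in a `<meta>` tag are not handled by this sketch.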

shtse8 avatar Oct 24 '17 21:10 shtse8

The problem is not that UTF-8 isn't parsed, just that the result is encoded with HTML entities.

I solved this in my code:

$html = html_entity_decode((string)$crawler, ENT_NOQUOTES, 'UTF-8');
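
A minimal sketch of what that call does, using a plain string in place of the crawler. The sample markup is hypothetical and only assumes that non-ASCII characters come back as numeric character references:

```php
<?php
// Hypothetical sample of serialized markup with numeric character references.
$serialized = '<p>Gr&#252;&#223;e aus K&#246;ln</p>';

// Same call as above, applied to a plain string: the references become
// UTF-8 encoded characters.
$html = html_entity_decode($serialized, ENT_NOQUOTES, 'UTF-8');

// Caveat: entities such as &lt; and &amp; are decoded as well, which can
// change the meaning of the markup.
$risky = html_entity_decode('<p>a &lt; b</p>', ENT_NOQUOTES, 'UTF-8');
```

Because of that caveat, this workaround seems safest on content where escaped `<` or `&` characters are not expected.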

glensc avatar Jun 11 '18 09:06 glensc

The PRs seem broken because they were created from @shtse8's master branch, so the changes from https://github.com/wasinger/htmlpagedom/pull/19 and https://github.com/wasinger/htmlpagedom/pull/20 are mixed into both pull requests, perhaps along with changes unrelated to either PR.

glensc avatar Jun 11 '18 09:06 glensc