node-website-scraper

feat: add parsing of response body for encoding

Open phawxby opened this issue 3 years ago • 5 comments

Quickly threw this together; it should work in theory and close #500.

This is my last day before vacation so any additional work to get it merged in will need to be picked up by someone else. @Jeremytijal ?

Changes:

phawxby • Jul 01 '22 09:07

Currently on holiday, so I can't address changes at the moment, but yes, that makes sense.

Regarding the package-lock, how about just updating the nightly to install with --package-lock=false? That way you get install consistency along with the nightly checks. https://docs.npmjs.com/cli/v8/using-npm/config#package-lock

phawxby • Jul 11 '22 07:07

Yeah, this looks cool. For example, the site HTML I'm scraping looks like this:

image

and I end up with:

image

Which is the difference between:

image

And this:

image

Which I'm assuming this would fix?

marcfielding1 • Jul 14 '22 13:07

Of course I could just pull the branch and check, duh...

marcfielding1 • Jul 14 '22 13:07

@phawxby yes, --package-lock=false may work.

@marcfielding1 we expect this PR to fix some encoding issues, especially when the encoding is set inside the HTML file in a meta tag. It would still be nice if you could check whether this branch fixes the issue for you.
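For readers following along, here is a minimal sketch of what "parsing the response body for encoding" can look like. It is not the code from this PR: `sniffMetaCharset` and `decodeHtml` are hypothetical helper names, and the 1024-byte window and the UTF-8 fallback are assumptions made for illustration.

```javascript
// Hypothetical helper (not the PR's implementation): look for a charset
// declaration in the first bytes of an HTML response body.
function sniffMetaCharset(bodyBytes) {
  // Read the head of the document as latin1 so every byte maps to exactly
  // one character and the ASCII markup stays readable.
  const head = Buffer.from(bodyBytes).subarray(0, 1024).toString('latin1');
  // Matches <meta charset="x"> and the http-equiv form containing "charset=x".
  const match = head.match(/<meta[^>]*?charset\s*=\s*["']?\s*([\w-]+)/i);
  return match ? match[1].toLowerCase() : null;
}

// Hypothetical helper: decode the body with the sniffed charset,
// falling back to an assumed default when none is declared.
function decodeHtml(bodyBytes, fallback = 'utf-8') {
  const charset = sniffMetaCharset(bodyBytes) || fallback;
  try {
    return new TextDecoder(charset).decode(bodyBytes);
  } catch (err) {
    // TextDecoder throws a RangeError for labels it does not recognise.
    return new TextDecoder(fallback).decode(bodyBytes);
  }
}

// A page that declares ISO-8859-1 and is served as ISO-8859-1 bytes.
const html = '<html><head><meta charset="iso-8859-1"></head><body>voilà</body></html>';
const bytes = Buffer.from(html, 'latin1');
console.log(sniffMetaCharset(bytes));
console.log(decodeHtml(bytes).includes('voilà'));
```

A real implementation would probably also consult the Content-Type response header and handle byte order marks; the sketch covers only the meta tag case mentioned above.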

s0ph1e • Jul 14 '22 14:07

FYI, I scraped https://tonclubtonmaillot.groupama.fr using this PR (I'm hosting the result at [1]), but got "�" instead of "à" on the index page. I'll try to troubleshoot it when I have a chance.

[1] https://test-node-website-scraper.netlify.app/
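One common way to end up with "�" in place of "à" is decoding single-byte text as UTF-8. The sketch below only illustrates that mechanism. It assumes the bytes are ISO-8859-1, which has not been confirmed for this site, and the variable names are made up for the example.

```javascript
// "à" encoded as ISO-8859-1 (latin1) is the single byte 0xE0.
const latin1Bytes = Buffer.from('à', 'latin1');

// In UTF-8, 0xE0 starts a three-byte sequence. Here it never completes,
// so the decoder substitutes U+FFFD, the replacement character "�".
const wrong = new TextDecoder('utf-8').decode(latin1Bytes);

// Decoding with the matching charset recovers the original character.
const right = new TextDecoder('latin1').decode(latin1Bytes);

console.log(wrong === '\uFFFD');
console.log(right);
```

Once the replacement character has been written to disk the original byte is lost, so a fix has to happen at decode time rather than as a post-processing step.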

joelcapitao • Aug 01 '22 14:08

Sorry @phawxby, this is beyond my skills 😕

Jeremytijal • Aug 16 '22 15:08

I'm closing this PR because similar changes were merged in #504 and will be released in the next version within the next 1-2 days.

s0ph1e • Aug 29 '22 19:08