node-website-scraper feat: add parsing of response body for encoding

Quickly threw this together, it should work in theory and close #500.

This is my last day before vacation so any additional work to get it merged in will need to be picked up by someone else. @Jeremytijal ?

Changes:

Adds parsing of the response body to look for a charset
Removes package-lock.json from .gitignore
Increases ecmaVersion in the eslint config to allow optional chaining, which is an easy way to reduce cognitive complexity. It's supported in Node 14 and above
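
For anyone picking this up, here is a rough sketch of the idea behind the first change. It is not the code in this PR: `getCharsetFromBody` is a hypothetical helper name, and the regex and the 1024-byte window are assumptions made for illustration. It also shows the kind of optional chaining the eslint change allows.

```javascript
// Hypothetical helper, not the PR's actual implementation.
// Scans the start of a response body (a Buffer) for a charset declaration.
function getCharsetFromBody(body) {
  // Decode as latin1 so every byte maps to one character and nothing throws;
  // only the first 1024 bytes are inspected (an assumed, arbitrary window).
  const head = body.toString('latin1', 0, 1024);

  // Matches both <meta charset="utf-8"> and
  // <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
  const match = head.match(/<meta[^>]*?charset\s*=\s*["']?\s*([\w-]+)/i);

  // Optional chaining (Node 14+) avoids the explicit null checks.
  return match?.[1]?.toLowerCase() ?? null;
}

const html = Buffer.from('<html><head><meta charset="ISO-8859-1"></head></html>');
console.log(getCharsetFromBody(html)); // iso-8859-1
```

The returned name would still need to be mapped to a decoder that supports it, which this sketch leaves out.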

Jul 01 '22 09:07 phawxby

Currently on holiday so can't address changes at the moment but yes that makes sense.

Regarding the package-lock, how about just updating the nightly to install with --package-lock=false? That way you keep install consistency and still get the nightly checks. https://docs.npmjs.com/cli/v8/using-npm/config#package-lock

Jul 11 '22 07:07 phawxby

Yeah this looks cool so for example the site html I'm scraping looks like this:

and I end up with:

Which is the difference between:

And this:

Which I'm assuming this would fix?

Jul 14 '22 13:07 marcfielding1

Of course I could just pull the branch and check, duh...

Jul 14 '22 13:07 marcfielding1

@phawxby yes, --package-lock=false may work.

@marcfielding1 we expect this PR to fix some encoding issues, especially when the encoding is set inside the HTML file in a tag. But it would be nice if you could check whether this branch fixes the issue for you

Jul 14 '22 14:07 s0ph1e

FYI, I scraped https://tonclubtonmaillot.groupama.fr (I'm hosting the result in [1]) from this PR, but got "�" instead of "à" on the index page. I'll try to troubleshoot it when I have a chance

[1] https://test-node-website-scraper.netlify.app/
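
For reference, one common way to end up with "�" in place of "à" is decoding ISO-8859-1 bytes as UTF-8. Whether that is what happens on this site is an assumption, not something confirmed here. A minimal sketch (`decodeBytes` is a hypothetical helper):

```javascript
// Hypothetical helper: decode raw bytes with a given Node Buffer encoding.
function decodeBytes(bytes, encoding) {
  return bytes.toString(encoding);
}

// "à" (U+00E0) encoded as ISO-8859-1 is the single byte 0xE0.
const latin1Bytes = Buffer.from('\u00e0', 'latin1');

// 0xE0 alone is not valid UTF-8, so it decodes to U+FFFD (the "�" character).
const wrong = decodeBytes(latin1Bytes, 'utf8');

// Decoding with the encoding the bytes were written in recovers "à".
const right = decodeBytes(latin1Bytes, 'latin1');

console.log(JSON.stringify(wrong), JSON.stringify(right));
```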

Aug 01 '22 14:08 joelcapitao

Sorry @phawxby, it's beyond my skills 😕

Aug 16 '22 15:08 Jeremytijal

I'm closing this PR because similar changes were merged in #504 and will be released in the next version, within the next 1-2 days

Aug 29 '22 19:08 s0ph1e