
feat: improve memory performance and custom container classes

Open · phawxby opened this pull request 3 years ago · 5 comments

This closes #386 and is an extension of #496. It adds new options to store resource data in memory, in memory compressed, or on the filesystem. Unlike #496, this would definitely require a major release as it makes significant changes to resource handling.
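
For illustration, here is a minimal sketch of how the storage modes might be selected. The option name `resourceStorage` is hypothetical (see the PR diff for the actual API); only the three mode values come from this PR.

```js
// Minimal sketch only; `resourceStorage` is a hypothetical option name.
// The three modes (memory, memory-compressed, filesystem) are the ones
// this PR introduces.
import scrape from 'website-scraper';

await scrape({
  urls: ['https://example.com'],
  directory: './output',
  resourceStorage: 'filesystem', // or 'memory' | 'memory-compressed'
});
```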

This PR also restructures a lot of tests and removes fs-extra. It's mostly not needed, and in our larger internal project we found it introduces a lot of compatibility issues, for example with memfs, which would otherwise be the best way of writing filesystem-based tests. But this PR was big enough already.

Edit: Last night I completed a scrape of 11 non-English websites in parallel using filesystem caching. No out-of-memory errors, and a total exported size of 4.2 GB.

phawxby avatar Jun 21 '22 14:06 phawxby

Hey @phawxby

Thanks a lot for your PR. I didn't have time to implement such big changes, so I'm happy to see your proposal 👍 I definitely need more time to check it; I'll try to provide a review during the next 1-2 weeks.

Do you have a way to compare the current version vs these changes? It would be nice to have something that proves these changes decrease memory usage.

s0ph1e avatar Jun 22 '22 13:06 s0ph1e

> Hey @phawxby
>
> Thanks a lot for your PR. I didn't have time to implement such big changes, so I'm happy to see your proposal 👍 I definitely need more time to check it; I'll try to provide a review during the next 1-2 weeks.
>
> Do you have a way to compare the current version vs these changes? It would be nice to have something that proves these changes decrease memory usage.

I guess we could have some people from #386 fire up this branch and see how they fare.

phawxby avatar Jun 22 '22 13:06 phawxby

@s0ph1e do you have any suggestions on how to resolve the Code Climate issues? The limits feel very low to me.

phawxby avatar Jun 23 '22 14:06 phawxby

Nevermind, I figured out a way

phawxby avatar Jun 23 '22 14:06 phawxby

I tested 3 different temp modes and can confirm that they improve memory usage ✅

To test the modes I created a fake endless website with unique images and links on each page, then tried to scrape it with a script running in a Docker container with memory limited to 64 MB (a sketch of such a run follows the list below):

  • with memory mode, the script was killed after ~50 resources requested (~70 MB); nothing was saved to the fs
  • with memory-compressed mode, the scraper was able to save 142 MB+ to the fs
  • with filesystem mode, the scraper was able to save 306 MB+ to the fs
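
For reference, a minimal sketch of what such a memory-limited run could look like. The endless-site URL is a placeholder, and `resourceStorage` is the same hypothetical option name used earlier; only the mode values come from this PR.

```js
// Sketch of the memory-limit test described above.
// Run inside a memory-constrained container, e.g.:
//   docker run --memory=64m -v "$PWD":/app -w /app node:18 node test-scrape.js
import scrape from 'website-scraper';

await scrape({
  urls: ['http://endless.test.local/'],  // placeholder for the fake endless site
  directory: './output',
  resourceStorage: 'memory-compressed',  // hypothetical option name; mode from this PR
});
```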

Another interesting idea for improving memory usage is a smaller requestConcurrency. In my tests, scripts with requestConcurrency: 10 were killed much faster than scripts with requestConcurrency: 5, which in turn were killed faster than requestConcurrency: 1. This makes sense because temporary request data is also kept in memory. Currently concurrency is unlimited by default https://github.com/website-scraper/node-website-scraper/blob/34ecd6af4dac885fca27e75a09a40135382a7ca1/lib/config/defaults.js#L61 and I need to think about whether it should be changed in the next major version.
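
As a usage note, requestConcurrency is an existing option in the linked defaults, so capping it requires no changes from this PR; a minimal sketch:

```js
// Limiting concurrency to reduce peak memory usage.
import scrape from 'website-scraper';

await scrape({
  urls: ['https://example.com'],
  directory: './output',
  requestConcurrency: 1, // lower values trade throughput for a smaller memory footprint
});
```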

s0ph1e avatar Jul 28 '22 12:07 s0ph1e

Ahh man, I was crossing my fingers on this PR getting merged. @phawxby Any insight? Or have you forked this repo, since it seems Sophie doesn't have time right now to maintain it?

scottmas avatar Apr 07 '23 16:04 scottmas

> Ahh man, I was crossing my fingers on this PR getting merged. @phawxby Any insight? Or have you forked this repo, since it seems Sophie doesn't have time right now to maintain it?

Sadly not, and I'm actually no longer employed by the organisation. Feel free to fork, fix the remaining issues, and reopen the PR if you like, as I definitely won't have time at the moment. Sorry.

phawxby avatar Apr 07 '23 16:04 phawxby

There seems to be no way to get the commits since the source repo was deleted. When I pull down this repo and try to cherry-pick the SHAs in your PR, they can't be found. GitHub seems to have the data somewhere, since the PR changes are still visible, but wherever it keeps that info, it's not in the actual git repo.

Do you have access in any form to the repo?

scottmas avatar Apr 07 '23 18:04 scottmas

Hi @scottmas

You can follow this to check out the commits
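
(The linked instructions aren't preserved here, but one standard approach, offered only as a sketch, is GitHub's read-only pull refs, which survive fork deletion; `<PR_NUMBER>` below is a placeholder.)

```sh
# Fetch a PR's commits via GitHub's read-only refs, even after the fork is gone.
git fetch origin pull/<PR_NUMBER>/head:pr-recovered
git checkout pr-recovered
```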

aivus avatar Apr 07 '23 19:04 aivus

Thanks @aivus! Resurrected it here: https://github.com/website-scraper/node-website-scraper/pull/526.

I don't have much context on the changes that have been made recently or the best way to resolve the merge conflicts.

scottmas avatar Apr 08 '23 21:04 scottmas