feat: improve memory performance and add custom container classes
This closes #386 and is an extension of #496. It provides new options to store data in memory, in memory and compressed, or on the filesystem. Unlike #496, this would definitely require a major release as it makes significant changes to resource handling.
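As a rough illustration of how the three modes might be selected when calling the scraper (the option name `tempMode` and the surrounding call are assumptions for this sketch, not necessarily the PR's exact API; only the three mode values come from this thread):

```javascript
// Hedged sketch: `tempMode` is a hypothetical option name; only the three
// mode values ('memory', 'memory-compressed', 'filesystem') come from this thread.
const scrape = require('website-scraper');

scrape({
  urls: ['https://example.com'],
  directory: './output',
  tempMode: 'filesystem' // keep downloaded resources on disk instead of in RAM
}).then((resources) => {
  console.log(`Saved ${resources.length} resources`);
}).catch(console.error);
```

In-memory storage is the fastest but is bounded by available RAM; the compressed and filesystem modes trade CPU or disk I/O for a lower peak memory footprint.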
This PR also restructures a lot of tests and removes fs-extra: it's mostly not needed, and in our larger internal project we found it introduces a lot of compatibility issues, for example with memfs, which would be the natural choice for filesystem-based tests, but this PR was big enough already.
Edit: Last night I completed a scrape of 11 non-English websites in parallel using filesystem caching. No out-of-memory errors, and a total exported size of 4.2 GB.
Hey @phawxby
Thanks a lot for your PR, I didn't have time to implement such big changes so I'm happy to see your proposal 👍 I definitely need more time to check it; I'll try to provide a review within the next 1-2 weeks.
Do you have a way to compare the current version vs these changes? It would be nice to have something that proves these changes decrease memory usage.
I guess we could have some people from #386 fire up this branch and see how they fare.
@s0ph1e do you have any suggestions on how to resolve the Code Climate issues? The limits feel very low to me.
Never mind, I figured out a way.
I tested 3 different temp modes and can confirm that they improve memory usage ✅
To test the modes I created a fake endless website with unique images and links on each page and tried to scrape it with a script from a Docker container with memory limited to 64 MB:

- with `memory` mode the script was killed after ~50 resources requested (~70 MB); nothing was saved to fs
- with `memory-compressed` mode the scraper was able to save 142 MB+ to fs
- with `filesystem` mode the scraper was able to save 306 MB+ to fs
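For reference, a rough sketch of the kind of endless test site described above; Express, the route shapes, and the payload size are assumptions rather than the actual fixture, and the 64 MB cap would come from the container (e.g. Docker's `--memory` flag):

```javascript
// Rough sketch of an endless test site: every page links to a fresh page and
// a fresh image, so the scraper's queue never drains and memory pressure
// builds. Express and the payload size are assumptions, not the real fixture.
const express = require('express');
const crypto = require('crypto');

const app = express();

app.get(['/', '/page/:id'], (req, res) => {
  const next = crypto.randomBytes(8).toString('hex');
  res.send(`<html><body>
    <img src="/image/${next}">
    <a href="/page/${next}">next page</a>
  </body></html>`);
});

app.get('/image/:id', (req, res) => {
  // unique, incompressible payload so every resource costs real memory
  res.type('png').send(crypto.randomBytes(512 * 1024));
});

app.listen(3000, () => console.log('endless site on http://localhost:3000'));
```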
Another interesting idea to improve memory usage is to use a smaller `requestConcurrency`. In my tests, scripts with `requestConcurrency: 10` were killed much faster than scripts with `requestConcurrency: 5`, which in turn were killed faster than ones with `requestConcurrency: 1`. This makes sense because temporary request data is also kept in memory. Concurrency is currently unlimited by default (https://github.com/website-scraper/node-website-scraper/blob/34ecd6af4dac885fca27e75a09a40135382a7ca1/lib/config/defaults.js#L61) and I need to think about whether it should be changed in the next major version.
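To make the concurrency comparison concrete, here is a minimal sketch of capping it; `requestConcurrency` is the real option from the `defaults.js` link above, while the rest of the call is assumed boilerplate for illustration:

```javascript
// Hedged sketch: requestConcurrency is the option from lib/config/defaults.js
// linked above; the other options are assumed boilerplate for illustration.
const scrape = require('website-scraper');

scrape({
  urls: ['https://example.com'],
  directory: './output',
  // a low value keeps fewer in-flight responses buffered in memory at once,
  // at the cost of a slower scrape
  requestConcurrency: 1
}).catch(console.error);
```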
Ahh man, I was crossing my fingers on this PR getting merged. @phawxby Any insight? Or have you branched this repo since it seems Sophie doesn't have time right now to maintain?
Sadly not, and I'm actually no longer employed by the organisation. Feel free to fork, fix the remaining issues, and reopen the PR if you like, as I definitely won't have time at the moment. Sorry.
There seems to be no way to get the commits since the repo was deleted. When I pull down this repo and try to cherry-pick the SHAs from your PR, they can't be found. GitHub seems to have the data somewhere, since the PR changes are still visible, but wherever it keeps that info, it's not in the actual git repo.
Do you have access in any form to the repo?
Thanks @aivus! Resurrected it here: https://github.com/website-scraper/node-website-scraper/pull/526.
I don't have much context on the changes that have been made recently or on the best way to resolve the merge conflicts.