Should not visit pages that have already been visited

Open abineetds opened this issue 3 years ago • 5 comments

How can I make it not visit the same page multiple times? How can I make it so that it doesn't visit any pages outside of the domain?

abineetds • Oct 10 '22 19:10
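
Both questions come down to filtering links before they are queued. Below is a minimal sketch using only the Ruby standard library. The `Frontier` class and its `accept` method are hypothetical names for illustration, not part of vessel, and the thread does not say where such a filter would plug into the crawler.

```ruby
require "set"
require "uri"

# Hypothetical helper, not part of vessel: remembers every URL it has
# accepted and rejects links that leave the starting host.
class Frontier
  def initialize(start_url)
    @host = URI.parse(start_url).host
    @seen = Set.new
  end

  # Returns the normalized URL if it should be crawled, nil otherwise.
  def accept(base_url, href)
    uri = URI.join(base_url, href)
    return nil unless uri.is_a?(URI::HTTP) && uri.host == @host

    uri.fragment = nil                 # "#section" links point at the same page
    uri.path = "/" if uri.path.empty?  # treat "https://host" and "https://host/" alike
    key = uri.to_s
    @seen.add?(key) ? key : nil        # Set#add? returns nil if already present
  rescue URI::Error
    nil                                # skip malformed links
  end
end
```

Calling `accept(current_page_url, href)` for each link on a page and queueing only the non-nil results would keep the crawl on one host and visit each page once. Query strings are kept as part of the key, which may or may not be what a given site needs.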

Also when I ran it with a memo, I got an error eventually

.../gems/ruby-3.1.2/gems/ferrum-0.11/lib/ferrum/browser/web_socket.rb:19:in `initialize': Too many open files - socket(2) for "127.0.0.1" port 65073 (Errno::EMFILE)

abineetds • Oct 11 '22 05:10

> Also when I ran it with a memo, I got an error eventually
>
> .../gems/ruby-3.1.2/gems/ferrum-0.11/lib/ferrum/browser/web_socket.rb:19:in `initialize': Too many open files - socket(2) for "127.0.0.1" port 65073 (Errno::EMFILE)

I think you should tune your OS, for example for Linux.

route • Oct 11 '22 07:10
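
`Errno::EMFILE` means the process has hit its limit on open file descriptors, so one possible reading of "tune your OS" is raising that limit. This is an assumption, since the comment does not spell out which setting is meant. The sketch below inspects the limit from inside Ruby and raises the soft limit. The target of 4096 is an arbitrary value chosen for illustration, and `Process.setrlimit` is not available on every platform (Windows, for example).

```ruby
# Inspect the per-process limit on open file descriptors; this is the
# limit that Errno::EMFILE reports as exhausted.
soft, hard = Process.getrlimit(Process::RLIMIT_NOFILE)
puts "open files: soft=#{soft} hard=#{hard}"

# A process may raise its own soft limit up to the hard limit without
# extra privileges. The target of 4096 is an arbitrary assumption.
target = [4096, hard].min
Process.setrlimit(Process::RLIMIT_NOFILE, target, hard) if soft < target
```

From a shell, `ulimit -n` prints the same soft limit. Raising it only postpones the error if connections are leaking, so it is a workaround rather than a fix.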

As for the issue, I plan to introduce an option for requests, but unfortunately it won't work for all websites, so it's going to be strictly optional.

route • Oct 11 '22 07:10

>> Also when I ran it with a memo, I got an error eventually
>>
>> .../gems/ruby-3.1.2/gems/ferrum-0.11/lib/ferrum/browser/web_socket.rb:19:in `initialize': Too many open files - socket(2) for "127.0.0.1" port 65073 (Errno::EMFILE)
>
> I think you should tune your OS, for example for Linux.

What is the root cause of this? It seems to me that when opening a TCP socket connection, Ferrum opens a file descriptor but never closes it. This shouldn't happen, should it, since the number of pages being processed at once is at most the number of processors (unless overridden)?

abineetds • Oct 11 '22 07:10

Ferrum opens only one connection per page and closes it once the page has been processed, releasing both the page and the connection. So most likely something is wrong with the crawler.

route • Oct 11 '22 07:10
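
The behaviour described in the last two comments, one connection per page, released after processing, with a bounded number of pages in flight, can be sketched with plain threads. This is not vessel's or Ferrum's actual code: `FakePage` is a hypothetical stand-in that only counts open "connections", and `crawl` is an illustrative name.

```ruby
# Hypothetical stand-in for a browser page that owns one connection.
class FakePage
  @open = 0
  @peak = 0
  @lock = Mutex.new

  class << self
    attr_reader :open, :peak

    # Adjust the open-connection counter and remember the highest value seen.
    def track(delta)
      @lock.synchronize do
        @open += delta
        @peak = [@peak, @open].max
      end
    end
  end

  def initialize
    self.class.track(+1)
  end

  def close
    self.class.track(-1)
  end
end

# Processes urls with a fixed number of worker threads, so at most
# `workers` pages (and connections) exist at any moment.
def crawl(urls, workers: 4)
  queue = Queue.new
  urls.each { |u| queue << u }
  queue.close # pop returns nil once the queue is drained

  Array.new(workers) do
    Thread.new do
      while (url = queue.pop)
        page = FakePage.new
        begin
          yield url, page
        rescue StandardError => e
          warn "#{url}: #{e.message}" # log and keep the worker alive
        ensure
          page.close # runs even when the block raises
        end
      end
    end
  end.each(&:join)
end
```

If descriptors still run out with a structure like this, the likely suspects are pages created outside the `ensure` path or a worker count much larger than intended.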