crawlee

Add "exclude" property to enqueueLinksByClickingElements, like "enqueueLinks"

Open AraCoders opened this issue 2 years ago • 1 comments

Which package is the feature request for? If unsure which one to select, leave blank

None

Feature

PlaywrightCrawler's enqueueLinks has two properties: "regexps" and "exclude". However, only "regexps" is present for "enqueueLinksByClickingElements".

Motivation

Consistency between enqueueLinks and enqueueLinksByClickingElements. I have a scope to crawl using "regexps", but I frequently need to filter out some URLs (add them to a blacklist). For enqueueLinks that's easy. For enqueueLinksByClickingElements I had to provide two regexps: one for the normal scope of the crawler, the other a negative lookbehind regex to filter out some of the URLs. However, I think it's still not working as expected, because some URLs are rejected by the first regex but still make it into the enqueued requests because they match the second, negative lookbehind regex.
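The behavior described above is what you would see if the entries in "regexps" are combined with "or". A minimal sketch of that, assuming a URL is accepted when it matches any pattern in the list (this is an assumption about the matching rule, not something the docs confirm); `matchesAny`, the patterns, and the URLs are all hypothetical:

```typescript
// Assumption: a URL is accepted when it matches ANY pattern in the list ("or").
function matchesAny(url: string, patterns: RegExp[]): boolean {
    return patterns.some((re) => re.test(url));
}

// Hypothetical crawl scope. The RegExp constructor is used so "/" needs no escaping.
const scope = new RegExp('^https://example\\.com/docs/');

// Hypothetical "blacklist" written as a negative lookbehind:
// matches any https URL that does NOT end in "/login".
const notLogin = new RegExp('^https://.*(?<!/login)$');

// Out of scope, but it matches `notLogin`, so it is accepted anyway.
const offScopeAccepted = matchesAny('https://other.example.org/page', [scope, notLogin]);

// Meant to be blacklisted, but it matches `scope`, so it is accepted anyway.
const loginAccepted = matchesAny('https://example.com/docs/login', [scope, notLogin]);

console.log({ offScopeAccepted, loginAccepted });
```

Under "or", a second regex can only widen the set of accepted URLs, so it cannot act as a blacklist.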

Ideal solution or implementation, and any additional constraints

Add the "exclude" property to "enqueueLinksByClickingElements". Also make it clear in the docs whether the list of regexes supplied to the "regexps" property works in an "and" or an "or" relationship. Same thing for the relationship between "regexps" and "exclude" when both are supplied.
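One possible reading of the requested semantics, as a sketch rather than the library's actual behavior: "regexps" entries are combined with "or", and any "exclude" match wins over them. `UrlFilter`, `shouldEnqueue`, and the example URLs are hypothetical names made up for illustration:

```typescript
// Hypothetical shape for the two options discussed in this issue.
interface UrlFilter {
    regexps: RegExp[]; // assumed: accept when ANY entry matches ("or")
    exclude: RegExp[]; // assumed: reject when ANY entry matches, checked first
}

function shouldEnqueue(url: string, filter: UrlFilter): boolean {
    // Exclusions take precedence over the scope patterns.
    if (filter.exclude.some((re) => re.test(url))) return false;
    // Assumed: an empty scope list means "no scope restriction".
    if (filter.regexps.length === 0) return true;
    return filter.regexps.some((re) => re.test(url));
}

const filter: UrlFilter = {
    regexps: [new RegExp('^https://example\\.com/docs/')],
    exclude: [new RegExp('/login$')],
};

const candidates = [
    'https://example.com/docs/intro',
    'https://example.com/docs/login',
    'https://other.example.org/page',
];

// Only the in-scope, non-excluded URL survives.
const kept = candidates.filter((url) => shouldEnqueue(url, filter));
console.log(kept);
```

If enqueueLinksByClickingElements accepts a transformRequestFunction the way enqueueLinks does (worth checking against the current docs), a predicate like this could probably be applied there as a workaround until "exclude" is available.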

Alternative solutions or implementations

No response

Other context

No response

AraCoders, Jan 22 '24 20:01

Hey! I noticed that this function also lacks the "limit" option provided by "enqueueLinks", which limits the number of enqueued links. I opened a separate issue for it, #2568, but closed it since I think it also belongs here.

AraCoders, Jul 06 '24 03:07