
keep() instead of addToResult() and sub crawlers

Open otsch opened this issue 1 year ago • 0 comments

New methods Step::keep(), Step::keepAs(), Step::keepFromInput() and Step::keepInputAs() are simpler alternatives to Step::addToResult(), Step::addLaterToResult() and Step::keepInputData(), which are all deprecated now. The new keep methods add data to a keep array in the IO objects. Because no Result object is created and potentially shared by a lot of child outputs, the new keep functionality is less complex, and there is no need for something like addLaterToResult(). Kept properties can also be used with useInputKey(), which is pretty handy.
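To illustrate the idea of a keep array travelling with each IO object, here is a minimal, language-agnostic sketch in Python. It is not the library's PHP API: the `Output` class and the `run_step()` helper are hypothetical names made up for this example, and the merge behavior is an assumption about how kept data could be passed on to child outputs.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Output:
    """Hypothetical stand-in for an IO object: a value plus its own keep array."""
    value: Any
    keep: dict = field(default_factory=dict)


def run_step(parent: Output, values: list, keep_output: bool = False) -> list:
    """Create one child output per value; each child gets a copy of the parent's kept data."""
    children = []
    for value in values:
        kept = dict(parent.keep)  # copy, so sibling outputs never share state
        if keep_output and isinstance(value, dict):
            kept.update(value)  # assumed behavior: array outputs are merged into the keep array
        children.append(Output(value, kept))
    return children


start = Output("https://example.com/authors/1")
authors = run_step(start, [{"author": "Jane Doe"}], keep_output=True)
books = run_step(authors[0], [{"title": "Book A"}, {"title": "Book B"}], keep_output=True)

for book in books:
    print(book.keep)
```

Each book output ends up with its own copy of the kept author data, so no shared object has to be updated later on.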

Another cool new feature is sub crawlers. Any step can now create a sub crawler to fill a property. Example: you have a page about an author with multiple links to detail pages about their books. You can select those links and let a sub crawler fill the author's books property with data from the book detail pages.
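The author and books example can be modelled roughly like this. Again, this is a conceptual Python sketch and not the library's PHP API: `PAGES`, `load_book()` and `fill_with_sub_crawler()` are hypothetical names, and the in-memory "site" replaces real HTTP requests so the sketch runs without network access.

```python
from typing import Callable

# Hypothetical in-memory "site", standing in for the book detail pages.
PAGES = {
    "/books/1": {"title": "First Book", "year": 2019},
    "/books/2": {"title": "Second Book", "year": 2022},
}


def load_book(url: str) -> dict:
    """Stand-in for the sub crawler's own steps (load the page, extract the data)."""
    return PAGES[url]


def fill_with_sub_crawler(output: dict, key: str, sub_crawler: Callable[[str], dict]) -> dict:
    """Replace the links stored under `key` with the data the sub crawler yields for each link."""
    filled = dict(output)  # leave the original output untouched
    filled[key] = [sub_crawler(url) for url in output[key]]
    return filled


author = {"name": "Jane Doe", "books": ["/books/1", "/books/2"]}
result = fill_with_sub_crawler(author, "books", load_book)
print(result)
```

The `books` property starts out as a list of links and ends up holding the data extracted from each book detail page.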

Also new is a Step::outputType() method that returns whether a certain step yields outputs that are associative arrays (or objects), scalar values, or potentially both (mixed). This helps reduce potentially critical problems during a crawler run by validating the steps before the run and throwing an exception (or logging error messages).
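A pre-run check based on such an output type could look roughly like the following Python sketch. It is not the library's implementation: `StepInfo`, `keeps_without_key` and `validate_before_run()` are hypothetical, and the rule that scalar outputs raise an exception while mixed outputs only produce a log message is an assumption made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class OutputType(Enum):
    ASSOCIATIVE = "associative"  # associative arrays or objects
    SCALAR = "scalar"
    MIXED = "mixed"  # potentially both


@dataclass
class StepInfo:
    """Hypothetical description of a step, as seen by the validator."""
    name: str
    output_type: OutputType
    keeps_without_key: bool  # the step wants to keep its outputs without naming a key


def validate_before_run(steps: list) -> list:
    """Raise for steps that can never work; return log messages for steps that might fail."""
    messages = []
    for step in steps:
        if not step.keeps_without_key:
            continue
        if step.output_type is OutputType.SCALAR:
            raise ValueError(f"{step.name}: scalar outputs need a key to be kept")
        if step.output_type is OutputType.MIXED:
            messages.append(
                f"{step.name}: outputs may be scalar, so keeping them without a key could fail"
            )
    return messages


steps = [
    StepInfo("extract", OutputType.ASSOCIATIVE, True),
    StepInfo("links", OutputType.MIXED, True),
]
messages = validate_before_run(steps)
print(messages)
```

Because the check runs before any request is sent, a misconfigured step shows up immediately instead of somewhere in the middle of a long crawler run.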

otsch, Mar 26 '24 17:03