Document Guzzle options for handling errors & timeouts

Open gixxy22 opened this issue 4 years ago • 8 comments

Whether using symfony or simplehtmldom, if a timeout is set and the page times out, it throws an exception and everything stops.

Is there a way to suppress exceptions?

gixxy22 avatar Aug 24 '21 13:08 gixxy22

The using() method takes a second parameter, where you can specify Guzzle request options (or if you're in PHP-land, pass in your own Guzzle Client):

{% set crawler = craft.scraper.using('symfony', {
  http_errors: false,
  timeout: 10,
}).get('https://zombo.com') %}

or

$crawler = Scraper::getInstance()->scraper->using('symfony', $myTweakedGuzzleClient)->get('https://zombo.com');
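
For reference, $myTweakedGuzzleClient above is only an illustrative name; a client carrying the same options as the Twig example might be built along these lines (a sketch, not plugin code):

use GuzzleHttp\Client;

// Request options passed to the Client constructor become its defaults.
$myTweakedGuzzleClient = new Client([
    'http_errors' => false, // don't throw exceptions on 4xx/5xx responses
    'timeout' => 10,        // give up after 10 seconds
]);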

I'll add a note about this ā˜šŸ¼ to the readme.

michaelrog avatar Aug 24 '21 15:08 michaelrog

Cheers Michael,

I managed to figure it out earlier :-)

The problem I'm having now, however, is that if a remote source exceeds the timeout, there's no handling for that and everything halts with an exception. I'm trying to see if I can do something with this.

cheers

mike

gixxy22 avatar Aug 24 '21 15:08 gixxy22

The http_errors option may help with that. šŸ¤žšŸ¼

michaelrog avatar Aug 24 '21 15:08 michaelrog

Unfortunately not; I tried that. It seems to only help with 400/500 status codes, but if the server doesn't respond with a code at all, it hangs.

gixxy22 avatar Aug 24 '21 15:08 gixxy22

Hmmmm... Can you try specifying a shorter timeout duration on your Guzzle client, and disabling http_errors?

The default setting for Guzzle is timeout => 0, i.e. Guzzle will wait indefinitely for the server to return. So you may be bumping into PHP's timeout (or Craft's, or the web server's). What we want is for Guzzle to hit its internal timeout and throw a 408, which http_errors can suppress.
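
For illustration, that combination might look like this on a hand-built client (a sketch; connect_timeout isn't mentioned above, but it is a standard Guzzle request option that separately caps the connection phase, which is where an unresponsive server tends to hang):

use GuzzleHttp\Client;

$client = new Client([
    'http_errors' => false, // don't throw on 4xx/5xx responses
    'connect_timeout' => 3, // max seconds to establish the connection
    'timeout' => 5,         // max seconds for the whole request
]);

$crawler = Scraper::getInstance()->scraper->using('symfony', $client)->get('https://example.com');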

michaelrog avatar Aug 24 '21 15:08 michaelrog

I've tried a timeout of 1. A good example is tesco.com, which seems to hang for anything other than standard browsers. I set the timeout to 1 to avoid waiting for sites like this and hanging everything up.

I have made a try/catch mod on the attached files as a quick fix (roughly sketched below); maybe something that could be incorporated into the repo?
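
Roughly, the idea is something like this (a sketch only, not the actual attached patch; $guzzleClient, $url, and the logging call are stand-ins for whatever the plugin uses internally):

use GuzzleHttp\Exception\TransferException;

try {
    $response = $guzzleClient->request('GET', $url);
} catch (TransferException $e) {
    // Connection failures, timeouts, and (with http_errors on) bad status
    // codes all extend TransferException, so one catch covers them.
    \Craft::warning('Scraper request failed: ' . $e->getMessage(), __METHOD__);
    return null; // a template can then guard with {% if crawler %}
}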

cheers

mike

gixxy22 avatar Aug 24 '21 15:08 gixxy22

Hi Michael,

Sorry to bother you, but I'm trying to use your plugin and I'm still having problems.

This is my code:

{% set client = {
  base_uri: 'http://360coupons.com',
  http_errors: false,
  allow_redirects: false,
  timeout: 3,
} %}

{% set crawler = craft.scraper.using('symfony', client).get(client.base_uri) %}

{% if crawler %} {{ crawler.filter('title').text() }} {% endif %}

However, it doesn't appear to be taking any notice of the Guzzle options; I've set no redirects, but it's still getting redirected.

Any ideas?

thanks

mike

gixxy22 avatar Feb 17 '22 13:02 gixxy22

The client options seem to work with simplehtmldom, but not with symfony.

gixxy22 avatar Feb 17 '22 15:02 gixxy22
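
One avenue that might still be worth trying for the redirect problem: the PHP route from the first reply, handing using() a fully configured Guzzle client rather than an options array, so the settings can't get lost on the way to the symfony parser (a sketch, not a confirmed fix):

use GuzzleHttp\Client;

$client = new Client([
    'base_uri' => 'http://360coupons.com',
    'http_errors' => false,
    'allow_redirects' => false, // report the 3xx instead of following it
    'timeout' => 3,
]);

$crawler = Scraper::getInstance()->scraper->using('symfony', $client)->get('http://360coupons.com');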