
Parse cache data from a different page

Open tomasbedrich opened this issue 10 years ago • 8 comments

Use this URL: http://www.geocaching.com/seek/cdpf.aspx?guid=182a3463-e46e-4401-8697-3ad3ac2a1a42&lc=10 to parse geocache data (possible 2× speedup).

tomasbedrich avatar Jul 22 '15 09:07 tomasbedrich

Do you have any clue how to retrieve the GUID of a cache in the first place, without loading the usual details page? It wouldn't make sense if one had to parse the details page first :laughing:

weinshec avatar Sep 01 '16 17:09 weinshec

Using the log page provides a link to the listing using the guid of the cache: https://www.geocaching.com/seek/log.aspx?wp=GC3RPVZ
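The GUID can be pulled out of that page's HTML with a small regex. A minimal sketch; the helper name and the HTML snippet are illustrative, not pycaching API:

```python
import re

# A cache GUID is a standard lowercase UUID embedded in a "guid=..."
# query parameter of the listing link on the log page.
GUID_RE = re.compile(
    r"guid=([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"
)

def extract_guid(html):
    # Return the first GUID found in the page, or None.
    match = GUID_RE.search(html)
    return match.group(1) if match else None

# Example with a snippet resembling the listing link on the log page:
snippet = '<a href="/seek/cache_details.aspx?guid=182a3463-e46e-4401-8697-3ad3ac2a1a42">'
print(extract_guid(snippet))
```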

FriedrichFroebel avatar Sep 01 '16 17:09 FriedrichFroebel

It is also possible to fetch the GUID using the load_quick() method:

GET https://tiles01.geocaching.com/map.details?i=GC4HTZW
```json
{
  "status": "success",
  "data": [
    {
      "name": "Zmijovec (Amorphophallus)",
      "gc": "GC3RPVZ",
      "g": "182a3463-e46e-4401-8697-3ad3ac2a1a42",
      "available": true,
      "archived": false,
      "subrOnly": false,
      "li": false,
      "fp": "115",
      "difficulty": {
        "text": 4.5,
        "value": "4_5"
      },
      "terrain": {
        "text": 1.0,
        "value": "1"
      },
      "hidden": "11/18/2012",
      "container": {
        "text": "Regular",
        "value": "regular.gif"
      },
      "type": {
        "text": "Traditional Cache",
        "value": 2
      },
      "owner": {
        "text": "Lindbergh007",
        "value": "850af23d-fc83-4d3c-b93a-9e5d6ae359c9"
      }
    }
  ]
}
```
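A response in this shape can be handled with the standard json module; a minimal sketch (field names taken from the payload above, abridged sample embedded so it runs offline):

```python
import json

# Abridged sample of a map.details response, as shown above.
raw = '''
{
  "status": "success",
  "data": [
    {
      "name": "Zmijovec (Amorphophallus)",
      "gc": "GC3RPVZ",
      "g": "182a3463-e46e-4401-8697-3ad3ac2a1a42",
      "available": true,
      "archived": false
    }
  ]
}
'''

payload = json.loads(raw)
assert payload["status"] == "success"
cache = payload["data"][0]
guid = cache["g"]  # the GUID needed for the cdpf.aspx print page
print(cache["gc"], guid)
```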

The question is whether it will be faster to make two lightweight requests or one heavy one.

I would suggest adding GUID parsing to load_quick() and creating a temporary load_by_guid() method, which would populate some basic Cache info using the print page. Then we could measure which is faster.
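The print-page URL such a load_by_guid() would hit is the one from the first comment. A sketch of the URL construction only; the helper name is hypothetical, and I'm assuming `lc` is the number of logs included:

```python
from urllib.parse import urlencode

PRINT_PAGE = "http://www.geocaching.com/seek/cdpf.aspx"

def print_page_url(guid, log_count=10):
    # Build the print-page URL; `lc` presumably limits how many logs
    # the page includes.
    return PRINT_PAGE + "?" + urlencode({"guid": guid, "lc": log_count})

url = print_page_url("182a3463-e46e-4401-8697-3ad3ac2a1a42")
print(url)
```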

tomasbedrich avatar Sep 04 '16 11:09 tomasbedrich

That sounds reasonable to me. I would like to give it a try and will report on the performance comparison as soon as I have a first implementation.

weinshec avatar Sep 04 '16 22:09 weinshec

So here are some numbers... I used the timeit module to profile a call to Cache.load() and compared it to a call of Cache.load_quick() parsing the GUID and subsequently calling Cache.load_by_guid(), which requests the print page. The result is the mean time over 100 calls each.

- Scenario 1 (load()): 1.28 seconds
- Scenario 2 (load_quick() + load_by_guid()): 0.85 seconds

So it seems that two lightweight calls are faster than one heavy one, although not by a factor of 2. Should we go for scenario 2 and rely on two requests, or stick to scenario 1 with a single request?
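The methodology above can be sketched with timeit. The network-bound pycaching calls are stubbed out here so the sketch runs standalone; in the real benchmark the callables would be the Cache methods named above:

```python
import timeit

def mean_time(fn, runs=100):
    # Average wall time of `fn` over `runs` calls, as in the comparison above.
    return timeit.timeit(fn, number=runs) / runs

# In the actual benchmark these would be the pycaching calls:
#   scenario_1 = lambda: cache.load()
#   scenario_2 = lambda: (cache.load_quick(), cache.load_by_guid())
# Stubbed with pure computation so the sketch needs no network access:
scenario = lambda: sum(range(1000))
print(f"{mean_time(scenario):.6f} s per call")
```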

weinshec avatar Sep 09 '16 15:09 weinshec

Nice! I think we should do some refactoring before replacing the original load() method. The big picture is to have multiple load_by_xxx() methods which actually fill the data, and a lightweight load() method which decides which one to use.

But for now, the best you can do is to create a pull request for a separate load_by_guid() method, which would check for the presence of a GUID first (possibly calling load_quick() if needed) and then scrape as many cache details as possible.

Then I would do the refactoring on my own, because it may be a little more complex.

tomasbedrich avatar Sep 09 '16 17:09 tomasbedrich

Following discussion with @twlare from #75:

First of all, there are some gotchas regarding the completeness of loaded attributes. The mentioned refactoring may help, as it would allow end users to control what is important to them, so missing attributes wouldn't matter. Also, it would be good to check whether anything has changed on the cache "print page" (new/removed attributes there).

Please feel free to continue to work on https://github.com/tomasbedrich/pycaching/pull/74, but it will need some rebasing on actual master.

So in summary, what is left to do: check the status of the code (still working? new/removed attributes?), rebase, maybe refactor, and, most importantly, switch the primary algorithm behind the load() method (which must stay backwards-compatible = same API + load the same set of attributes).

tomasbedrich avatar Nov 25 '18 22:11 tomasbedrich

Resurrecting the thread with an email received from Dave:


I have one possible suggestion: when I have scraped in the past I have found that the most reliable way to get the cache info is to request the gpx file, which can be obtained through a very simple POST request.

```python
import requests

s = requests.Session()
# `request_cookies_browser` holds name/value cookie pairs exported from a
# logged-in browser session.
for c in request_cookies_browser:
    s.cookies.set(c['name'], c['value'])

URL = 'https://www.geocaching.com/geocache/GC9EFK2_colorful-pairs'
params = {
    '__EVENTTARGET': '',
    '__EVENTARGUMENT': '',
    'ctl00$ContentBody$lnkGPXDownload': 'GPX+file',
}
response = s.post(URL, params)
GPX = response.text
```

I don't know if it works for non-premium members, but it is very fast and contains almost everything you want, including (usually) about 10 logs. It could speed up the get_cache() function a lot.
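Since the download is XML, the interesting fields could then be read with the standard library. A minimal sketch; the embedded snippet only approximates the shape of a geocaching.com GPX export (the real file carries more fields and the logs), and the namespace is an assumption:

```python
import xml.etree.ElementTree as ET

# Tiny GPX-shaped sample standing in for the downloaded file.
GPX = '''<?xml version="1.0"?>
<gpx xmlns="http://www.topografix.com/GPX/1/0">
  <wpt lat="50.0" lon="14.4">
    <name>GC3RPVZ</name>
    <desc>Zmijovec (Amorphophallus)</desc>
  </wpt>
</gpx>
'''

ns = {"gpx": "http://www.topografix.com/GPX/1/0"}
root = ET.fromstring(GPX)
wpt = root.find("gpx:wpt", ns)
code = wpt.find("gpx:name", ns).text  # the GC code of the cache
print(code, wpt.attrib["lat"], wpt.attrib["lon"])
```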

tomasbedrich avatar Aug 20 '21 22:08 tomasbedrich