Parse cache data from a different page
Use this URL: http://www.geocaching.com/seek/cdpf.aspx?guid=182a3463-e46e-4401-8697-3ad3ac2a1a42&lc=10 to parse geocache data (possibly a 2× speedup).
Do you have any clue how to retrieve the GUID of a cache in the first place without loading the usual details page? It wouldn't make sense if one had to parse the details page first :laughing:
The log page provides a link to the listing using the GUID of the cache: https://www.geocaching.com/seek/log.aspx?wp=GC3RPVZ
It is also possible to fetch the GUID using the load_quick() method:
GET https://tiles01.geocaching.com/map.details?i=GC4HTZW
```json
{
    "status": "success",
    "data": [
        {
            "name": "Zmijovec (Amorphophallus)",
            "gc": "GC3RPVZ",
            "g": "182a3463-e46e-4401-8697-3ad3ac2a1a42",
            "available": true,
            "archived": false,
            "subrOnly": false,
            "li": false,
            "fp": "115",
            "difficulty": {"text": 4.5, "value": "4_5"},
            "terrain": {"text": 1.0, "value": "1"},
            "hidden": "11/18/2012",
            "container": {"text": "Regular", "value": "regular.gif"},
            "type": {"text": "Traditional Cache", "value": 2},
            "owner": {"text": "Lindbergh007", "value": "850af23d-fc83-4d3c-b93a-9e5d6ae359c9"}
        }
    ]
}
```
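For illustration, fetching and parsing the GUID from that endpoint could look roughly like this (the endpoint URL and response shape are taken from the example above; the function names are mine, not pycaching API):

```python
import json
import urllib.request

MAP_DETAILS = "https://tiles01.geocaching.com/map.details?i={}"

def parse_guid(payload):
    """Extract the GUID (the "g" field) from a map.details response."""
    if payload.get("status") != "success" or not payload.get("data"):
        raise ValueError("unexpected map.details response")
    return payload["data"][0]["g"]

def fetch_guid(wp):
    """Fetch map.details for a waypoint code like "GC3RPVZ" and return its GUID."""
    with urllib.request.urlopen(MAP_DETAILS.format(wp)) as resp:
        return parse_guid(json.load(resp))
```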
The question is whether it will be faster to make two lightweight requests or one heavy one.
I would suggest adding GUID parsing to load_quick() and creating a temporary load_by_guid() method, which would populate some basic Cache info using the print page. Then we could measure which is faster.
That sounds reasonable to me. I would like to give it a try and will report on the performance comparison as soon as I have a first implementation.
So here are some numbers... I used the timeit module to profile a call to Cache.load() and compared it to a call to Cache.load_quick() (parsing the GUID) followed by Cache.load_by_guid() (requesting the print page). The results are mean times averaged over 100 calls each.
Scenario 1 (load()): 1.28 seconds
Scenario 2 (load_quick() + load_by_guid()): 0.85 seconds
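Roughly, the comparison was set up like this (a sketch only: the Cache calls are commented out because they need a live session, and the helper name is mine):

```python
import timeit

def mean_time(func, runs=100):
    """Average wall-clock time of calling func, over the given number of runs."""
    return timeit.timeit(func, number=runs) / runs

# With a logged-in pycaching session, the two scenarios would be timed as:
# scenario_1 = lambda: cache.load()                                # one heavy request
# scenario_2 = lambda: (cache.load_quick(), cache.load_by_guid())  # two lightweight requests
# print(mean_time(scenario_1), mean_time(scenario_2))
```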
So it seems that two lightweight calls are faster than one heavy one, although not by a factor of 2. Should we go for scenario 2 and rely on two requests, or stick with scenario 1 and a single request?
Nice! I think we should do some refactoring before replacing the original load() method. The big picture is to have multiple load_by_xxx() methods which would actually fill the data, and a lightweight load() method which would decide which one to use.
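To make the idea concrete, the dispatcher shape could look something like this. This is only a sketch of the proposed structure: the method names follow the discussion, but the bodies are stubs that just record which loader ran, not the real scraping code.

```python
class Cache:
    """Sketch of the proposed load() dispatcher (not the real pycaching class)."""

    def __init__(self, wp, guid=None):
        self.wp = wp
        self.guid = guid
        self.loaded_by = None  # which loader actually filled the data

    def load(self):
        """Decide which load_by_xxx() to use, preferring the lightweight path."""
        if self.guid is None:
            self.load_quick()  # would also parse the GUID as a side effect
        if self.guid is not None:
            self.load_by_guid()      # scrape the print page
        else:
            self.load_by_details()   # fall back to the heavy details page

    def load_quick(self):
        # real code would hit the map.details endpoint and set self.guid
        pass

    def load_by_guid(self):
        self.loaded_by = "guid"      # real code would scrape the print page

    def load_by_details(self):
        self.loaded_by = "details"   # real code would scrape the details page
```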
But for now, the best you can do is create a pull request for a separate load_by_guid() method which would first check for the presence of a GUID (possibly calling load_quick() if needed) and then scrape as many cache details as possible.
Then I would do the refactoring on my own, because it may be a little more complex.
Following discussion with @twlare from #75:
First of all, there are some gotchas regarding the completeness of loaded attributes. The mentioned refactoring may help, as it would allow end users to control what is important for them, so missing attributes wouldn't matter. Also, it may be good to check whether something has changed on the cache "print page" (new/removed attributes there).
Please feel free to continue to work on https://github.com/tomasbedrich/pycaching/pull/74, but it will need some rebasing on actual master.
So in summary, what is left to do: check the status of the code (does it still work? any new/removed attributes?), rebase, maybe refactor, and, most importantly, switch the primary algorithm behind the load() method (which must stay backwards-compatible: same API, loading the same set of attributes).
Resurrecting the thread with an email received from Dave:
I have one possible suggestion: when I have scraped in the past I have found that the most reliable way to get the cache info is to request the gpx file, which can be obtained through a very simple POST request.
```python
import requests

s = requests.Session()
# request_cookies_browser: cookies copied from a logged-in browser session
cookies = [s.cookies.set(c['name'], c['value']) for c in request_cookies_browser]
URL = 'https://www.geocaching.com/geocache/GC9EFK2_colorful-pairs'
params = {'__EVENTTARGET': '',
          '__EVENTARGUMENT': '',
          'ctl00$ContentBody$lnkGPXDownload': 'GPX+file'}
response = s.post(URL, params)
GPX = response.text
```
I don't know if it works for non-premium members, but it is very fast and contains almost everything you want, including (usually) about 10 logs. It could speed up the get_cache() function a lot.
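Once the GPX text is in hand, basic fields can be pulled out with the standard library alone. A minimal sketch, assuming the file uses the GPX 1.0 namespace (geocaching.com exports have historically used it, though the namespace and Groundspeak extension tags should be verified against a real download):

```python
import xml.etree.ElementTree as ET

# assumed namespace; verify against an actual geocaching.com GPX file
GPX_NS = "{http://www.topografix.com/GPX/1/0}"

def parse_gpx_waypoints(gpx_text):
    """Yield (waypoint code, description) pairs from a GPX document string."""
    root = ET.fromstring(gpx_text)
    for wpt in root.iter(GPX_NS + "wpt"):
        code = wpt.findtext(GPX_NS + "name")
        desc = wpt.findtext(GPX_NS + "desc")
        yield code, desc
```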