Suggestion: middle-pgsql stats for hit/miss
If any node in a way is not in the cache, the cost of local_nodes_get_list becomes the cost of a database access.
The difference between a database access for 1 node and one for 10 nodes is small, but the difference between 0 nodes and 1 node is large. In effect, a 90% per-node hit rate can behave like a 0% hit rate (for example, when every way is missing one of its nodes), which makes the current stats less meaningful.
As an alternative way of looking at this, I suggest having middle_pgsql_t keep track of "entire lookup satisfied by cache" vs "entire lookup not satisfied by cache", as I believe that is more meaningful. This would be in addition to the current stats, which would not change.
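To make the idea concrete, here is a rough sketch of the two kinds of counters side by side. Everything in it (`lookup_stats_t`, `record`, `demo_stats`, and the field names) is made up for illustration and is not existing osm2pgsql code; it only assumes that each lookup can tell, per node, whether the node came from the cache.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical counters, not part of the existing middle_pgsql_t.
struct lookup_stats_t {
    std::uint64_t node_hits = 0;       // nodes found in the cache
    std::uint64_t node_misses = 0;     // nodes not found in the cache
    std::uint64_t lookups_avoided = 0; // lookups fully satisfied by the cache
    std::uint64_t lookups_db = 0;      // lookups that needed a database access

    // Record one node-list lookup; in_cache[i] says whether node i was cached.
    void record(std::vector<bool> const &in_cache) {
        std::size_t hits = 0;
        for (bool cached : in_cache) {
            if (cached) {
                ++hits;
            }
        }
        node_hits += hits;
        node_misses += in_cache.size() - hits;
        // A single missing node forces a database access for the lookup.
        // An empty lookup needs no database access and counts as avoided.
        if (hits == in_cache.size()) {
            ++lookups_avoided;
        } else {
            ++lookups_db;
        }
    }

    // Per-node hit rate in percent (the kind of number reported today).
    double node_hit_rate() const {
        std::uint64_t const total = node_hits + node_misses;
        return total == 0 ? 0.0 : 100.0 * node_hits / total;
    }

    // Per-lookup hit rate in percent (the proposed number).
    double lookup_hit_rate() const {
        std::uint64_t const total = lookups_avoided + lookups_db;
        return total == 0 ? 0.0 : 100.0 * lookups_avoided / total;
    }
};

// Illustration: 10 ways of 10 nodes each, every way missing exactly one node.
lookup_stats_t demo_stats() {
    lookup_stats_t stats;
    for (int way = 0; way < 10; ++way) {
        std::vector<bool> in_cache(10, true);
        in_cache[way] = false; // one uncached node per way
        stats.record(in_cache);
    }
    return stats;
}
```

In the `demo_stats()` case, 90 of 100 nodes are cached, so the per-node rate looks healthy, yet every one of the 10 lookups still goes to the database and the per-lookup rate is zero.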
Thoughts?
To show the practical difference, I hacked up some code and got the following when importing New South Wales with various cache sizes:
- 512MB: 105 seconds, 99.97% cache, 238 db hits, 1492504 avoids, 99.98%
- 256MB: 131 seconds, 95.24% cache, 96358 db hits, 1396384 avoids, 93.54%
- 128MB: 327 seconds, 52.91% cache, 896566 db hits, 596176 avoids, 39.94%
- no cache: 399 seconds, 0% cache, 1492742 db hits, 0 avoids, 0%
(99.97% is as high as it can go, because there are referenced nodes which don't exist in the extract)