[ENG-561] GUID Query Optimizations
I would love to hear suggestions of more avenues I can explore for optimization.
Purpose
Make changes to ensure that retrieval of Guids is optimal and doesn't make unnecessary calls to the database
Changes
- Modified the access of a user's guid to use the cached property rather than query the database
Most instances (excepting the changes I made) of .guids being referenced rather than _id were for objects that used different models for their _id field or a combination of the guid id and another id. I replaced a single instance of .guids.first()._id with ._id. I have listed some of these below so others may see instances that I looked into but didn't think were appropriate to change.
- /osf/metadata/serializers.py
- Can’t be changed because a Files
_idis inherited by ObjectIDMixin (as opposed to OptionalGuidMixin)
- Can’t be changed because a Files
- /osf/models/collection.py
-
.guids.first()can’t be changed because it’s being used to filterCollectionSubmissionobjects which don’t solely use the guid id as_id
-
- /osf/models/files.py
- Can’t be changed because a Files
_idis inherited by ObjectIDMixin (as opposed to OptionalGuidMixin)
- Can’t be changed because a Files
- /addons/wiki/models.py
- Can’t be changed because
Comment.root_targetmust be filtered on the Guid object of theWikiPage, not theWikiPage._id.
- Can’t be changed because
QA Notes
Tested the change in the Admin app locally and it looks fine. QA ensuring the functionality of user_search in the Admin app would be useful.
Documentation
N/A
Side Effects
N/A
Ticket
https://openscience.atlassian.net/browse/ENG-561
I think you can improve the cached query here with something like
@cached_property
def _id(self):
try:
guid = self.guids.values_list('_id', flat=True).first()
except IndexError:
return None
Which is:
SELECT "osf_guid"."_id"
FROM "osf_guid"
WHERE ("osf_guid"."object_id" = 357137 AND "osf_guid"."content_type_id" = 28)
ORDER BY "osf_guid"."created" DESC
LIMIT 1
Vs the original query of:
SELECT "osf_guid"."modified",
"osf_guid"."id",
"osf_guid"."_id",
"osf_guid"."content_type_id",
"osf_guid"."object_id",
"osf_guid"."created"
FROM "osf_guid"
WHERE ("osf_guid"."object_id" = 357141 AND "osf_guid"."content_type_id" = 28)
ORDER BY "osf_guid"."created" DESC
LIMIT 1
AFTER
Benchmarking this I find the times marginally improved, I was using inv shell --print-sql
Also is this line still being used? it should be removed if not.
I think the query you provided is more efficient but it looks like it doesn't cache properly. So when I run uday = OSFUser.objects.get(username='[email protected]') followed by uday._id in the shell, a query to the database is run by uday._id.
I'm not 100% sure why but I'm seeing different results on uday.__dict__._prefetched_objects_cache depending on the implementation. With the original implementation, it looks like:
'_prefetched_objects_cache': {'guids': <QuerySet [<id:nzm8g, referent:(<OSFUser(u'[email protected]') with guid u'nzm8g'>)>]>},
and with your suggested implementation it looks like:
'_prefetched_objects_cache': {'guids': <QuerySet [<id:nzm8g, referent:(<OSFUser(u'[email protected]') with guid None>)>]>},.
It doesn't look like guids aren't getting cached on any objects with the values_list implementation.
I do think that if we can get your implementation to cache properly, we should use it.
For :
'_prefetched_objects_cache': {'guids': <QuerySet [<id:nzm8g, referent:(<OSFUser(u'[email protected]') with guid None>)>]>}
it looks like <OSFUser(u'[email protected]') isn't saved because it's guid is none, can you save it and refresh the cache? I'm basically just looking at .__id_cache that seems to be working.
Sorry, I copied+pasted your code and didn't acknowledge that I was trying to run things without returning the guid value. It works as expected and should help at least a little!
Tests are failing because the number of queries exceeds the number specified in specific test's assert statements.
After looking into this a bit more, I believe that the values_list query that John proposed is more efficient in isolation but is less efficient within the context of the GuidMixin.
When retrieving classes that inherit from the GuidMixin (i.e. OSFUser.objects.all()), the guids of that object are 'included' and retrieved as well. The existing _id property took advantage of this because cls.guids.first()._id did not need to make any additional calls to the database. The use of values_list in the _id property requires a call to the database the first time _id is run (after that, _id is cached properly) because the use of values_list initiates a new queryset. As a result, values_list leads to many more additional calls to the database and shouldn't be used in this case (it is an example of an 'n+1 query' which I hadn't heard of before).