osf.io [ENG-561] GUID Query Optimizations

I would love to hear suggestions of more avenues I can explore for optimization.

Purpose

Make changes to ensure that retrieval of Guids is optimal and doesn't make unnecessary calls to the database

Changes

Modified the access of a user's guid to use the cached property rather than query the database

Most instances (excepting the changes I made) of .guids being referenced rather than _id were for objects that used different models for their _id field or a combination of the guid id and another id. I replaced a single instance of .guids.first()._id with ._id. I have listed some of these below so others may see instances that I looked into but didn't think were appropriate to change.

/osf/metadata/serializers.py
- Can’t be changed because a Files _id is inherited by ObjectIDMixin (as opposed to OptionalGuidMixin)
/osf/models/collection.py
- .guids.first() can’t be changed because it’s being used to filter CollectionSubmission objects which don’t solely use the guid id as _id
/osf/models/files.py
- Can’t be changed because a Files _id is inherited by ObjectIDMixin (as opposed to OptionalGuidMixin)
/addons/wiki/models.py
- Can’t be changed because Comment.root_target must be filtered on the Guid object of the WikiPage, not the WikiPage._id.

QA Notes

Tested the change in the Admin app locally and it looks fine. QA ensuring the functionality of user_search in the Admin app would be useful.

Documentation

N/A

Side Effects

N/A

Ticket

https://openscience.atlassian.net/browse/ENG-561

Aug 02 '19 19:08 UdayVarkhedkar

I think you can improve the cached query here with something like

    @cached_property
    def _id(self):
        try:
            guid = self.guids.values_list('_id', flat=True).first()
        except IndexError:
            return None

Which is:

SELECT "osf_guid"."_id"
  FROM "osf_guid"
 WHERE ("osf_guid"."object_id" = 357137 AND "osf_guid"."content_type_id" = 28)
 ORDER BY "osf_guid"."created" DESC
 LIMIT 1

Vs the original query of:

SELECT "osf_guid"."modified",
       "osf_guid"."id",
       "osf_guid"."_id",
       "osf_guid"."content_type_id",
       "osf_guid"."object_id",
       "osf_guid"."created"
  FROM "osf_guid"
 WHERE ("osf_guid"."object_id" = 357141 AND "osf_guid"."content_type_id" = 28)
 ORDER BY "osf_guid"."created" DESC
 LIMIT 1
AFTER

Benchmarking this I find the times marginally improved, I was using inv shell --print-sql

Also is this line still being used? it should be removed if not.

Aug 06 '19 18:08 Johnetordoff

I think the query you provided is more efficient but it looks like it doesn't cache properly. So when I run uday = OSFUser.objects.get(username='[email protected]') followed by uday._id in the shell, a query to the database is run by uday._id.

I'm not 100% sure why but I'm seeing different results on uday.__dict__._prefetched_objects_cache depending on the implementation. With the original implementation, it looks like: '_prefetched_objects_cache': {'guids': <QuerySet [<id:nzm8g, referent:(<OSFUser(u'[email protected]') with guid u'nzm8g'>)>]>}, and with your suggested implementation it looks like: '_prefetched_objects_cache': {'guids': <QuerySet [<id:nzm8g, referent:(<OSFUser(u'[email protected]') with guid None>)>]>},. It doesn't look like guids aren't getting cached on any objects with the values_list implementation.

I do think that if we can get your implementation to cache properly, we should use it.

Aug 07 '19 16:08 UdayVarkhedkar

For : '_prefetched_objects_cache': {'guids': <QuerySet [<id:nzm8g, referent:(<OSFUser(u'[email protected]') with guid None>)>]>} it looks like <OSFUser(u'[email protected]') isn't saved because it's guid is none, can you save it and refresh the cache? I'm basically just looking at .__id_cache that seems to be working.

Aug 07 '19 16:08 Johnetordoff

Sorry, I copied+pasted your code and didn't acknowledge that I was trying to run things without returning the guid value. It works as expected and should help at least a little!

Aug 07 '19 17:08 UdayVarkhedkar

Tests are failing because the number of queries exceeds the number specified in specific test's assert statements.

After looking into this a bit more, I believe that the values_list query that John proposed is more efficient in isolation but is less efficient within the context of the GuidMixin.

When retrieving classes that inherit from the GuidMixin (i.e. OSFUser.objects.all()), the guids of that object are 'included' and retrieved as well. The existing _id property took advantage of this because cls.guids.first()._id did not need to make any additional calls to the database. The use of values_list in the _id property requires a call to the database the first time _id is run (after that, _id is cached properly) because the use of values_list initiates a new queryset. As a result, values_list leads to many more additional calls to the database and shouldn't be used in this case (it is an example of an 'n+1 query' which I hadn't heard of before).

Aug 08 '19 19:08 UdayVarkhedkar