INTPYTHON-617 - Improve DictField to_python performance
Improves performance of large documents with nested DictFields by nearly 20x, addressing #1230.
Modified benchmark script pulled from https://stackoverflow.com/questions/35257305/mongoengine-is-very-slow-on-large-documents-compared-to-native-pymongo-usage/:
import datetime
import itertools
import random
import timeit
from collections import defaultdict
import mongoengine as db
db.connect("test-dicts")
class MyModel(db.Document):
date = db.DateTimeField(required=True, default=datetime.date.today)
data_dict_1 = db.DictField(required=False)
MyModel.drop_collection()
data_1 = ['foo', 'bar']
data_2 = ['spam', 'eggs', 'ham']
data_3 = ["subf{}".format(f) for f in range(5)]
m = MyModel()
tree = lambda: defaultdict(tree)
data = tree()
for _d1, _d2, _d3 in itertools.product(data_1, data_2, data_3):
data[_d1][_d2][_d3] = list(random.sample(range(50000), 20000))
m.data_dict_1 = data
m.save()
def pymongo_doc():
return db.connection.get_connection()["test-dicts"]['my_model'].find_one()
def mongoengine_doc():
model = MyModel.objects.first()
return model
if __name__ == '__main__':
print("pymongo took {:2.2f}s".format(timeit.timeit(pymongo_doc, number=10)))
print("mongoengine took {:2.2f}s".format(timeit.timeit(mongoengine_doc, number=10)))
Before:
pymongo took 0.21s
mongoengine took 4.98s
After:
pymongo took 0.20s
mongoengine took 0.20s
Does your benchmark still apply? You seem to have changed the code substantially, and I would assume that the to_python caching you do at the top level not to be that impactful (but maybe it is). If this is really the case, then it may make sense to move your change down into the base class so this applies to lists too.
Does your benchmark still apply? You seem to have changed the code substantially, and I would assume that the
to_pythoncaching you do at the top level not to be that impactful (but maybe it is). If this is really the case, then it may make sense to move your change down into the base class so this applies to lists too.
What do you by mean "still apply"? The benchmarking script was run against the current master branch as of June with and then without the change to DictField.to_python.
Does your benchmark still apply? You seem to have changed the code substantially, and I would assume that the
to_pythoncaching you do at the top level not to be that impactful (but maybe it is). If this is really the case, then it may make sense to move your change down into the base class so this applies to lists too.What do you by mean "still apply"? The benchmarking script was run against the current master branch as of June with and then without the change to
DictField.to_python.
I wrote what I meant right after
You seem to have changed the (your) code substantially
It's not obvious what commit you ran your benchmarking from, so I was just confirming. The rest of my comment applies. Have you profiled this to see what code is the bottleneck that you're removing?
Does your benchmark still apply? You seem to have changed the code substantially, and I would assume that the
to_pythoncaching you do at the top level not to be that impactful (but maybe it is). If this is really the case, then it may make sense to move your change down into the base class so this applies to lists too.What do you by mean "still apply"? The benchmarking script was run against the current master branch as of June with and then without the change to
DictField.to_python.I wrote what I meant right after
You seem to have changed the (your) code substantially
It's not obvious what commit you ran your benchmarking from, so I was just confirming. The rest of my comment applies. Have you profiled this to see what code is the bottleneck that you're removing?
The speedup comes from not calling to_python on builtin Python types that don't need to be converted. Here's the snakeviz visualization of the profiling for master versus this change:
Master
This PR
The call graph shown is different due to nearly all of the calls from the root being PyMongo driver background operations
The core difference shown is that in the current behavior, DictField.to_python makes 6,000,400 recursive calls to itself even though all of its data is primitive Python types and not other Field objects.
In comparison, with this change, DictField.to_python makes 10 calls without any recursion.
I was curious and had a quick look yesterday but I'm suspecting that it's only working with simple dicts (with native types that don't need any parsing), we must double check that it's also working with nested reference fields or embedded docs or fields that would require conversion when loaded from db. If you look at the basecomplexfield.to_python it has some code related with that that the current proposal is just skipping
I was curious and had a quick look yesterday but I'm suspecting that it's only working with simple dicts (with native types that don't need any parsing), we must double check that it's also working with nested reference fields or embedded docs or fields that would require conversion when loaded from db. If you look at the basecomplexfield.to_python it has some code related with that that the current proposal is just skipping
Is passing the existing test suite sufficient to say that this change doesn't introduce any breaking changes? Or is there a specific example or test I could add to verify correct behavior beyond what already exists in the suite?
Is this the test that proves embedded reference fields still work as expected? https://github.com/MongoEngine/mongoengine/blob/e844138d9d4448c705b7330da654b44bc31e3692/tests/fields/test_dict_field.py#L345-L390