Problem listing objects in the bucket root when metadata becomes massive
Hi.
When I try to list objects in the root of a bucket (using MongoDB as the metadata client), the parameters passed to `\arsenal\lib\storage\metadata\mongoclient\readStream.js` are empty. Once the number of returned docs exceeds 1000, the `marker` parameter gets a value and is used in a string-comparison query. My problem is that when the bucket metadata becomes large (in my case `"size" : 15233527369.0, "count" : 12283724`), this is so time-consuming that the connection fails. As a workaround I used a regex for the root listing and, for pagination, changed `maxKeys` to 10000, but the problem remains.
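To make the pagination behavior concrete, here is a minimal sketch of how a marker-based listing query might be built. The function name and shape are illustrative assumptions, loosely modeled on the behavior described above, not the actual Arsenal implementation:

```javascript
// Hypothetical sketch (not Arsenal's actual code): build the MongoDB
// query for one listing page. The first page has no marker, so the
// query is empty and matches the whole collection; every later page
// filters by string comparison on _id from the marker onward, which
// forces a large range scan on a 12-million-document collection.
function buildListingQuery({ marker } = {}) {
  const query = {};
  if (marker) {
    query._id = { $gt: marker };
  }
  return query;
}

// First page: empty query, matches everything.
console.log(buildListingQuery());
// → {}
// Second page (keys 1001-2000): range scan starting at the marker.
console.log(buildListingQuery({ marker: 'frcl22891/' }));
// → { _id: { '$gt': 'frcl22891/' } }
```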
@siamackSeifi Thanks for the issue. I think the `maxKeys` default is not being set when listing from MongoDB. We are looking into this, and I will post back with more information and then follow up with a bug fix.
@rahulreddy You're welcome.
I think the problem is the querying approach. For listing objects, the `MongoReadStream` class builds a `query` object and runs `find` with it. When listing the root of a bucket, the query stays empty (`{}`), so `c.find(query)` returns everything, and that is where the problem starts. To fix that, I used `query._id = { "$in": [/^((?!\/).)*\/$/i, /^[^/]*$/] };` to apply a filter when the request is for the root. But because `maxKeys` is set to 1000, if more documents are returned, a value gets assigned to `marker`, and for the next 1000 docs (1001 to 2000) the query becomes something like `{ _id: { '$gt': 'frcl22891/' } }`, so the same problem occurs (because I have about 12 million documents). So I set `listingHardLimit` to a value I expect never to exceed (10000, for example), and for `actualMaxKeys` I used `Math.max(constants.listingHardLimit, requestMaxKeys)` instead of `min`. With that, as long as a query never returns more than 10000 docs, only the regex path is used and everything works. But this approach is just a very bad workaround :)
Hope this helps
I looked into this, and one reason CloudServer gets a bigger subset than the max keys is that the returned list doesn't always represent the list intended for the user, as it may contain internal objects (placeholders, versions, etc.). There is a planned optimization to make retrieving a large data set from a collection more efficient.