jackrabbit-oak icon indicating copy to clipboard operation
jackrabbit-oak copied to clipboard

OAK-10682 - [Indexing job] Improve Mongo regex filter to only use positive conditions (no negations)

Open nfsantos opened this issue 1 year ago • 0 comments

The current implementation of filtering excluded paths and custom regex is using a condition like

_id: { $nin: [ /^[0-9]{1,3}:\/content\/dam\/.*$/ ]

Mongo cannot evaluate this condition without retrieving the full document, because a value of _null would also match this condition and the index does not contain null values. Therefore, when the index contains excluded paths, the download will be much slower because Mongo has to retrieve every single document to evaluate the condition.

As a workaround, we can transform the regex on an equivalent one that matches the complement of the original regex using negative lookahead. This allows rewriting the filter condition using only positive conditions, which can be evaluated using only the index.

Performance

As an example of the difference between the two approaches, the two following queries count the number of documents that match the regex filter. Using $not in the regex:

> db.nodes.find({ $and: [{ "_modified": { "$gte": 0 } }, { _id: { $not: { $regex: /^[0-9]{1,3}:\/content\/dam\/.*$/ } } }, { _id: { $not: { $regex: /^[0-9]{1,3}:\/oak:index\/.*$/ } } }] }).sort({ "_modified": 1 }).count()
15338210

4m18s

Using a negated regex to turn the Mongo filter into a positive filter:

> db.nodes.find( { $and: [ {"_modified": {"$gte": 0}}, { _id: { $regex: /^(?![0-9]{1,3}:\/content\/dam\/)/ } }, { _id: { $regex: /^(?![0-9]{1,3}:\/oak:index\/)/ } }]} ).sort({ "_modified":1 }).count()
15338210

39s

The plan for the query with $not fetches every document and then applies the regex filters:

  stage: 'FETCH',
  filter: {
    '$and': [
      {
        _id: {
          '$not': { '$regex': '^[0-9]{1,3}:\\/content\\/dam\\/.*$' }
        }
      },
      {
        _id: { '$not': { '$regex': '^[0-9]{1,3}:\\/oak:index\\/.*$' } }
      }
    ]
  },
  inputStage: {
    stage: 'IXSCAN',
    keyPattern: { _modified: 1, _id: 1 },
    indexName: '_modified_1__id_1',
    isMultiKey: false,
    multiKeyPaths: { _modified: [], _id: [] },
    isUnique: false,
    isSparse: false,
    isPartial: false,
    indexVersion: 2,
    direction: 'forward',
    indexBounds: { _modified: [ '[0, inf.0]' ], _id: [ '[MinKey, MaxKey]' ] }
  }

While the plan for the query with the negated regex applies the filter during the index scan, fetching only the documents that match:

  stage: 'FETCH',
  inputStage: {
    stage: 'IXSCAN',
    filter: {
      '$and': [
        {
          _id: { '$regex': '^(?![0-9]{1,3}:\\/content\\/dam\\/)' }
        },
        { _id: { '$regex': '^(?![0-9]{1,3}:\\/oak:index\\/)' } }
      ]
    },
    keyPattern: { _modified: 1, _id: 1 },
    indexName: '_modified_1__id_1',
    isMultiKey: false,
    multiKeyPaths: { _modified: [], _id: [] },
    isUnique: false,
    isSparse: false,
    isPartial: false,
    indexVersion: 2,
    direction: 'forward',
    indexBounds: { _modified: [ '[0, inf.0]' ], _id: [ '["", {})' ] }
  }

nfsantos avatar Mar 12 '24 11:03 nfsantos