
Extend Find-IshDocumentObj and Find-IshPublicationOutput cmdlets for server-side out-of-memory protection by time slicing

Open ddemeyer opened this issue 7 years ago • 5 comments

A shorter, crisper interactive experience is nice. Programming-wise, an explicit -IshSession is still preferred. Remember you can still use two sessions to compare or migrate content. Attempted as part of #45.

  • [x] Requires #45 merge for New-IshSession adaptions, etc
  • [x] Get-IshEvent protected by -ModifiedSince defaulting to last day
  • [x] Get-IshBackgroundTask protected by -ModifiedSince defaulting to last day
  • [ ] Find-IshBaseline used to return everything, low risk on bringing the server down
  • [ ] Find-IshEDT used to return everything, low risk on bringing the server down
  • [ ] Find-IshOutputFormat used to return everything, low risk on bringing the server down
  • [ ] Find-IshUserGroup used to return everything, low risk on bringing the server down
  • [ ] Find-IshUserRole used to return everything, low risk on bringing the server down
  • [ ] Find-IshUser used to return everything, medium risk on bringing the server down
  • [ ] Find-IshDocumentObj used to return everything, high risk on bringing the server down but would mean breaking behavior compatibility
  • [ ] Find-IshPublicationOutput used to return everything, high risk on bringing the server down but would mean breaking behavior compatibility
  • [ ] Find-IshAnnotation #78 will return everything, medium risk on bringing the server down but would mean breaking behavior compatibility

Thinking out loud... options are...

  1. Keep backward behavior compatibility, even though with an implicit IshSession a single Find-IshDocumentObj could bring everything to its knees. Current 0.x behavior, no code change required.
  2. Keep backward behavior compatibility but time slice by adding optional -ModifiedSince (DeltaDateTimeStart, the year 2000 or so), -ModifiedUntil (DeltaDateTimeEnd, so Now+1day) and -ModifiedStep (DeltaTimeSpan, so per year?). In practice the API calls would use a MODIFIED-ON filter so that each API function call returns less in one go, but if the results are not pipelined in PowerShell the client-side memory could still explode. Preferably with Write-Progress-like behavior. Preferred option if I have the time, and it cleans up the ISHInsights DeltaCrawl code base as well.
    • Note that, I feel, only Find-IshDocumentObj and Find-IshPublicationOutput need this protection. All others are optional for consistency, and can already be achieved through -MetadataFilter
  3. Break compatibility. Do the above -ModifiedSince (DeltaDateTimeStart, defaulting to last day), -ModifiedUntil (DeltaDateTimeEnd, so Now+1day) and -ModifiedStep (DeltaTimeSpan, so more than one day).
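
To make the time slicing in options 2 and 3 concrete, here is a minimal sketch of how a date range could be cut into consecutive windows. Get-TimeSlice is a hypothetical helper and not part of ISHRemote; its -Start, -End and -Step parameters only stand in for the proposed -ModifiedSince, -ModifiedUntil and -ModifiedStep. Note that a TimeSpan cannot express a calendar year exactly, so a "per year" step would be an approximation such as 365 days.

```powershell
# Hypothetical helper, not part of ISHRemote: split [Start, End) into consecutive windows.
function Get-TimeSlice {
    param(
        [Parameter(Mandatory)] [datetime] $Start,   # stands in for -ModifiedSince
        [Parameter(Mandatory)] [datetime] $End,     # stands in for -ModifiedUntil
        [Parameter(Mandatory)] [timespan] $Step     # stands in for -ModifiedStep
    )
    if ($Step -le [timespan]::Zero) { throw 'Step must be a positive time span.' }

    $from = $Start
    while ($from -lt $End) {
        $to = $from + $Step
        if ($to -gt $End) { $to = $End }            # clamp the last window to End
        # Emit each window immediately so a caller can pipeline instead of buffering
        [pscustomobject]@{ From = $from; To = $to }
        $from = $to
    }
}

# Example: slice most of 2018 into windows of 100 days
$slices = @(Get-TimeSlice -Start '2018-01-01' -End '2018-12-17' -Step ([timespan]::FromDays(100)))
$slices | Format-Table From, To
```

Each window would then translate into one server-side Find call with a MODIFIED-ON filter bounded by From and To.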

ddemeyer avatar Dec 17 '18 16:12 ddemeyer

First of all, I think we want to keep backward compatible behavior, but still want to protect the application and database server.

So I would do something like...

  • introduce 2 new optional parameters: -ModifiedSince and -ModifiedStep (possibly also -ModifiedBefore)
  • if only a metadata filter is provided, then we have the current behavior
    • So if they did not filter wisely that might still give an issue
  • if no metadata filter and none of the new optional parameters is provided, I would throw an exception to protect the system
    • I don't see how you can have a good default for the new optional parameters that will make sense for all customers
  • if only -ModifiedSince is provided, I would either use a smart default value for -ModifiedStep (per month if -ModifiedSince is less than 2 years ago, per year if -ModifiedSince is more than 2 years ago) or throw an exception saying that you also need to specify -ModifiedStep
  • if the metadata filter is provided and -ModifiedSince (and -ModifiedStep) are also provided, then I would throw if the MODIFIED-ON is present in the metadata filter
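
The rules above could be sketched as a small decision function. Resolve-ModifiedStep and its parameters are hypothetical names for illustration only, "per month" and "per year" are approximated as 30 and 365 days, and rejecting -ModifiedStep without -ModifiedSince is an assumption that the list above does not cover.

```powershell
# Hypothetical sketch of the proposed parameter rules, not ISHRemote code.
# Returns the step to slice with, or $null when no slicing applies (current behavior).
function Resolve-ModifiedStep {
    param(
        [bool] $HasMetadataFilter,
        [bool] $FilterHasModifiedOn,
        [Nullable[datetime]] $ModifiedSince,
        [Nullable[timespan]] $ModifiedStep,
        [datetime] $Now = (Get-Date)
    )
    $hasSince = $null -ne $ModifiedSince
    $hasStep  = $null -ne $ModifiedStep

    if (-not $hasSince -and -not $hasStep) {
        if ($HasMetadataFilter) { return $null }     # only a filter: current behavior
        throw 'Specify -MetadataFilter or -ModifiedSince to protect the server.'
    }
    if (-not $hasSince) {
        throw '-ModifiedStep requires -ModifiedSince.'   # assumption, not stated in the thread
    }
    if ($HasMetadataFilter -and $FilterHasModifiedOn) {
        throw 'MODIFIED-ON in -MetadataFilter conflicts with -ModifiedSince.'
    }
    if ($hasStep) { return $ModifiedStep }

    # Smart default: roughly monthly for recent ranges, roughly yearly for older ones
    if ($ModifiedSince -gt $Now.AddYears(-2)) { return [timespan]::FromDays(30) }
    return [timespan]::FromDays(365)
}
```

The alternative mentioned in the list, throwing when -ModifiedStep is missing, would replace the last three lines of the function with a single throw.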

HildeVermeiren avatar Nov 15 '19 15:11 HildeVermeiren

Thanks, more food for thought... It looks like we are heading for option 2, so backward compatible, only doing x times more API calls than before. Theoretically somewhat slower, but much more predictable for larger setups. On bigger databases the Find cmdlet without any filter went wrong anyway, as you attempt to pull the full database over.

  • The MODIFIED-ON will be on language level, not logical otherwise you might miss updates of blobs
  • The ModifiedSince default value would be the year 2000 for now, birth date of any database
  • The -ModifiedStep default value for PublicationOutput is 1 year while for DocumentObj it should be smaller, like 2 months. Note that on very big databases, or actually databases where in those 2 months a big legacy import happened, it could still go wrong server-side or client-side - in those scenarios you can override the defaults provided.

Now a legacy conversion could become something better than the below; the Find cmdlet could even show a progress bar

# One Find call, filtered on MODIFIED-ON at language level, piped into an update
Find-IshDocumentObj -MetadataFilter (Set-IshMetadataFilterField -Level Lng -Name MODIFIED-ON -FilterOperator GreaterThan -Value "01/09/2019") |
Set-IshMetadataField -Name FCOMMENTS -Level Lng -Value "Hilde was here" |
Set-IshDocumentObj

ddemeyer avatar Nov 15 '19 16:11 ddemeyer

In all scenarios so far the -ModifiedStep goes up, but you could also count down, so from very recent to the birth date of the database. This way you get recent results first, which often makes more sense.

ddemeyer avatar Feb 28 '20 10:02 ddemeyer

Was looking for more standardized terminology and a way to make querying from Now to database birth date the default. So still pursuing backward compatible option 2.

  • [ ] -ModifiedBefore (instead of -ModifiedUntil) would default to Now+1day (DeltaDateTimeEnd)
  • [ ] -ModifiedAfter (instead of -ModifiedSince) would default to database birth date, so year 2000 (DeltaDateTimeStart). Theoretically the last server-side Find operations will return empty results quite quickly.
  • [ ] -ModifiedStep default value for PublicationOutput is a TimeSpan of 1 year while for DocumentObj it should be smaller, like 3 months (DeltaTimeSpan). Note that on very big databases, or actually databases where in those months a big legacy import happened, it could still go wrong server-side or client-side - in those scenarios you can override the defaults provided. The step would always be used to step back into history.
  • [ ] The three parameters above are all optional, and all have defaults protecting the server-side system. No need to throw. In case -MetadataFilter is also passed, we suggest simply merging the filters. If that causes 3+ MODIFIED-ON filters, so be it - potentially push a Write-Warning out.
  • [ ] Document the potential performance slowdown which can be bypassed by explicitly passing a massive -ModifiedStep, but would need that
  • [ ] Write-Progress is a must; showing the exact count of server-side Find operations and a progress bar.
  • [ ] Only the implementation for Find-IshDocumentObj and Find-IshPublicationOutput is really required. The MODIFIED-ON will be on language level, not logical level, otherwise you might miss updates of blobs
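
A minimal sketch of the checklist above: step back from -ModifiedBefore to -ModifiedAfter in -ModifiedStep windows, with Write-Progress showing the count of Find operations. Invoke-SlicedFind is a hypothetical name, and the actual server-side Find is injected as a script block so that the loop stays independent of any ISHRemote cmdlet signature. The defaults mirror the DocumentObj values proposed above, with 3 months approximated as 90 days.

```powershell
# Hypothetical sketch, not ISHRemote code: walk back through history in windows.
function Invoke-SlicedFind {
    param(
        [Parameter(Mandatory)] [scriptblock] $Find,           # called as: & $Find $after $before
        [datetime] $ModifiedAfter  = [datetime]'2000-01-01',   # assumed database birth date
        [datetime] $ModifiedBefore = (Get-Date).AddDays(1),    # Now+1day
        [timespan] $ModifiedStep   = [timespan]::FromDays(90)  # roughly 3 months
    )
    if ($ModifiedStep -le [timespan]::Zero) { throw 'ModifiedStep must be a positive time span.' }

    # Exact number of Find operations, known up front for the progress bar
    $total  = [math]::Ceiling(($ModifiedBefore - $ModifiedAfter).Ticks / $ModifiedStep.Ticks)
    $call   = 0
    $before = $ModifiedBefore
    while ($before -gt $ModifiedAfter) {
        $after = $before - $ModifiedStep
        if ($after -lt $ModifiedAfter) { $after = $ModifiedAfter }   # clamp the oldest window
        $call++
        Write-Progress -Activity 'Find' -Status "Find operation $call of $total" `
            -PercentComplete ([int](100 * $call / $total))
        & $Find $after $before      # results stream straight to the pipeline, newest window first
        $before = $after
    }
    Write-Progress -Activity 'Find' -Completed
}

# Usage with a stand-in finder that just echoes each window
Invoke-SlicedFind -ModifiedAfter '2019-01-01' -ModifiedBefore '2019-12-27' -Find {
    param($after, $before)
    [pscustomobject]@{ After = $after; Before = $before }
}
```

In a real implementation the script block would presumably wrap Find-IshDocumentObj or Find-IshPublicationOutput with two language-level MODIFIED-ON filters merged into any user-supplied -MetadataFilter; that wiring is left out here because it depends on the cmdlet internals.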

Considered but not required for closing this issue

  • Align parameter set across all Find-* cmdlets, probably Find-IshAnnotation first using MODIFIED-ON on annotation level
  • Customize to other date fields, requiring -ModifiedFieldName and -ModifiedFieldLevel (on multi-card object types, always None on single card types)

ddemeyer avatar Mar 31 '20 20:03 ddemeyer

Investigating further, the idea is good; the performance and accuracy guarantees, however, are not. ISHRemote tries to be version-agnostic where possible, and for #49 there are two reasons to put this idea on hold:

  1. On older Content Manager versions only one date filter (so MODIFIED-ON) will be passed to the initial database query for an API Find operation. This means that potentially all objects are retrieved from the database server to the application server, before they get filtered again to be pushed to the client (so ISHRemote).
  2. On older Content Manager versions, on initial object creation (e.g. Add-IshDocumentObj), the MODIFIED-ON field is not filled in, only the CREATED-ON field as they are in essence the same. So a null on MODIFIED-ON simply complicates matters.

As a reminder, the main problem is how to iterate over all data, even for large enterprise data sets. Where this idea was to iterate over time, we are going back to iterating over the folder structure. Continuing with #92 and #91: together they allow iterating the folder structure and in turn finding content-objects/publicationoutputs based on filter criteria like language or recently changed.

ddemeyer avatar Apr 05 '20 17:04 ddemeyer