FluidFramework Azure Client document recovery API

This item is a follow-up to #8870. It communicates the 1.0 goals and deliverables for the recovery API.

Objective

For 1.0 we aim to surface the basic set of tools a user needs to retrieve data from a corrupted container. From there we hope to learn, validate our assumptions, and make improvements post 1.0. For example, we need to learn:

What “minimal data loss” means to our clients, and what configuration options we need to support that.
What their general expectations are, compared to other services they use.
etc.

General Considerations

In the first iteration we will try to avoid technical decisions that would reduce our flexibility going forward or introduce unnecessary complexity.
We are time constrained by the GA release.
Our short-term solution assumes a certain type of client persona: one that uses Fluid containers for transient real-time collaboration, with alternate storage as the primary source of truth. Again, we still need to learn and verify those assumptions.
Minimal API. We need to make sure that the APIs do not expose the underlying internals of the system. For example, we may not want snapshots to be surfaced even though snapshotting may be used as the underlying mechanism. We should be able to switch mechanisms without significant changes to client code.

Base v1.0 functionality

For v1.0, at minimum, we want to support:

Ability to detect that a document was corrupted.
Ability to query for historical “versions” of a document.
Ability to re-create a new document from a specific version of another (in our case, corrupted) document.

In short, we are looking for an API that allows the user to extract and re-create a version of the document that was not in the corrupted state. The outcome is a new document. As part of the 1.0 solution, we do not look into what further actions clients may take once they have recovered the data.
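The three capabilities above can be pictured as a small client-side loop. This is a minimal sketch only: `RecoveryApi`, `getVersions`, `copyFromVersion`, and `recoverDocument` are hypothetical names invented for illustration, not the actual Azure client API, and the calls are synchronous for brevity where a real service API would be asynchronous.

```typescript
// Hypothetical shapes for illustration only; not the actual Azure client API.
interface DocumentVersion {
    id: string;        // opaque version identifier
    timestamp: number; // save time, in ms since epoch
}

interface RecoveryApi {
    // Lists historical versions of a document.
    getVersions(documentId: string): DocumentVersion[];
    // Creates a new document from one version and returns the new document ID.
    // Throws if that version cannot be loaded (for example, it is corrupted too).
    copyFromVersion(documentId: string, version: DocumentVersion): string;
}

interface RecoveryResult {
    newDocumentId: string;
    version: DocumentVersion;
}

// Walks the history newest-first and returns the first successful copy.
function recoverDocument(
    api: RecoveryApi,
    documentId: string,
    maxAttempts = 5,
): RecoveryResult | undefined {
    const versions = api
        .getVersions(documentId)
        .slice() // copy so the caller's array is not reordered
        .sort((a, b) => b.timestamp - a.timestamp)
        .slice(0, maxAttempts);
    for (const version of versions) {
        try {
            return { newDocumentId: api.copyFromVersion(documentId, version), version };
        } catch {
            // This version is unusable; fall back to the next older one.
        }
    }
    return undefined; // no recoverable version within the attempt budget
}

// In-memory stand-in used to exercise the sketch: version "v3" is corrupted.
const fakeApi: RecoveryApi = {
    getVersions: () => [
        { id: "v1", timestamp: 100 },
        { id: "v3", timestamp: 300 },
        { id: "v2", timestamp: 200 },
    ],
    copyFromVersion: (documentId, version) => {
        if (version.id === "v3") {
            throw new Error("corrupted version");
        }
        return `${documentId}-recovered-from-${version.id}`;
    },
};

const recovered = recoverDocument(fakeApi, "doc-1");
console.log(recovered?.newDocumentId); // doc-1-recovered-from-v2
```

The result is always a new document ID, which matches the scope above: what the application does with the recovered document is left to the consuming code.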

What we are not supporting for v1.0

There are certain capabilities that will not come with v1.0, for one or more of the following reasons:

(1) We still need to learn about use cases and needs before setting the direction. (2) The FRS/FF teams do not have sufficient time to offer supporting features for v1. (3) The capability introduces complexity that does not justify the cost (yet).

We are not looking for the ability to reset the identity of the document. In other words, the consuming code will need to decide what to do with the recovered container. It is a new instance with a different document ID than the original (corrupted) container. As we look at alternative mechanisms (say, leveraging a native storage mechanism in the future), we may have an easier path towards implementing this functionality. The Git API may offer interesting options we could explore (document IDs are really just refs that can be repointed to any commit/snapshot).
We are not looking at any logic that deals with synchronization of clients to ensure they all land on the same recovered document. Through examples we may consider synchronization capabilities layered on top of FF, but this is not a priority for v1.
We are not looking at the capability to "lock" a document (and allow no further actions). We are assuming that all connected clients will reach the corrupted state fairly quickly, but that is only an assumption. We may need this functionality only if we find that not all clients fail fast.
Different customers may have different ideas of what “minimal data loss” is. We are not exposing configuration for how often document versions are saved or for the maximum number of versions per document. We may expose this functionality as an improvement post-GA, as we learn more about client needs.
We are not offering the ability to clean up/delete unused documents that may result from usage of this API. We are assuming that corruption will not occur frequently and that recovered docs will usually be small (see the assumptions above about the type of client persona).

A more detailed take on post-v1.0 efforts, along with open topics, is in #9670.

Design Considerations

Overview

For 1.0 we will leverage container snapshots for recovery. This is a native FF mechanism that offers FRS clients the most cost-effective way to preserve document versions, without introducing another storage-level versioning layer (which we do not have) that would need its own considerations around management, configuration, and cost. With this approach, each "version" of the document effectively maps to a summary/snapshot that can be used to hydrate/create another document. The possible paths are described in more detail below.

Assumptions

The following assumptions will determine the effectiveness of this solution:

FF should converge on the corrupted state fairly quickly. Therefore, one of the most recent FF snapshots/summaries should yield a valid container. In other words, the API consumer will not have to cycle through a long history of document versions to recover data.
We will rely on the container's ability to detect the corrupted state and close itself with the appropriate (Corrupted Data) event emitted. Whether the error was rooted in FF (client) code, the server, or a user action, the final outcome is a container that detects corruption, closes, and emits an error event. We need to keep in mind that a "corrupted" container could be caused by a bad session and could be recoverable by reopening. We are leaving it up to application code to handle those variations.
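Since the handling of a "corrupted" close is left to application code, one possible policy is to reopen once before falling back to recovery. The sketch below is an assumption-laden illustration: the `ContainerCloseError` shape, the `decideCloseAction` helper, and the use of the string `"dataCorruptionError"` as the corruption marker are assumptions for this example, not a statement of how the container reports errors.

```typescript
// Hypothetical error shape. "dataCorruptionError" is assumed to be the error type
// attached to the close event when the container detects corruption.
interface ContainerCloseError {
    errorType: string;
    message: string;
}

type CloseAction = "none" | "reopen" | "recover";

// Decides what the application does after a container closes.
// A first corruption report may be caused by a bad session, so the policy
// reopens up to maxReopenAttempts times before starting recovery.
function decideCloseAction(
    error: ContainerCloseError | undefined,
    reopenAttempts: number,
    maxReopenAttempts = 1,
): CloseAction {
    if (error === undefined || error.errorType !== "dataCorruptionError") {
        return "none"; // normal close or an unrelated error: outside this sketch
    }
    return reopenAttempts < maxReopenAttempts ? "reopen" : "recover";
}

const corruption: ContainerCloseError = {
    errorType: "dataCorruptionError",
    message: "bad op",
};
console.log(decideCloseAction(corruption, 0)); // reopen
console.log(decideCloseAction(corruption, 1)); // recover
console.log(decideCloseAction(undefined, 0)); // none
```

Keeping the policy in a pure function makes it easy for an application to substitute its own rule, for example skipping the reopen step entirely.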

Dependencies

We have a couple of external dependencies here:

FRS should keep snapshots long enough to provide a meaningful history to query data from. Various data/privacy policies may affect this ability down the road.
Given that we will be creating new docs from downloaded summaries, we will likely require the ability to create new FRS documents starting with a sequence number that is NOT zero.

"Rehydration" options

Currently, FF only supports rehydration from "detached" snapshots. Now we need to surface the ability to hydrate from downloaded snapshots, which means we will have to properly digest/handle various transient data. In particular:

General runtime-level data that needs to be cleaned up: quorum, elected summarizer, etc.
Sequence numbers. Given that most of the consensus-based DDSes depend on a running sequence number, we will need to preserve the starting sequence number from the downloaded summary as we hydrate the new document. Full implications are still being determined.
Connected clients. Some consensus-based DDSes (like task manager) depend on client information. We can likely use quorum data to clean up any client information when the DDS is loaded.

We have three options here:

(1) Fully Sanitize/Validate snapshot before loading

Here we would clean up all transient data before handing the snapshot over to the loader. Cleanup means:

Removing all client data.
Resetting the sequence number to zero.
General cleanup of runtime data (e.g., summarizer).

Since DDS summaries depend on the sequence number, we would end up placing the burden of sanitization on DDSes as well. Further, we would need to maintain this logic outside of the loader/runtime/DDS code, which would make it hard to maintain, as we cannot fully rely on the version number to determine the valid shape of snapshot trees.
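To make the cost of option (1) concrete, here is a minimal sketch of an up-front sanitizer. The `SimpleSnapshot` shape is a heavily simplified assumption made for illustration; real snapshot trees are richer, and each DDS may embed sequence numbers in its own format, which is the maintenance burden described above.

```typescript
// Hypothetical, heavily simplified snapshot shape; not the real snapshot tree format.
interface ChannelSummary {
    sequenceNumber: number; // sequence number the DDS summary was taken at
    content: string;
}

interface SimpleSnapshot {
    sequenceNumber: number;
    clientIds: string[]; // connected clients recorded in the quorum
    electedSummarizerClientId: string | undefined;
    channels: { [channelId: string]: ChannelSummary };
}

// Returns a sanitized copy; the downloaded snapshot is left untouched.
function sanitizeSnapshot(snapshot: SimpleSnapshot): SimpleSnapshot {
    const channels: { [channelId: string]: ChannelSummary } = {};
    for (const channelId of Object.keys(snapshot.channels)) {
        // Each DDS summary carries its own sequence number, so it is reset as well.
        channels[channelId] = {
            content: snapshot.channels[channelId].content,
            sequenceNumber: 0,
        };
    }
    return {
        sequenceNumber: 0, // reset the sequence number to zero
        clientIds: [], // remove all client data
        electedSummarizerClientId: undefined, // general runtime cleanup
        channels,
    };
}

const downloaded: SimpleSnapshot = {
    sequenceNumber: 42,
    clientIds: ["client-a", "client-b"],
    electedSummarizerClientId: "client-a",
    channels: { text: { sequenceNumber: 40, content: "hello" } },
};
const sanitized = sanitizeSnapshot(downloaded);
console.log(sanitized.sequenceNumber, sanitized.clientIds.length); // 0 0
```

Note that the sanitizer has to know the layout of every channel it touches, so any change to a DDS summary format would also require a change here.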

(2) Fully Sanitize/Validate snapshot while loading

This is similar to the option above; the difference is that sanitization is embedded in the loader/runtime/DDS flow. It is up to each individual layer to interpret/sanitize its part of the snapshot tree.

(3) Allow Loader/Container/DDSes to digest downloaded summaries

With this approach, the "detached" flow will ensure that base top-level data (clients, quorum) is sanitized, but the rest of the snapshot can be consumed by the loader/runtime/DDSes as is. In other words, each layer should be able to:

Validate any transient data before applying it. For example, OrderedClientElection validates that the elected client listed in the snapshot is actually present, and discards the snapshot information (about the elected summarizer) if it is not.
Handle a non-zero sequence number in the detached state. Most of our existing logic already does this. For example, some consensus-based DDSes (such as sequence) rely on ops and the delta manager to bump the minimum sequence number as part of the normal flow. When we initialize DDSes we could ensure that the delta manager is always the source of truth.

If we do not find any major objections, this approach will give us the most flexibility and cohesion going forward.
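The "validate before applying" idea in option (3) can be sketched as follows. This is not the OrderedClientElection implementation; `ElectionState` and `validateElection` are hypothetical names, and the state is reduced to the two fields needed to show the check.

```typescript
// Hypothetical simplified election state read from a downloaded snapshot.
interface ElectionState {
    electedClientId: string | undefined;
    electionSequenceNumber: number;
}

// Keeps the elected client only if it is present among the current quorum members.
// Otherwise the stale election data is discarded so a new election can take place.
function validateElection(state: ElectionState, quorumClientIds: string[]): ElectionState {
    const elected = state.electedClientId;
    if (elected !== undefined && quorumClientIds.indexOf(elected) !== -1) {
        return state;
    }
    return {
        electedClientId: undefined,
        electionSequenceNumber: state.electionSequenceNumber,
    };
}

const fromSnapshot: ElectionState = { electedClientId: "client-a", electionSequenceNumber: 42 };

// In a freshly hydrated document none of the old clients are connected.
console.log(validateElection(fromSnapshot, []).electedClientId); // undefined
console.log(validateElection(fromSnapshot, ["client-a"]).electedClientId); // client-a
```

The non-zero sequence number is left untouched here, in line with letting each layer consume the rest of the snapshot as is.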

One note here is that the document service should be able to create new docs with a non-zero sequence number. Today, ODSP may not be able to do that, while FRS can. We can handle these variations at the document service factory level, where each factory can state the capabilities it supports. Such a capability flag would let us expose this feature only on FRS to start with, so we can learn and fix bugs/issues there first.
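A capability flag at the factory level could look roughly like this. The interface, the flag name `supportsNonZeroInitialSequenceNumber`, and the guard function are all hypothetical; the ODSP value below only mirrors the "may not be able to" statement above and is an assumption.

```typescript
// Hypothetical capability descriptor; the flag name is an assumption for illustration.
interface DocumentServiceCapabilities {
    supportsNonZeroInitialSequenceNumber: boolean;
}

interface DocumentServiceFactoryLike {
    name: string;
    capabilities: DocumentServiceCapabilities;
}

// Throws when recovery needs a starting sequence number the service cannot honor.
function assertCanCreateFromSnapshot(
    factory: DocumentServiceFactoryLike,
    snapshotSequenceNumber: number,
): void {
    const supported = factory.capabilities.supportsNonZeroInitialSequenceNumber;
    if (snapshotSequenceNumber !== 0 && !supported) {
        throw new Error(
            `${factory.name} cannot create a document starting at sequence number ${snapshotSequenceNumber}`,
        );
    }
}

const frsFactory: DocumentServiceFactoryLike = {
    name: "FRS",
    capabilities: { supportsNonZeroInitialSequenceNumber: true },
};
// Assumed value, reflecting that ODSP may not support this today.
const odspFactory: DocumentServiceFactoryLike = {
    name: "ODSP",
    capabilities: { supportsNonZeroInitialSequenceNumber: false },
};

assertCanCreateFromSnapshot(frsFactory, 42); // passes
assertCanCreateFromSnapshot(odspFactory, 0); // passes: zero is always allowed
```

Checking the flag before any snapshot is uploaded keeps the failure early and explicit for services that do not support the feature.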

Drawbacks

We are not dealing with large data in the most effective way. With snapshot recovery we need to fetch and transform individual summaries on the client side. Cycling through the version history looking for a non-corrupted version will be a time-consuming process. We will explore how important this concern is in the short term. Again, we are assuming initial usage of Fluid containers for transient real-time collaboration, with alternate storage as the primary source of truth.

Open Topics

Consider compatibility concerns (details TBD).
Lazy loading and its implications.
Understand the importance of including checksum work.
Auth considerations.

Azure Fluid Relay API Gaps observed

At minimum, we would rely on three FRS API calls:

Get snapshot history/versioning for a given document.
Get document snapshot/summary at a specific version.
Upload summary (existing call)
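As a rough picture of how these three calls combine into a copy operation, consider the sketch below. `RelayStorageLike`, its method names, and the in-memory implementation are hypothetical stand-ins written for this example; they are not the FRS REST surface, and the calls are synchronous for brevity.

```typescript
// Hypothetical service surface mirroring the three calls listed above.
interface SnapshotVersion {
    id: string;
    treeId: string;
    timestamp: number;
}

interface Summary {
    sequenceNumber: number;
    content: string;
}

interface RelayStorageLike {
    getVersions(documentId: string, count: number): SnapshotVersion[];
    getSummary(documentId: string, versionId: string): Summary;
    uploadSummary(summary: Summary): string; // returns the new document ID
}

interface StoredSummary {
    version: SnapshotVersion;
    summary: Summary;
}

// In-memory stand-in so the flow can be exercised without a service.
class InMemoryStorage implements RelayStorageLike {
    private readonly documents: { [documentId: string]: StoredSummary[] } = {};
    private documentCounter = 0;

    // Helper: appends a summary as the newest version of a document.
    addSummary(documentId: string, summary: Summary, timestamp: number): SnapshotVersion {
        const history = this.documents[documentId] || (this.documents[documentId] = []);
        const id = `v${history.length + 1}`;
        const version: SnapshotVersion = { id, treeId: `tree-${documentId}-${id}`, timestamp };
        history.push({ version, summary });
        return version;
    }

    getVersions(documentId: string, count: number): SnapshotVersion[] {
        const history = this.documents[documentId] || [];
        // Newest first, limited to `count` entries.
        return history.map((entry) => entry.version).reverse().slice(0, count);
    }

    getSummary(documentId: string, versionId: string): Summary {
        for (const entry of this.documents[documentId] || []) {
            if (entry.version.id === versionId) {
                return entry.summary;
            }
        }
        throw new Error(`version ${versionId} not found for ${documentId}`);
    }

    uploadSummary(summary: Summary): string {
        this.documentCounter += 1;
        const documentId = `doc-${this.documentCounter}`;
        this.addSummary(documentId, summary, Date.now());
        return documentId;
    }
}

// Copy = fetch the summary at a specific version, then upload it as a new document.
function copyDocument(storage: RelayStorageLike, documentId: string, versionId: string): string {
    return storage.uploadSummary(storage.getSummary(documentId, versionId));
}

const storage = new InMemoryStorage();
storage.addSummary("doc-a", { sequenceNumber: 10, content: "good" }, 100);
storage.addSummary("doc-a", { sequenceNumber: 20, content: "corrupted" }, 200);
const versions = storage.getVersions("doc-a", 5); // newest first: v2, v1
const copyId = copyDocument(storage, "doc-a", versions[1].id);
console.log(storage.getSummary(copyId, "v1").content); // good
```

The copy only works if the second call really returns the summary for the requested version, and the new document starts at the copied summary's non-zero sequence number.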

Current gaps:

Get snapshot/summary versioning history: timestamps and treeIDs seem to be off or missing.
Get snapshot/summary at a specific version: we always get only the “latest” summary.

Tasks

[x] #8870 (Idea exploration)
[x] Complete Demo / Presentation.
[x] #9649
[x] #9656
[x] #9651
[x] #9538
[x] #9833
[x] #10253

Mar 10 '22 22:03 ssimic2

Some thoughts based on reading the description:

1. "Our short-term solution assumes certain type of client persona - using Fluid Containers for transient real-time collaboration". For this scenario I'd not bother solving this problem, as developers using Fluid already have a stable version of the doc (in their alternate storage) that they can revert to by discarding the Fluid session.
2. "Ability to query for historical “versions” of a document" - Is this a requirement? This limits the solution to existing storage (Azure) in the form it exists today. I'd not be surprised if over time it takes a form more closely resembling ODSP (mostly because it's a natural evolution based on user requirements), where these concepts do not exist at the Fluid level (i.e. there is only 1 snapshot stored in a file, but storage provides other ways to version documents).
- I'd think having access to the latest snapshot is sufficient, as going beyond that (some kind of binary search across versions searching for the "right one") is likely out of scope.

Apr 04 '22 04:04 vladsud

Some thoughts based on reading the description:

1. "Our short-term solution assumes certain type of client persona - using Fluid Containers for transient real-time collaboration". For this scenario I'd not bother solving this problem, as developers using Fluid already have a stable version of the doc (in their alternate storage) that they can revert to by discarding the Fluid session.

2. "Ability to query for historical “versions” of a document" - Is this a requirement? This limits the solution to existing storage (Azure) in the form it exists today. I'd not be surprised if over time it takes a form more closely resembling ODSP (mostly because it's a natural evolution based on user requirements), where these concepts do not exist at the Fluid level (i.e. there is only 1 snapshot stored in a file, but storage provides other ways to version documents).

I'd think having access to the latest snapshot is sufficient, as going beyond that (some kind of binary search across versions searching for the "right one") is likely out of scope.

Yes, for 1., this is our starting assumption. It does not need to be correct, though. Unless we explicitly state that we do not support any scenarios other than the assumed one, we do need some base provision for recovery. We can learn from there, validate assumptions, and align them post 1.0 (#9670).

For 2., we do not have explicit requirements for versioning. The latest snapshot could itself be corrupted. The idea here is to allow the caller to "linearly" go back in time (through snapshot versions) until they can recreate the doc successfully. However, we could contain that functionality within the Azure API and not surface versioning. We were considering other future use cases (arbitrary rollbacks) that we could build on top of versioning, but if we go down the ODSP path post 1.0 anyway, this sort of API may not be relevant. This child ticket discusses API-specific topics: https://github.com/microsoft/FluidFramework/issues/9651.

Apr 05 '22 16:04 ssimic2
