
Fidesops should confirm which collections are reachable pre-execution, and gracefully error when tables are unreachable

Open iamkelllly opened this issue 4 years ago • 5 comments

Before Fidesops begins to execute a privacy request, it should...

  1. be aware of which databases (or in a future state, HTTPS connectors) an execution will require based on the policy and projected identity graph
  2. confirm whether Fidesops can access those database connections
  3. query only the minimum set of databases required to fulfill the privacy request, given the identity graph and policy

If a database table/collection is not reachable, Fidesops should fail with a clear error so that the user/operator knows exactly which database connection, which collection (if applicable), and which field(s) (if applicable) are unreachable, and for what reason.
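The pre-execution check described above can be sketched roughly as follows. This is an illustrative sketch only, not actual Fidesops code: `check_required_connections`, `fake_probe`, and the connection keys are all hypothetical names.

```python
# Hypothetical sketch of a pre-execution connectivity check: probe every
# connection the traversal will need, and collect failures with their reasons
# so the operator sees exactly which connections are unreachable and why.

def check_required_connections(required_connections, test_connection):
    """Probe each required connection; return {key: reason} for failures.

    required_connections: iterable of connection keys the traversal will need
    test_connection: callable that raises on an unreachable connection
    """
    unreachable = {}
    for key in required_connections:
        try:
            test_connection(key)
        except Exception as exc:
            unreachable[key] = str(exc)
    return unreachable

def fake_probe(key):
    # Stand-in for a real connectivity test (e.g. running SELECT 1 on the db).
    if key == "mongo_db":
        raise ConnectionError("connection refused")

failures = check_required_connections(["postgres_db", "mongo_db"], fake_probe)
# failures == {"mongo_db": "connection refused"}
```

With a result like this, execution can be aborted up front with a message naming each unreachable connection, rather than erroring partway through a run.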

iamkelllly avatar Nov 22 '21 23:11 iamkelllly

This means Fidesops will gracefully error if it can't reach a table that is defined in the dataset. Knowing which tables are required to be reachable requires knowledge of the Policy at execution time.

More refinement is required here: how can we define reachability relative to the Policy, rather than in general?

From talking to @stevenbenjamin it looks like we can pop a refinement session in the calendar for this ticket on Monday or Tuesday (November 29 or 30).

seanpreston avatar Nov 24 '21 17:11 seanpreston

Specifically, what I think we need here is the ability for Fidesops to support traversing datasets where there are collections that are not reachable - as long as those collections don't contain any data categories that are needed by the request policy.

That's kind of a mouthful, but it's easier to see via example.

Example

For this example, assume we have this dataset:

dataset:
  - fides_key: test_dataset
    collections:
      - name: address
        fields:
          - name: city
            data_categories: [user.provided.identifiable.contact.city]
          - name: id
            data_categories: [system.operations]
            fidesops_meta:
              primary_key: True

      - name: customer
        fields:
          - name: address_id
            data_categories: [system.operations]
            fidesops_meta:
              references:
                - dataset: test_dataset
                  field: address.id
                  direction: to
          - name: created
            data_categories: [system.operations]
          - name: email
            data_categories: [user.provided.identifiable.contact.email]
            fidesops_meta:
              identity: email
          - name: id
            data_categories: [user.derived.identifiable.unique_id]
            fidesops_meta:
              primary_key: True

      - name: report_cache
        fields:
          - name: report_hash
            data_categories: [system.operations]
          - name: report_value
            data_categories: [user.derived.nonidentifiable]

This dataset has three collections:

  1. `address`, which contains user-identifiable data, reachable via `customer.address_id`
  2. `customer`, which contains user-identifiable data, reachable via `customer.email`
  3. `report_cache`, some imaginary cache that stores the result of some computation for a user; this isn't identifiable, but in theory you might annotate it as `user.derived.nonidentifiable` as it's derived from their usage

Note that the `report_cache` table isn't traversable via fidesops: the application simply uses it as a persistent cache, based on some hash that it computes in app code that we don't understand.

Depending on the policy, this dataset should be either supported or not!

Case 1: Target Data Categories = [user.provided] -> VALID

When the request policy is for `user.provided`, this should be a valid dataset, as we only need to traverse `customer` and `address` to get everything we need.

Case 2: Target Data Categories = [user] -> INVALID

When the request policy is for `user`, this should be an invalid dataset, as we would also need to traverse `report_cache` to reach the `report_value` field, but that's not possible.

Today, our code would correctly reject this dataset for case 2, but would also reject this dataset for case 1. This ticket should change that and allow both cases to work as expected 👍
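The decision rule in the two cases above can be sketched in a few lines. This is an illustrative model, not the actual Fidesops traversal code; `dataset_valid`, `category_matches`, and the `collections` structure are hypothetical names. The key idea is that a dataset is valid for a policy as long as no *unreachable* collection contains a field whose data category falls under one of the policy's target categories.

```python
def category_matches(target: str, category: str) -> bool:
    # Taxonomy categories are hierarchical: "user" covers
    # "user.derived.nonidentifiable", "user.provided.identifiable.contact.city", etc.
    return category == target or category.startswith(target + ".")

def dataset_valid(collections, targets):
    """collections: {name: (reachable, [data_categories])}; returns (ok, offender)."""
    for name, (reachable, categories) in collections.items():
        if reachable:
            continue
        # An unreachable collection is only a problem if the policy targets
        # one of the data categories it contains.
        for cat in categories:
            if any(category_matches(t, cat) for t in targets):
                return False, name
    return True, None

# Simplified model of the example dataset above:
collections = {
    "address": (True, ["user.provided.identifiable.contact.city", "system.operations"]),
    "customer": (True, ["system.operations", "user.provided.identifiable.contact.email",
                        "user.derived.identifiable.unique_id"]),
    "report_cache": (False, ["system.operations", "user.derived.nonidentifiable"]),
}

# Case 1: targets = ["user.provided"] -> (True, None): report_cache holds nothing targeted
# Case 2: targets = ["user"]         -> (False, "report_cache"): user.derived.* is targeted
```

Note the prefix match: a broad target like `user` covers every `user.*` subcategory, which is exactly why case 2 fails while case 1 succeeds.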

NevilleS avatar Nov 24 '21 17:11 NevilleS

This is very clear. Brings up a few questions (of course):

  • a) since the validity of the traversal is dependent on the categories and use, on a request that asks for multiple actions, how do we respond to a situation where "we can take action A, but not B"?
  • b) is it ever valid to run a partial traversal? That is, maybe it's better, instead of failing the whole run, to echo something like "We can run this but the collections [db1.tableY, db2.tableX] are unreachable - do you want to proceed?" This could simplify how we respond to a)
  • c) we could also allow for some kind of annotation like "do_not_query" that would indicate "This should be skipped in access" / "This should be skipped in erasure" / "This can be skipped in...".
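To make idea (c) concrete, an annotation like that might live in `fidesops_meta` alongside the existing fields. The `skip_processing` key below is purely hypothetical (it does not exist in Fidesops today), shown on the `report_cache` collection from the example above:

```yaml
      - name: report_cache
        fidesops_meta:
          # Hypothetical annotation, not an existing Fidesops field:
          # tells the traversal this collection may be skipped for these actions.
          skip_processing: [access, erasure]
        fields:
          - name: report_hash
            data_categories: [system.operations]
          - name: report_value
            data_categories: [user.derived.nonidentifiable]
```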

stevenbenjamin avatar Nov 24 '21 17:11 stevenbenjamin

I don't think you meant to close this- reopening!

NevilleS avatar Nov 24 '21 18:11 NevilleS

Kicking out of 1.1.0 in favor of moving to the prioritized backlog. Not blocking for the release, but it's a really solid optimization for query running.

iamkelllly avatar Nov 29 '21 15:11 iamkelllly