
Support extracting multiple subsets from one table

Open reidab opened this issue 3 years ago • 3 comments

This is a more in-depth approach to #139 than the --data-only flag I proposed in #140. That flag may still be helpful in other cases, so I'd consider that PR separately.


This introduces the concept of table Subsets, which allow multiple Filter, Anonymise, and Relationships configurations to be attached to a single table. All functionality related to reading table data is refactored from operating on an entire table to operating on a single subset (ReadTable becomes ReadSubset).

For compatibility, Filter, Anonymise, and Relationships blocks defined at the root Table level are copied into a _default Subset when the config file is parsed.

So, a config like this:

[[Tables]]
  Name = "grains"

  [Tables.Filter]
    Match = "grains.size = 'large'"
    Limit = 10

  [Tables.Anonymise]
    weight = "Digits"

Will be parsed into something like this behind the scenes:

{
  Name: "grains",
  Subsets: [
    {
      Name: "_default",
      Filter: {
        Match: "grains.size = 'large'",
        Limit: 10
      },
      Anonymise: {
        weight: "Digits"
      }
    },
  ]
}
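As a rough illustration of that compatibility step, here is a minimal Go sketch. It is not the code from this PR: the `Table`, `Subset`, and `Filter` types and the `normalize` and `hasRootConfig` helpers are hypothetical, simplified stand-ins, and Relationships is left out for brevity. It also assumes a table with no root-level config is left unchanged, which the PR may handle differently.

```go
package main

// Filter mirrors the Filter block used in the examples above.
type Filter struct {
	Match string
	Limit int
}

// Subset groups one Filter and one Anonymise configuration under a name.
type Subset struct {
	Name      string
	Filter    Filter
	Anonymise map[string]string
}

// Table is a hypothetical, simplified stand-in for the parsed table config.
type Table struct {
	Name      string
	Filter    Filter
	Anonymise map[string]string
	Subsets   []Subset
}

// defaultSubsetName is the name given to the subset built from root-level config.
const defaultSubsetName = "_default"

// hasRootConfig reports whether Filter or Anonymise is set at the root level.
func hasRootConfig(t Table) bool {
	return t.Filter != (Filter{}) || len(t.Anonymise) > 0
}

// normalize copies root-level config into a "_default" subset and places it
// ahead of any explicitly declared subsets. Tables without root-level config
// are returned unchanged (an assumption of this sketch).
func normalize(t Table) Table {
	if !hasRootConfig(t) {
		return t
	}
	def := Subset{
		Name:      defaultSubsetName,
		Filter:    t.Filter,
		Anonymise: t.Anonymise,
	}
	t.Subsets = append([]Subset{def}, t.Subsets...)
	return t
}
```

Calling `normalize` on the `grains` table above would yield a single `_default` subset carrying the root-level Filter and Anonymise values. If the table also declared its own subsets, they would follow `_default` in the list.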

One area I'm still unsure about is how things should be handled when Filter config exists both at the root level and in Subsets.

For example:

[[Tables]]
  Name = "grains"

  [Tables.Filter]
    Match = "grains.size = 'large'"
    Limit = 10

  [Tables.Anonymise]
    weight = "Digits"

  [[Tables.Subsets]]
    Name = "starchy"

    [Tables.Subsets.Filter]
      Match = "grains.starchy = TRUE"
    [Tables.Subsets.Anonymise]
      name = "FirstName"

This config will currently be parsed into a tree that looks like this:

{
  Name: "grains",
  Subsets: [
    {
      Name: "_default",
      Filter: {
        Match: "grains.size = 'large'",
        Limit: 10
      },
      Anonymise: {
        weight: "Digits"
      }
    },
    {
      Name: "starchy",
      Filter: {
        Match: "grains.starchy = TRUE"
      },
      Anonymise: {
        name: "FirstName"
      }
    }
  ]
}

I think this is a reasonable approach overall — it treats the root level as its own separate subset — but I can also see how it could be confusing if people thought that the Filter, Anonymise, and Relationships blocks defined at the root level would be inherited by all subsets.

The alternative is to make this scenario throw an error, so a table can either use a root-level configuration, or it can define Subsets, but it cannot do both.
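For comparison, the stricter alternative could be sketched roughly as below. This is only an illustration under assumptions: `TableConfig`, its boolean fields, `ErrMixedConfig`, and `validate` are hypothetical names that do not come from klepto or from this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrMixedConfig is a hypothetical sentinel error for tables that combine
// root-level blocks with Subsets.
var ErrMixedConfig = errors.New("root-level config cannot be combined with Subsets")

// TableConfig is a simplified stand-in that only records which blocks
// were present in the parsed config.
type TableConfig struct {
	Name                 string
	HasRootFilter        bool
	HasRootAnonymise     bool
	HasRootRelationships bool
	SubsetNames          []string
}

// validate rejects a table that defines both root-level blocks and Subsets.
func validate(t TableConfig) error {
	hasRoot := t.HasRootFilter || t.HasRootAnonymise || t.HasRootRelationships
	if hasRoot && len(t.SubsetNames) > 0 {
		return fmt.Errorf("table %q: %w", t.Name, ErrMixedConfig)
	}
	return nil
}
```

With a check like this, the second `grains` config above would fail at parse time rather than silently producing a `_default` subset next to `starchy`, while root-only and subsets-only tables would still pass.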

Thoughts?

reidab avatar Aug 20 '22 05:08 reidab

Converted to a draft as there are issues with tables that aren't present in the config. Working on a fix.

reidab avatar Aug 23 '22 01:08 reidab

Cleaned up handling for undefined tables — should be ready for review again.

reidab avatar Aug 23 '22 21:08 reidab

Hi @reidab :wave:

I'll need to spend a bit more time with this one. The configuration became a bit more complex, as you mentioned, and I want to make sure that there are no better alternatives before

lucasmdrs avatar Sep 23 '22 11:09 lucasmdrs

Hi @reidab :wave:

I've written an alternative to this PR in #145 that doesn't require changes to the TOML structure. Since you raised the issue in #139 and wrote this solution, I'd like your feedback on whether it would solve the problem you described.

lucasmdrs avatar Nov 18 '22 15:11 lucasmdrs

This PR has been automatically marked as stale because it has not had any activity in the last 14 days. It will be closed if no further activity occurs, thank you for your contributions.

stale[bot] avatar Apr 07 '23 08:04 stale[bot]

This PR has been automatically closed because it has not had any activity in the last 21 days. Feel free to re-open in case you would like to follow up.

stale[bot] avatar Apr 14 '23 10:04 stale[bot]