klepto Support extracting multiple subsets from one table

This is a more in-depth approach to #139 than the --data-only flag I proposed in #140. That flag may still be helpful in other cases, so I'd consider that PR separately.

This introduces the concept of table Subsets, which allow multiple Filter, Anonymise, and Relationships configurations to be attached to a single table. All functionality related to reading table data is refactored from operating on an entire table to operating on a single subset (ReadTable becomes ReadSubset).

For compatibility, Filter, Anonymise, and Relationships blocks defined at the root Table level are copied into a _default Subset when the config file is parsed.

So, a config like this:

[[Tables]]
  Name = "grains"

  [Tables.Filter]
    Match = "grains.size = 'large'"
    Limit = 10

  [Tables.Anonymise]
    weight = "Digits"

Will be parsed into something like this behind the scenes:

{
  Name: "grains",
  Subsets: [
    {
      Name: "_default",
      Filter: {
        Match: "grains.size = 'large'",
        Limit: 10
      },
      Anonymise: {
        weight: "Digits"
      }
    },
  ]
}

One area that I'm somewhat unsure about is how things should be handled when Filter config exists both at the root level and in Subsets.

For example:

[[Tables]]
  Name = "grains"

  [Tables.Filter]
    Match = "grains.size = 'large'"
    Limit = 10

  [Tables.Anonymise]
    weight = "Digits"

  [[Tables.Subsets]]
    Name = "starchy"

    [Tables.Subsets.Filter]
      Match = "grains.starchy = TRUE"
    [Tables.Subsets.Anonymise]
      name = "FirstName"

This config will currently be parsed into a tree that looks like this:

{
  Name: "grains",
  Subsets: [
    {
      Name: "_default",
      Filter: {
        Match: "grains.size = 'large'",
        Limit: 10
      },
      Anonymise: {
        weight: "Digits"
      }
    },
    {
      Name: "starchy",
      Filter: {
        Match: "grains.starchy = TRUE"
      },
      Anonymise: {
        name: "FirstName"
      }
    }
  ]
}

I think this is a reasonable approach overall — it treats the root level as its own separate subset — but I can also see how it could be confusing if people thought that the Filter, Anonymise, and Relationships blocks defined at the root level would be inherited by all subsets.

The alternative is to make this scenario throw an error, so a table can either use a root-level configuration, or it can define Subsets, but it cannot do both.
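For comparison, the stricter alternative could look something like the sketch below. It is only an illustration of the idea: the types, the `validate` function, and the error message are hypothetical and not part of this PR, and Relationships is again omitted for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical, simplified config types; klepto's real structs may differ.
type Filter struct {
	Match string
	Limit int
}

type Subset struct {
	Name      string
	Filter    Filter
	Anonymise map[string]string
}

type Table struct {
	Name      string
	Filter    Filter
	Anonymise map[string]string
	Subsets   []Subset
}

// errMixedConfig is an illustrative error value, not one klepto defines.
var errMixedConfig = errors.New("root-level configuration and Subsets cannot be combined")

// validate rejects a table that uses root-level blocks and Subsets together.
func validate(t Table) error {
	hasRoot := t.Filter != (Filter{}) || len(t.Anonymise) > 0
	if hasRoot && len(t.Subsets) > 0 {
		return fmt.Errorf("table %q: %w", t.Name, errMixedConfig)
	}
	return nil
}

func main() {
	mixed := Table{
		Name:    "grains",
		Filter:  Filter{Match: "grains.size = 'large'", Limit: 10},
		Subsets: []Subset{{Name: "starchy", Filter: Filter{Match: "grains.starchy = TRUE"}}},
	}
	if err := validate(mixed); err != nil {
		fmt.Println("config error:", err)
	}
}
```

Failing at parse time would remove the ambiguity about inheritance, at the cost of making existing root-level configs incompatible with adding a first subset.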

Thoughts?

Aug 20 '22 05:08 reidab

Converted to a draft as there are issues with tables that aren't present in the config. Working on a fix.

Aug 23 '22 01:08 reidab

Cleaned up handling for undefined tables — should be ready for review again.

Aug 23 '22 21:08 reidab

Hi @reidab :wave:

I'll need to spend a bit more time with this one. The configuration became a bit more complex, as you mentioned, and I want to make sure that there are no better alternatives before

Sep 23 '22 11:09 lucasmdrs

Hi @reidab :wave:

I've written an alternative to this PR in #145 which doesn't require changes to the TOML structure. Since you brought up the issue in #139 and wrote this solution, I would like to get your feedback on it, in particular whether it would solve the problem you described.

Nov 18 '22 15:11 lucasmdrs

This PR has been automatically marked as stale because it has not had any activity in the last 14 days. It will be closed if no further activity occurs, thank you for your contributions.

Apr 07 '23 08:04 stale[bot]

This PR has been automatically closed because it has not had any activity in the last 21 days. Feel free to re-open in case you would like to follow up.

Apr 14 '23 10:04 stale[bot]