
Investigate support for composed columns

Open mslukebo opened this issue 4 years ago • 1 comments

Note: this feature request is not committed to being implemented. Consider this thread a request-for-comments on the idea of composed columns.

Currently, columns and projections act as wrappers around data inside a flat table. Columns cannot advertise relationships between each other, which implies that aggregating a column can only use values within that column. This fundamentally limits the types of data that can be correctly aggregated. For instance, suppose a column projects to a ratio:

C[i] = A[i] / B[i]

where i is an index into a conceptual flat table. We cannot, for example, perform the following aggregation:

Sum(C[i...j]) = Sum(A[i...j]) / Sum(B[i...j])

since, with the current column limitations, we only support aggregating this as

Sum(C[i...j]) = (A[i] / B[i]) + (A[i+1] / B[i+1]) + ... + (A[j] / B[j])

Note that for many projections, the second aggregation is sufficient. For ratios, however, it gives the wrong result: the sum of per-row ratios is generally not equal to the ratio of the sums.
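To make the difference concrete, here is a small numeric sketch in Python. The sample values for A and B are made up for illustration and are not taken from the SDK.

```python
# Hypothetical sample data: A and B are two columns of a flat table.
A = [1.0, 1.0, 8.0]
B = [2.0, 4.0, 4.0]

# What current columns support: sum the per-row projected ratios C[i] = A[i] / B[i].
sum_of_ratios = sum(a / b for a, b in zip(A, B))

# What this issue asks for: aggregate A and B first, then divide.
ratio_of_sums = sum(A) / sum(B)

print(sum_of_ratios)  # 2.75
print(ratio_of_sums)  # 1.0
```

The two results differ, so no aggregation that only sees the values of C can recover the ratio of sums.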

To solve this problem, this feature request proposes the idea of composed columns. A composed column C is a column definition that includes

  1. A list of every column X, Y, ... that C depends upon
  2. A function mapping values from X, Y, ... to a value

Composed columns conceptually do not map flat indices to values. Instead, they map arbitrary input values to an output value. This is fundamentally different from current columns, which are made of projections from an int row index to a value.

Since SDK drivers/the SDK runtime know which columns C depends upon, they are free to perform aggregations on those columns and pass the aggregated values into C's function. To achieve the initial desired aggregation, we can define

C = A / B

then configure A and B to be aggregated via Sum. When a driver/the runtime needs to aggregate a value on C, it would instead aggregate the same rows on A and B, then pass those aggregated values into C's function, A / B.
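The flow described above can be sketched as follows. This is a minimal, language-neutral illustration in Python, not the SDK's API: `ComposedColumn`, `aggregate_composed`, the `aggregators` mapping, and the sample table are all hypothetical names and data chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class ComposedColumn:
    """Hypothetical composed column: a dependency list plus a composing function."""
    name: str
    dependencies: List[str]          # names of the columns this column depends on
    compose: Callable[..., float]    # maps one value per dependency to an output value


def aggregate_composed(
    column: ComposedColumn,
    table: Dict[str, List[float]],
    aggregators: Dict[str, Callable[[Iterable[float]], float]],
    rows: Iterable[int],
) -> float:
    """Aggregate each dependency over the same rows, then apply the composing function."""
    rows = list(rows)
    aggregated = [
        aggregators[dep](table[dep][r] for r in rows)
        for dep in column.dependencies
    ]
    return column.compose(*aggregated)


# Hypothetical flat table with two ordinary columns, both aggregated via Sum.
table = {"A": [1.0, 1.0, 8.0], "B": [2.0, 4.0, 4.0]}
aggregators = {"A": sum, "B": sum}

# C = A / B, expressed over input values rather than row indices.
C = ComposedColumn(name="C", dependencies=["A", "B"], compose=lambda a, b: a / b)

result = aggregate_composed(C, table, aggregators, range(3))
print(result)  # 1.0
```

Aggregating over a single row reduces to the plain per-row ratio, so in this sketch a composed column still behaves like an ordinary projection when nothing is grouped.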

mslukebo avatar Oct 21 '21 14:10 mslukebo

Should we decide to go down this route to solve the larger problem of these more complex aggregations, here are some implementation notes:

  1. Composed columns would have to be first-class citizens of a table. There would be a new method such as ITableBuilder.AddComposedColumn and likely a new ComposedColumnConfiguration class
  2. It would be useful to have composed columns specify a format string in addition to a name, so a driver/the runtime can dynamically update any type of column header based on the state of dependent columns
  3. Aggregations may require "upgrading" types (e.g. int -> long or float -> double), so we may want to enforce that any "composition functions" take in upgraded types relative to the types of dependent columns
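As one possible reading of the format-string idea in note 2, the sketch below builds a column header from the current state of the dependent columns. The function name `composed_header`, the `{0} / {1}` format, and the `Sum(A)`-style labels are assumptions made for illustration; the issue does not specify any of them.

```python
from typing import List


def composed_header(format_string: str, dependency_labels: List[str]) -> str:
    """Fill a header format string with one label per dependent column (hypothetical helper)."""
    return format_string.format(*dependency_labels)


# Labels a driver might derive from how each dependency is currently aggregated.
header = composed_header("{0} / {1}", ["Sum(A)", "Sum(B)"])
print(header)  # Sum(A) / Sum(B)
```

If the dependencies were later switched to a different aggregation, the driver could regenerate the header from the same format string with new labels.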

mslukebo avatar Oct 21 '21 14:10 mslukebo