iceberg icon indicating copy to clipboard operation
iceberg copied to clipboard

Introduce MetricsMaxInferredColumnDefaultsStrategy

Open jkolash opened this issue 8 months ago • 6 comments

These changes address issue #11253 allowing for setting of a new default strategy that considers the total number of field metrics rather than just the number of top level columns.

the new property is write.metadata.metrics.max-inferred-column-defaults.strategy

and valid values would be original, depth, breadth

It currently preserves the original default behavior as changing that may be a more disruptive change as it could lead to unexpected performance regressions. A Breadth first strategy would likely be most compatible with the original strategy so it would be safer to default into vs the depth strategy. The original strategy could then be deprecated and removed in the future

This could also easily support a previously discussed feature of reversing order of field ids for considering defaults. Though that won't be included in this PR

I'm inclined to remove the depth strategy unless there is a strong desire to keep it.

jkolash avatar May 13 '25 13:05 jkolash

I brought this up at the community sync to gauge what other people thought about changing the behavior here without introducing a "strategy" option and the response was positive. Our rationale was that tables that have deep nesting and a lot of top-level columns are uncommon and would likely benefit from removing a lot of unnecessary metrics overall. So rather than introducing a strategy, I think we should move forward with the "breadth" approach: keep stats for top-level primitive columns, then the next layer, and so on until the 100 primitive field limit is exhausted.

rdblue avatar May 30 '25 18:05 rdblue

Thanks, I will simplify this PR to just include the breadth strategy and without a new property.

jkolash avatar May 30 '25 18:05 jkolash

So while doing a self review I came to question why a user provided default should not be bounded. this was the behavior before, but I'm not quite sure it is right. Users can set the write.metadata.metrics.max-inferred-column-defaults property so that allows more user control over the bounding behavior vs none at all.

jkolash avatar Jun 03 '25 00:06 jkolash

Stress testing this. For a schema with a schema with 1 million structs I got ~133ms per iteration when bounding to 100 fields.

  @Test
  public void perf(){
    AtomicInteger fieldId = new AtomicInteger(0);

    Supplier<Types.NestedField> newStruct = () -> required(fieldId.getAndIncrement(), String.valueOf(fieldId.get()),
            Types.StructType.of(required(fieldId.getAndIncrement(), String.valueOf(fieldId.get()), Types.IntegerType.get())));

    int items = 1000000;
    Schema schema = new Schema(IntStream.range(0, items).mapToObj( (i) -> newStruct.get()).collect(Collectors.toList()));

    StopWatch sw = StopWatch.createStarted();
    int iterations = 100;
    for (int i = 0; i < iterations; i++) {
      MetricsConfig.boundedBreadthFirstSubSchema(schema, 100);
    }
    sw.stop();
    System.out.println("ms per iteration: "+ sw.getTime(TimeUnit.MILLISECONDS) * 1.0/iterations);
  }

Output:

ms per iteration: 133.08

jkolash avatar Jun 05 '25 12:06 jkolash

@rdblue Let me know what else is needed for this PR.

jkolash avatar Jun 16 '25 11:06 jkolash

Should I re-open it with a new description?

jkolash avatar Jun 16 '25 11:06 jkolash

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

github-actions[bot] avatar Jul 17 '25 00:07 github-actions[bot]

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions[bot] avatar Jul 25 '25 00:07 github-actions[bot]

Merged. Thanks, @jkolash!

rdblue avatar Aug 07 '25 19:08 rdblue