Evaluating binomial & non-binomial metrics simultaneously with Experiment class & how to categorise metrics by type
I can't seem to find much info on how exactly to evaluate both binomial and non-binomial metrics at the same time within a single dataframe passed to the Experiment class.
It seems that, even with the method column specified, multiple_difference treats every metric as binomial. You would obviously need different input columns to perform a t-test, so how would I add and specify these columns, and how would I indicate them in Experiment?
Likewise, there's a really good paper you posted on your risk-aware product decision framework using multiple metrics, and I've seen success metrics mentioned in the repository/Q&A, but I couldn't find any documentation on how to specify success, deterioration, and guardrail metrics. I did see a method on the sample ratio, which is a form of quality metric, so I suspect this has been considered, but it's difficult to see how to implement the entire approach.
Do let me know if you need any further information. Thanks for your time!
Here's an example (that could be put in an example notebook or as a test case for the Experiment class):
import pandas as pd
import spotify_confidence

columns = [
    "group_name",
    "num_user",
    "sum",
    "sum_squares",
    "method",
    "metric",
    "preferred_direction",
    "non_inferiority_margin",
]
data = [
    ["Control", 6267728, 3240932, 52409321212, "z-test", "m1", "increase", 0.15],
    ["Test", 6260737, 3239706, 52397061212, "z-test", "m1", "increase", 0.15],
    ["Test", 6260737, 38600871, 12432573969, "z-test", "m2", None, None],
    ["Control", 6225728, 35963863, 18433512959, "z-test", "m2", None, None],
    ["Test", 62607, 26738, None, "chi-squared", "m3", "increase", None],
    ["Control", 62677, 16345, None, "chi-squared", "m3", "increase", None],
]
df = pd.DataFrame(columns=columns, data=data)
test = spotify_confidence.Experiment(
    data_frame=df,
    numerator_column="sum",
    numerator_sum_squares_column="sum_squares",
    denominator_column="num_user",
    categorical_group_columns="metric",
    interval_size=0.99,
    correction_method="bonferroni",
    metric_column="metric",
    treatment_column="group_name",
    method_column="method",
)
diff = test.multiple_difference(
    level="Control",
    level_as_reference=True,
    groupby="metric",
    non_inferiority_margins=True,
)
display(diff)
test.multiple_difference_plot(
    level="Control",
    level_as_reference=True,
    groupby="metric",
    non_inferiority_margins=True,
    use_adjusted_intervals=True,
    absolute=False,
).show('html')
Guardrail metrics are specified by providing a NIM (non-inferiority margin), as for the m1 metric. In the example we fail to reject the hypothesis "m1 in Test is no worse than m1 in Control", since part of the metric's confidence interval lies below the NIM.
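To make the decision rule behind a guardrail check concrete, here is a minimal sketch of a one-sided non-inferiority z-test. The function name, the rates, and the hard-coded critical value are all hypothetical, for illustration only; spotify-confidence performs the equivalent computation internally.

```python
from math import sqrt

Z_99 = 2.326  # one-sided 99th-percentile critical value of the standard normal


def non_inferior(p_control, p_test, n_control, n_test, nim):
    """True if Test is shown to be no worse than Control by more than the
    relative margin `nim` (one-sided z-test at alpha = 0.01).
    Hypothetical helper for illustration; not part of spotify-confidence."""
    diff = p_test - p_control
    se = sqrt(p_control * (1 - p_control) / n_control
              + p_test * (1 - p_test) / n_test)
    lower = diff - Z_99 * se           # lower confidence bound on the difference
    return lower > -nim * p_control    # the NIM is relative to the control rate


# Hypothetical rates: a small dip stays within the 15% margin, a large one doesn't
print(non_inferior(0.52, 0.515, 100_000, 100_000, nim=0.15))  # True
print(non_inferior(0.52, 0.40, 100_000, 100_000, nim=0.15))   # False
```

If the lower confidence bound on the difference clears the margin, non-inferiority is established; otherwise, as with m1 above, we cannot rule out a degradation larger than the NIM.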
Success metrics can be one-sided or two-sided, as specified by preferred_direction. I guess the difference between success and deterioration metrics is that success metrics are what you're actually trying to improve, while deterioration metrics are ones you hope stay neutral, often related to performance, latency, number of crashes, etc.
The way you'd implement deterioration metrics is to take the same data as for your main results (like Pelle provided), but flip the preferred direction, use no NIMs, and use a separate alpha.
For example, suppose you have one success metric that should increase and one guardrail metric (with a NIM) for which an increase is also a good change. You would set preferred_direction to increase for both and set the NIM for the guardrail metric, as in Pelle's example. Then you would make a similar call, but with preferred_direction set to decrease and no NIM, to test whether either metric has moved significantly in the wrong direction. In the paper we also use a different alpha for this test, so it draws on a separate budget. You can also include a sample ratio mismatch test here via the chi-squared test.
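As a minimal sketch of that recipe, the deterioration-test input can be derived from the main-analysis dataframe by flipping preferred_direction and clearing the NIMs. The dataframe here is a trimmed, hypothetical version of the example above; the second Experiment call, which would use a separate (typically stricter) alpha, is not shown.

```python
import pandas as pd

# Trimmed, hypothetical version of the example dataframe above
df = pd.DataFrame(
    columns=["metric", "preferred_direction", "non_inferiority_margin"],
    data=[["m1", "increase", 0.15], ["m2", None, None]],
)

# Deterioration-test input: same data, flipped direction, no NIMs.
# A second Experiment would then be built from df_det with a separate
# alpha budget, e.g. interval_size=0.999 instead of 0.99.
df_det = df.copy()
df_det["preferred_direction"] = df_det["preferred_direction"].replace(
    {"increase": "decrease", "decrease": "increase"}
)
df_det["non_inferiority_margin"] = None

print(df_det["preferred_direction"].tolist())  # ['decrease', None]
```

Two-sided metrics (preferred_direction of None) are left untouched, since a significant move in either direction is already flagged by the main analysis.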
Hi all,
Apologies for the late response. Thanks for looking into this, it all makes sense! I've successfully gotten this to work, so you can close this ticket (I'm on a separate account, so I cannot do it myself).