"ValueError: shape mismatch" when running difference test after setting own priors in BetaBinomialx
Hi,
Running spotify_confidence in an ipynb, the following error occurs when running the code below:
# Set df (data is the imported event-level frame; k is a variable holding the
# name of the numerator / success-count column)
df = data.groupby(['exp_var']).agg({'participants': 'count',
                                    k: 'sum'}).reset_index()
# Set priors
df['prior_alpha'] = 10000
df['prior_beta'] = 10000
## df:
# index | exp_var      | participants | k    | prior_alpha | prior_beta
# 0     | control      | 12345        | 1234 | 10000       | 10000
# 1     | intervention | 54321        | 4321 | 10000       | 10000
# Test using confidence library
test = spotify_confidence.BetaBinomial(data_frame=df,
                                       numerator_column=k,
                                       denominator_column='participants',
                                       categorical_group_columns=['exp_var'],
                                       prior_alpha_column='prior_alpha',
                                       prior_beta_column='prior_beta')
# Result
kpi_results = test.difference('control', 'intervention', absolute=False)
which raises the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
[***](***) in <cell line: 22>()
20
21 # Result
---> 22 kpi_results = test.difference('control', 'intervention', absolute=False)
7 frames
[***](***) in _sample_posterior(self, group_df, posterior_sample_size)
120 posterior_sample_size = self._monte_carlo_sample_size
121 posterior_alpha, posterior_beta = self._posterior_parameters(group_df)
--> 122 posterior_samples = np.random.beta(posterior_alpha, posterior_beta, size=posterior_sample_size)
123 return posterior_samples
124
numpy/random/mtrand.pyx in numpy.random.mtrand.RandomState.beta()
_common.pyx in numpy.random._common.cont()
_common.pyx in numpy.random._common.cont_broadcast_2()
__init__.cython-30.pxd in numpy.PyArray_MultiIterNew3()
ValueError: shape mismatch: objects cannot be broadcast to a single shape. Mismatch is between arg 0 with shape (500000,) and arg 1 with shape (2,).
From what I gather, this is due to the way the priors are handled in the BetaBinomial class:
if prior_alpha_column is None or prior_beta_column is None:
    self._alpha_prior, self._beta_prior = (0.5, 0.5)
else:
    self._alpha_prior = data_frame[prior_alpha_column]
    self._beta_prior = data_frame[prior_beta_column]
When the difference test is run, the whole prior column is passed into the posterior sampling rather than the individual prior value for the group being sampled, so the Beta parameters (length 2, one per row of df) cannot be broadcast against the requested Monte Carlo sample size. This may be mitigated by setting self._alpha_prior = data_frame[prior_alpha_column][0] (and likewise for beta), or a similar per-group lookup; a minimal reproduction is sketched below.
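For what it's worth, here is a minimal reproduction of the broadcast failure outside the library. The conjugate update alpha0 + k, beta0 + n - k is my assumption of what _posterior_parameters computes; only the array shapes matter for the error:

import numpy as np
import pandas as pd

df = pd.DataFrame({'exp_var': ['control', 'intervention'],
                   'participants': [12345, 54321],
                   'k': [1234, 4321],
                   'prior_alpha': [10000, 10000],
                   'prior_beta': [10000, 10000]})

sample_size = 500_000
row = df[df['exp_var'] == 'control']

# Passing the whole prior column (length 2) as the Beta parameters, as the
# constructor snippet above ends up doing, cannot be broadcast against the
# requested sample size and reproduces the traceback:
alpha_bad = df['prior_alpha'] + row['k'].iloc[0]
beta_bad = df['prior_beta'] + row['participants'].iloc[0] - row['k'].iloc[0]
# np.random.beta(alpha_bad, beta_bad, size=sample_size)
# -> ValueError: shape mismatch: objects cannot be broadcast to a single shape

# Using the scalar prior belonging to the sampled row works as expected:
alpha_ok = row['prior_alpha'].iloc[0] + row['k'].iloc[0]
beta_ok = row['prior_beta'].iloc[0] + row['participants'].iloc[0] - row['k'].iloc[0]
samples = np.random.beta(alpha_ok, beta_ok, size=sample_size)
print(samples.shape)  # (500000,)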
Also, one may want to consider a setup where the user can supply a prior for only alpha or only beta, with the other parameter defaulted. Currently, if a prior column is supplied for only one of the two, the constructor silently reverts to the default (0.5, 0.5) for both parameters without informing the user; a sketch of a more explicit fallback follows.
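Something along these lines (a hypothetical resolve_priors helper, purely to illustrate the idea, not existing library code) would keep the current default behaviour while making the fallback visible:

import warnings

def resolve_priors(data_frame, prior_alpha_column=None, prior_beta_column=None):
    # Hypothetical helper: default both priors only when neither column is
    # given, and warn explicitly when just one of the two is supplied.
    if prior_alpha_column is not None and prior_beta_column is not None:
        return data_frame[prior_alpha_column], data_frame[prior_beta_column]
    if prior_alpha_column is not None or prior_beta_column is not None:
        warnings.warn("Only one of prior_alpha_column/prior_beta_column was "
                      "supplied; falling back to the default (0.5, 0.5) prior "
                      "for both parameters.")
    return 0.5, 0.5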
Last, and least, the sampling procedure in _sample_posterior is not MCMC as the comments suggest. It is Monte Carlo, but not Markov chain Monte Carlo. That is actually a good thing in this setting: the posterior is known in closed form, so i.i.d. samples can be drawn directly and no computational time needs to be spent on a Markov chain.
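For illustration, with the numbers from the df above and again assuming the standard conjugate update, the whole relative-difference computation can be done with plain i.i.d. draws:

import numpy as np

rng = np.random.default_rng(0)
n_samples = 500_000

# Posterior draws per group, assuming Beta(prior_alpha + k, prior_beta + n - k)
control = rng.beta(10000 + 1234, 10000 + 12345 - 1234, size=n_samples)
intervention = rng.beta(10000 + 4321, 10000 + 54321 - 4321, size=n_samples)

rel_diff = intervention / control - 1  # relative difference (cf. absolute=False above)
print(np.percentile(rel_diff, [2.5, 50, 97.5]))  # median and 95% credible interval
print((rel_diff > 0).mean())                     # posterior P(intervention > control)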
Best regards and many thanks in advance! Sam