python_mozetl
test_taar_similarity.test_compute_donors causing test failures
https://github.com/mozilla/python_mozetl/blob/0f8189f87f857f43e9c0142f9c612a0bcc28978c/tests/test_taar_similarity.py#L258-L263
________________________________________________ test_compute_donors ________________________________________________

spark = <pyspark.sql.session.SparkSession object at 0x7fa2c3b17f10>
addon_whitelist = ['system-addon-guid', 'var-0-guid-0', 'var-0-guid-1', 'var-0-guid-2', 'var-1-guid-0', 'var-1-guid-1', ...]
multi_clusters_df = DataFrame[client_id: string, normalized_channel: string, geo_city: array<strin...ar_parent_browser_engagement_unique_domains_count: array<struct<value:bigint>>]

    def test_compute_donors(spark, addon_whitelist, multi_clusters_df):
        multi_clusters_df.createOrReplaceTempView("longitudinal")

        # Perform the clustering on our test data. We expect
        # 3 clusters out of this and 10 donors.
>       _, donors_df = taar_similarity.get_donors(spark, 3, 10, addon_whitelist, random_seed=42)

tests/test_taar_similarity.py:263:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
mozetl/taar/taar_similarity.py:151: in get_donors
    clusters = compute_clusters(addons_df, num_clusters, random_seed)
mozetl/taar/taar_similarity.py:101: in compute_clusters
    model = pipeline.fit(addons_df)
.tox/py27/local/lib/python2.7/site-packages/pyspark/ml/base.py:132: in fit
    return self._fit(dataset)
.tox/py27/local/lib/python2.7/site-packages/pyspark/ml/pipeline.py:109: in _fit
    model = stage.fit(dataset)
.tox/py27/local/lib/python2.7/site-packages/pyspark/ml/base.py:132: in fit
    return self._fit(dataset)
.tox/py27/local/lib/python2.7/site-packages/pyspark/ml/wrapper.py:288: in _fit
    java_model = self._fit_java(dataset)
.tox/py27/local/lib/python2.7/site-packages/pyspark/ml/wrapper.py:285: in _fit_java
    return self._java_obj.fit(dataset._jdf)
.tox/py27/local/lib/python2.7/site-packages/py4j/java_gateway.py:1160: in __call__
    answer, self.gateway_client, self.target_id, self.name)
.tox/py27/local/lib/python2.7/site-packages/pyspark/sql/utils.py:63: in deco
    return f(*a, **kw)
Please close this. The TAAR team will rewrite the job to move off the longitudinal table and use the clients_daily table instead.
https://github.com/mozilla/taar/issues/115
@acmiyaguchi I think this issue can be closed, as the current (refactored) job no longer triggers this behaviour.