
November 2020 dataset contains duplicate wptids

Open Themanwithoutaplan opened this issue 5 years ago • 4 comments

I'm not quite sure where this bug belongs, but I discovered that a few of the tests in the November 2020 dataset have duplicate test ids:

201112_Dx0_59, 201112_Dx0_9R, 201112_Mx0_2C, 201112_Mx0_34, 201112_Mx0_3B, 201112_Mx0_3H, 201112_Mx0_4N, 201112_Mx0_6Q, 201112_Mx0_7, 201112_Mx0_7X, 201112_Mx0_A4, 201112_Mx0_N

I checked the pages dataset and all rows contain the same values for the same test ids. As it's no longer possible to access the tests directly on the HTTP Archive WPT instance, I'm not able to determine which website has been given the wrong test id.
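For reference, a minimal sketch of one way to check for duplicate wptids in BigQuery; the table and column names here are assumptions and may not match exactly what was queried.

```python
# Sketch: list wptids that appear more than once in a pages table.
# Requires google-cloud-bigquery and configured credentials; the table
# and column names are illustrative assumptions.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT wptid, COUNT(*) AS n
    FROM `httparchive.summary_pages.2020_11_01_mobile`
    GROUP BY wptid
    HAVING COUNT(*) > 1
    ORDER BY n DESC
"""
for row in client.query(sql).result():
    print(row.wptid, row.n)
```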

Themanwithoutaplan avatar Jan 12 '21 17:01 Themanwithoutaplan

@pmeenan I'd expect URLs like https://webpagetest.httparchive.org/result/201101_Mx10_1YGR/ to show the test results but I'm seeing "test not found" errors for the handful of test IDs I've tried from November and December crawls.
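As a quick way to reproduce the "test not found" behaviour, here's a rough sketch that probes a few result URLs; treating a non-200 response or a "Test not found" body as missing is a heuristic assumption, not necessarily how the server reports it.

```python
# Sketch: probe WPT result pages for a handful of test IDs. The
# missing/found heuristic below is an assumption, not the server's
# documented behaviour.
import requests

test_ids = ["201101_Mx10_1YGR", "201112_Dx0_59", "201112_Mx0_2C"]
for tid in test_ids:
    url = f"https://webpagetest.httparchive.org/result/{tid}/"
    resp = requests.get(url, timeout=30)
    missing = resp.status_code != 200 or "Test not found" in resp.text
    print(f"{tid}: {'missing' if missing else 'found'}")
```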

As for the general issue of IDs being duplicated, that's weird and worth investigating.

rviscomi avatar Jan 12 '21 18:01 rviscomi

FWIW, the respective websites on legacy do seem to have different values. Is it running big queries for each website? I stumbled across the problem while testing whether I could use the wptid as the primary key in my database.
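For context, a minimal sketch (using SQLite as a stand-in, with an illustrative schema and placeholder URLs) of why the duplicates break that primary-key use case:

```python
# Sketch: with wptid as the primary key, a duplicated November ID makes
# the second insert fail. Schema and URLs are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (wptid TEXT PRIMARY KEY, url TEXT)")
conn.execute("INSERT INTO pages VALUES ('201112_Dx0_59', 'https://example.com/')")
try:
    conn.execute("INSERT INTO pages VALUES ('201112_Dx0_59', 'https://example.org/')")
except sqlite3.IntegrityError as exc:
    print("duplicate wptid rejected:", exc)
```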

Themanwithoutaplan avatar Jan 13 '21 10:01 Themanwithoutaplan

Sorry about the delay - it's been on my radar to look into but I haven't had a chance.

1 - Duplicate IDs shouldn't happen in theory, but if there is a problem with flock() or some crazy race condition right at midnight it might. I can switch to using "private" IDs, which are random instead of sequentially numbered, and that might mitigate it, but I'd like to understand the root cause first. Do we have a sense of how often it happened? I saw "very few", but if it's not more than a handful and they all reported success, I'm not sure it's worth spending a LOT of time on. (There's a sketch of the locking pattern involved after item 2 below.)

2 - Missing tests are a bit more concerning. We currently archive to IA in batches of 10k tests because the system REALLY doesn't like lots of smaller files. We used to wait and verify that the zip became available before deleting the tests, but it can take a few days for IA to process the zips and we couldn't keep tests around that long, so now we delete the tests as soon as the upload is successful. If a zip then fails to process, those tests are lost (there's a rough sketch of this flow below). I want to move the archiving to the Google storage bucket instead, which will simplify the logic significantly (no more need to batch them in groups of 10k tests), but it'll take a few days of work to re-plumb the archiving and restoring logic (and is probably best done between crawls).
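For illustration only, here's a rough sketch of the flock()-guarded sequential-counter pattern referred to in item 1. This is not WebPageTest's actual ID-allocation code; it just shows the kind of lock whose failure (or a race around it) could hand out the same ID twice.

```python
# Illustrative sketch only (Unix-only, uses fcntl): a flock()-guarded
# sequential counter. If the exclusive lock is not honoured, two
# processes can read the same value and allocate duplicate test IDs.
import fcntl
import os

def next_test_number(counter_path="test_counter.txt"):
    fd = os.open(counter_path, os.O_RDWR | os.O_CREAT)
    with os.fdopen(fd, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # exclusive advisory lock
        raw = f.read().strip()
        current = int(raw) if raw else 0
        f.seek(0)
        f.truncate()
        f.write(str(current + 1))
        return current + 1              # lock released when f closes
```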
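And a rough sketch of the batch-then-delete archiving flow from item 2; every name here is hypothetical, and it stands in for, rather than reproduces, the real archiving code.

```python
# Hypothetical sketch of the flow described in item 2: zip a batch of
# test result directories, upload the zip, and delete the local copies
# once the upload (not the remote processing) succeeds. A zip that
# later fails to process on the remote side loses those tests.
import shutil
import zipfile
from pathlib import Path

def archive_batch(test_dirs, zip_path, upload):
    """`upload` is a hypothetical callable returning True on success."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for d in map(Path, test_dirs):
            for p in d.rglob("*"):
                if p.is_file():
                    zf.write(p, p.relative_to(d.parent))
    if upload(zip_path):
        for d in map(Path, test_dirs):
            shutil.rmtree(d)  # deleted without waiting for processing
```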

pmeenan avatar Jan 14 '21 23:01 pmeenan

Duplicate ids affect only a few tests (as far as I can tell) and only the November crawl. For my purposes it would be sufficient to know which test refers to which website so I can remove the others. Or, alternatively, to add the correct id: it looks like the legacy reports themselves may at least have the correct metrics.

My random spot checks suggest that most tests since October are missing. I don't know whether they're really missing or whether it's just a frontend issue. I'd expect you to be able to check this more easily than an external crawler that's just checking the responses.

I've just started importing the January data and will run some tests later. BTW, I suspect it's completely unrelated, but tests now seem to complete several days earlier.

Themanwithoutaplan avatar Jan 20 '21 12:01 Themanwithoutaplan