squad "completed" count wrong in latest builds view

screenshot from 2018-04-05 13-53-51

Apr 05 '18 19:04 danrue

I downloaded the latest db and ran through the code. These are two different things. The green 'token' comes from ProjectStatus and reflects the number of TestRuns that ended in the 'expected' state. A TestRun is created using the TestJob as a base, but the status from LAVA is discarded because it's sometimes misleading (or irrelevant). So a TestRun is considered 'completed' when SQUAD was able to retrieve something from it and the state looks sane.

On the right, the status displayed comes directly from LAVA. The red X means that LAVA returned an error at some point. It could be while submitting the job or when retrieving results. The error can also come from the job status.

To sum up, these two items have nothing to do with each other. I wonder how to fix the confusion. I don't want to rely on LAVA statuses as they sometimes seem a bit too aggressive. On the other hand, it would simplify LAVA job handling (incomplete LAVA jobs would simply be ignored).
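
To illustrate how the two indicators can disagree, here is a minimal sketch. The `Job` class, its fields, and both helper functions are hypothetical and are not SQUAD's actual models or code. It only assumes what is described above: the 'completed' count ignores the LAVA status, while the right-hand status is driven by it.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical stand-in for a test job; not SQUAD's TestJob model.
    lava_status: str       # status reported by LAVA (string values are assumptions)
    results_fetched: bool  # whether any results could be retrieved
    results_sane: bool     # whether the retrieved data looked usable


def counts_as_completed_run(job: Job) -> bool:
    # Green 'token': the LAVA status is not consulted at all.
    return job.results_fetched and job.results_sane


def shows_red_x(job: Job) -> bool:
    # Right-hand status: depends only on what LAVA reported.
    return job.lava_status != "Complete"


jobs = [
    Job("Complete", True, True),
    Job("Incomplete", True, True),    # results retrieved, but LAVA flagged the job
    Job("Incomplete", False, False),  # nothing usable was retrieved
]

completed = sum(counts_as_completed_run(j) for j in jobs)
red = sum(shows_red_x(j) for j in jobs)
print(f"completed runs: {completed}, jobs with red X: {red}")
```

The second job is counted as 'completed' and also gets a red X, which is the kind of mismatch shown in the screenshot.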

Apr 05 '18 19:04 mwasilew

I think it should rely on lava statuses (I thought it did until now), and as you say, it would simplify things.

Apr 05 '18 19:04 danrue

This might throw away quite a few kselftest results. For example, here: https://lkft.validation.linaro.org/scheduler/job/168941. Right now we collect the results that are available. If we start relying on LAVA job status, all these results will be discarded.

Apr 05 '18 19:04 mwasilew

I know it seems like a good idea to take partial results, but it's kind of like writing code without catching errors; complexity grows because of all the exotic conditions you have to deal with, which ultimately just makes things work worse, not better.

I favor simplicity over complexity. After all, the biggest issue that qa-reports has as a product is confusion and complexity. Reducing either moves us forward.

I didn't know about that issue with kselftest. Do you have any idea how prevalent it is?

Apr 05 '18 19:04 danrue

If we go with this assumption, we'll discard any test jobs that fail halfway through and that log the results 'live'. We'll simply assume that all tests have to complete or we discard the full set.

kselftests are a bit better right now, but they used to 'hang' on the zram.sh test (the last one). The test was completing, but the script was not exiting. In LAVA this is classified as 'Incomplete'. We collect the results 'live' in the case of kselftests, so this problem manifests there. We don't do that in any other case right now.
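
The trade-off between the two policies can be sketched roughly as follows. This is an illustration only: `collect_results`, the `strict` flag, the test names other than zram.sh, and the result values are all made up for the example and do not come from SQUAD.

```python
from typing import Dict


def collect_results(lava_status: str, live_results: Dict[str, str], strict: bool) -> Dict[str, str]:
    """Return the results that would be kept under a given policy.

    strict=True  : rely on the LAVA status and drop everything unless the job completed.
    strict=False : keep whatever was logged 'live', whatever the LAVA status is.
    """
    if strict and lava_status != "Complete":
        return {}
    return dict(live_results)


# A run where every test logged a result but the script never exited,
# so LAVA classified the job as 'Incomplete'.
live = {"hypothetical_test_a": "pass", "hypothetical_test_b": "fail", "zram.sh": "pass"}

kept_lenient = collect_results("Incomplete", live, strict=False)
kept_strict = collect_results("Incomplete", live, strict=True)
print(len(kept_lenient), len(kept_strict))
```

Under the lenient policy all three results survive; under the strict policy the whole set is dropped because of the job status.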

Apr 05 '18 20:04 mwasilew

Please correct me if I'm wrong, but I think this issue will also be settled with this PR: https://github.com/Linaro/squad/pull/579

It'd be nice to have a doc page on testjob/build states...

Aug 15 '19 12:08 chaws

@chaws I'm not 100% sure #579 solves this problem. Let's roll it out and check.

Aug 15 '19 12:08 mwasilew

Just checked for this issue and it seems it's still happening.

Here it shows 21 incomplete test jobs: https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-5.10.y/build/v5.10.4-64-g18347c4f0781/testjobs/?name=&job_status=Incomplete&submitted=1&fetched=1&job_id=&environment=&has_errors=1#!#collapseOne

But on the main project page, it shows 35.

Jan 13 '21 18:01 chaws