handson-ml2 [IDEA] Chapter 3: 90% recall classifier

In chapter 3 (p. 96) we see how a 90% precision classifier can be created.

What about a 90% recall classifier?

One would be tempted to do this (at least I was):

threshold_90_recall = thresholds[np.argmax(recalls >= 0.90)]

but a simple

recalls[np.argmax(recalls >= 0.90)] # result 1.0

would demonstrate that it's not correct.
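A minimal sketch of why this fails, using made-up arrays (not the MNIST results from the book) laid out the way precision_recall_curve() returns them: the thresholds increase, the recalls decrease, and recalls has one extra trailing 0.

```python
import numpy as np

# Made-up values for illustration only.
thresholds = np.array([-3.0, -1.0, 0.0, 1.0, 2.0])
recalls = np.array([1.0, 0.95, 0.92, 0.80, 0.40, 0.0])

mask = recalls >= 0.90         # [True, True, True, False, False, False]
first_true = np.argmax(mask)   # argmax returns the index of the first True

print(first_true)              # 0
print(recalls[first_true])     # 1.0
print(thresholds[first_true])  # -3.0, the lowest threshold
```

Because the mask starts with True, the first True is always at index 0, so this picks the lowest threshold rather than the one closest to 90% recall.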

I've tried different solutions and this is the most elegant/concise I came up with

threshold_90_recall = thresholds[len(recalls) - np.argmax(recalls[::-1] >= 0.90) - 1]

Feb 01 '22 16:02 FedericoTrifoglio

Hi @FedericoTrifoglio, indeed, the 90% precision classifier code thresholds[np.argmax(precisions >= 0.90)] only works because the precision/recall arrays correspond to strictly increasing thresholds. As the threshold increases, the precision gradually increases (except when the threshold gets really high, as you can see in Figure 3–4 and as explained in the note below that figure), while the recall gradually drops.

Therefore precisions >= 0.90 looks like [False, False, False, False, False, ..., False, True, True, True, True, True, ...] with perhaps a few False near the end of the array, if precision drops back down below 90% as the threshold gets really high (which is unlikely). In any case, np.argmax(precisions >= 0.90) returns the index of the first True value. Since recall drops regularly as the threshold increases, this argmax value gives the index of the highest recall for that level of precision. That's exactly what we want for the 90% precision classifier: the highest recall we can get with at least 90% precision.
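As a small illustration of the precision case, here is a sketch with made-up arrays (assumed values, not taken from the book) in which precision rises and recall falls as the threshold increases:

```python
import numpy as np

# Made-up values for illustration only; precisions and recalls have one
# more element than thresholds, as with precision_recall_curve().
thresholds = np.array([-3.0, -1.0, 0.0, 1.0, 2.0])
precisions = np.array([0.50, 0.70, 0.85, 0.92, 0.97, 1.0])
recalls = np.array([1.0, 0.95, 0.92, 0.80, 0.40, 0.0])

mask = precisions >= 0.90   # [False, False, False, True, True, True]
idx = np.argmax(mask)       # index of the first True
threshold_90_precision = thresholds[idx]

print(idx)                     # 3
print(threshold_90_precision)  # 1.0
print(recalls[idx])            # 0.8, the highest recall with precision >= 90%
```

Since recall only goes down from that index onward, the first True is also the point with the highest recall among those that reach 90% precision.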

But things are different for recall. Since recall drops regularly as thresholds increase, recalls >= 0.90 looks like this: [True, True, True, True, True, ..., True, True, True, False, False, False, False, ..., False, False]. So we must use np.argmin() instead of np.argmax(), to get the index of the first False value (instead of the first True, which is just 0, and corresponds to the lowest threshold, the lowest precision, and the highest recall of 100%). But this gives you the index of the first recall value below 90%, so you must use that index minus 1 to get the index of the last recall value that is still at least 90%, like this:

threshold_90_recall = thresholds[np.argmin(recalls >= 0.90) - 1]
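To check that the np.argmin() version and the reversed-array version pick the same index, here is a sketch on made-up arrays (assumed values, shaped like the output of precision_recall_curve()):

```python
import numpy as np

# Made-up values for illustration only.
thresholds = np.array([-3.0, -1.0, 0.0, 1.0, 2.0])
recalls = np.array([1.0, 0.95, 0.92, 0.80, 0.40, 0.0])

# First False in the mask, minus 1: the last recall that is still >= 0.90.
idx_argmin = np.argmin(recalls >= 0.90) - 1

# Same index found by scanning the reversed array for its first True.
idx_reversed = len(recalls) - np.argmax(recalls[::-1] >= 0.90) - 1

print(idx_argmin, idx_reversed)  # 2 2
print(thresholds[idx_argmin])    # 0.0
print(recalls[idx_argmin])       # 0.92
```

Both expressions rely on the mask being a run of True values followed by a run of False values.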

This approach will work fine in general (and it gives the same result as yours). However, if the minimum recall is low, such as 5% (instead of 90%), then you may end up in the "bumpy" region of the precisions array, which means you won't necessarily get the highest precision with at least the desired recall. To ensure that you do, you can use this code instead:

threshold_5_recall = thresholds[precisions[recalls >= 0.05].argmax()]

You can try running this code, and you will see that it gives a lower threshold than the previous code, which means a higher recall than the minimum desired recall of 5% (it gets 6.3% instead), and it also gives a higher precision of 97.15% instead of 96.45%.

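Note that this last code only works because the recalls array is monotonically decreasing (it never increases as the threshold rises), so the mask recalls >= 0.05 is True for the first elements and False for all the rest. Therefore precisions[recalls >= 0.05] is just a right-truncated version of precisions, the indices of its values are the same as in precisions, and using them to index into thresholds works fine.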
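The difference between the two approaches can be sketched on made-up arrays with a "bumpy" precision at high thresholds (assumed values, not the MNIST numbers quoted above). The idx_safe line is an extra variant, not from the book, that maps the position back through np.flatnonzero() so it does not depend on the mask being a prefix.

```python
import numpy as np

# Made-up values for illustration only.
thresholds = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
precisions = np.array([0.60, 0.80, 0.97, 0.95, 0.96, 0.90, 1.0])
recalls = np.array([1.00, 0.50, 0.08, 0.07, 0.06, 0.02, 0.0])

min_recall = 0.05
mask = recalls >= min_recall  # True for the first 5 elements only

# Highest threshold that still reaches the minimum recall.
idx_last = np.argmin(mask) - 1

# Highest precision among all points that reach the minimum recall.
idx_best = precisions[mask].argmax()

# Variant that does not assume the mask is a prefix of the array.
idx_safe = np.flatnonzero(mask)[precisions[mask].argmax()]

print(thresholds[idx_last], precisions[idx_last], recalls[idx_last])  # 4.0 0.96 0.06
print(thresholds[idx_best], precisions[idx_best], recalls[idx_best])  # 2.0 0.97 0.08
```

In this toy example the second approach selects a lower threshold, with both a higher recall and a higher precision than the first.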

I hope this helps.

Feb 02 '22 03:02 ageron

I see what you mean with the last point.

Assuming I am willing to tolerate a recall as low as 5%, what's the highest precision I would get?

With thresholds[np.argmin(recalls >= 0.05) - 1] I would get a sub-optimal precision (second red point in the chart below), so I need to restrict the search to my area of interest (recall of at least 5%) and take the highest precision within it.

Feb 02 '22 09:02 FedericoTrifoglio