Unaligned predicted spans are ignored in `Scorer.score_spans`
How to reproduce the behaviour
I originally encountered this issue when attempting to score a simple regex-based span prediction approach as a baseline. My reference documents contain 286 spans and the predicted documents contain 696 spans. Clearly the regex-based approach is overpredicting and should have a maximum precision of 286 / 696 = 41%.
However, when I run these documents through Scorer.score_spans, it measures a precision score of 58%!
I believe I was able to trace the issue to how Example.get_aligned_spans_x2y behaves and how it is used in Scorer.score_spans. Scorer.score_spans works by creating two sets, gold_spans and pred_spans, containing the reference and predicted spans for each document. To ensure that these spans can be compared easily, the predicted spans are aligned to the tokenization of the reference document before they're added to the pred_spans set. This is done by calling Example.get_aligned_spans_x2y:
https://github.com/explosion/spaCy/blob/ddffd096024004f27a0dee3701dc248c4647b3a7/spacy/scorer.py#L411-L413
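For reference, the pattern at the linked lines looks roughly like this (a paraphrase, not the exact source; getter and attr are the scorer's own arguments):

# Paraphrased sketch of the alignment step in Scorer.score_spans (not verbatim):
# only spans that survive get_aligned_spans_x2y are ever added to pred_spans
for span in example.get_aligned_spans_x2y(getter(example.predicted, attr)):
    pred_spans.add((span.label_, span.start, span.end))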
The aligned spans are immediately iterated over and added to the pred_spans set. This works fine if the predicted spans can be aligned to the reference document. However, there are cases where it's not possible to exactly align the predicted spans to the reference document. For example, if we have:
- Reference document: ["AAABBBCCC"]
- Predicted document: ["AAA", "BBB", "CCC"] (where BBB is a predicted span)
Then it's not possible to represent the predicted span (BBB) as a contiguous sequence of whole tokens in the reference document.
When this happens, Example.get_aligned_spans_x2y skips the span entirely! I believe that this specifically occurs when Example._get_aligned_spans checks that the text of the newly aligned span matches the text of the original, unaligned span:
https://github.com/explosion/spaCy/blob/ddffd096024004f27a0dee3701dc248c4647b3a7/spacy/training/example.pyx#L315-L319
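Here's a minimal check of just the alignment step (a small sketch, assuming spaCy 3.x; the fuller reproduction script below then shows how this plays out in Scorer.score_spans):

import spacy
from spacy.tokens import Doc
from spacy.training import Example

nlp = spacy.blank("en")
# The reference text tokenizes as a single token "AAABBBCCC"
reference = nlp.make_doc("AAABBBCCC")
# The predicted doc splits the same text into three tokens
predicted = Doc(nlp.vocab, words=["AAA", "BBB", "CCC"], spaces=[False, False, False])
example = Example(predicted=predicted, reference=reference)
# The "BBB" span can't be mapped onto whole reference tokens, so it's dropped
print(example.get_aligned_spans_x2y([predicted[1:2]]))  # -> []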
This behavior is understandable, since there's not really a "right" thing to do here, but in the context of Scorer.score_spans, this results in unaligned predicted spans being ignored. As a result, these spans aren't treated as false positives as they should be.
Here's a code snippet that demonstrates this behavior with some constructed examples:
from typing import Optional
import spacy
def create_annotated_doc(nlp: spacy.language.Language, text: str, span_char_range: Optional[tuple[int, int]]) -> spacy.tokens.Doc:
    """Create a doc with an optional span, retokenizing if necessary."""
    # Convert the text to a doc
    doc = nlp(text)
    # Initialize the span group container
    doc.spans["sc"] = []
    # If no span was requested, return immediately
    if span_char_range is None:
        return doc
    # Extract the span attributes
    span_start, span_end = span_char_range
    # Check if the span is aligned with the doc's tokens
    token_starts = {token.idx for token in doc}
    token_ends = {token.idx + len(token.text) for token in doc}
    if span_start not in token_starts or span_end not in token_ends:
        # The span is not aligned with the doc's tokens, so we need to retokenize the doc.
        # The general case is too complex for this example, so here we'll assume the doc has
        # only a single token with no trailing whitespace
        if len(doc) != 1 or doc[0].whitespace_:
            raise ValueError("Can't handle a doc with more than one token or with trailing whitespace.")
        # Retokenize the document into three tokens
        with doc.retokenize() as retokenizer:
            # Grab the first (and only) token of the doc
            token = doc[0]
            # Define where we'll split the token
            split_indices = [span_start, span_end]
            left_indices = [None] + split_indices
            right_indices = split_indices + [None]
            # Split the token
            orths = [token.text[left:right] for left, right in zip(left_indices, right_indices)]
            heads = [(token, index) for index in range(len(split_indices) + 1)]
            retokenizer.split(token, orths, heads, {})
    # Create the span
    doc.spans["sc"].append(doc.char_span(span_start, span_end))
    return doc
# Create an empty pipeline
nlp = spacy.blank("en")
# Create an example with aligned spans
aligned_reference = create_annotated_doc(nlp, "AAA", (0, 3))
aligned_predicted = create_annotated_doc(nlp, "AAA", (0, 3))
aligned_example = spacy.training.Example(reference=aligned_reference, predicted=aligned_predicted)
# Create an example with unaligned spans
unaligned_reference = create_annotated_doc(nlp, "AAABBBCCC", (0, 3))
unaligned_predicted = create_annotated_doc(nlp, "AAABBBCCC", (3, 6))
unaligned_example = spacy.training.Example(reference=unaligned_reference, predicted=unaligned_predicted)
# Score the examples
scores = spacy.scorer.Scorer.score_spans([aligned_example, unaligned_example], "sc", getter=lambda doc, attr: doc.spans[attr])
# Display the scores
print(scores)
In this example, we should have 1 true positive, 1 false negative and 1 false positive, giving us a precision of 0.5 and recall of 0.5. However, on my machine this script prints:
$ poetry run python test.py
{'sc_p': 1.0, 'sc_r': 0.5, 'sc_f': 0.6666666666666666, 'sc_per_type': {'': {'p': 1.0, 'r': 0.5, 'f': 0.6666666666666666}}}
The precision of 1.0 is inflated: the unalignable predicted span from the second example was silently dropped rather than counted as a false positive.
There are two ways I can think of that this could be fixed:
- Instead of immediately iterating over the output of Example.get_aligned_spans_x2y, somehow identify "unalignable" spans (?) and count them as false positives in the appropriate categories (see the sketch after this list)
- Create a shared alignment that can represent spans from both reference and predicted documents, and use this shared alignment to align and compare reference and predicted spans
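As a rough illustration of the first idea, here's a hand-rolled sketch (not spaCy API; the function name and unlabeled PRF bookkeeping are made up for this example) that counts spans dropped during alignment as false positives:

from spacy.training import Example

def score_spans_counting_unaligned(examples: list[Example], spans_key: str = "sc") -> dict:
    """Unlabeled span PRF where unalignable predicted spans count as false positives."""
    tp = fp = fn = 0
    for example in examples:
        gold = {(span.start, span.end) for span in example.reference.spans[spans_key]}
        predicted_spans = list(example.predicted.spans[spans_key])
        aligned = example.get_aligned_spans_x2y(predicted_spans, allow_overlap=True)
        # Spans dropped during alignment are unalignable: count them as false positives
        fp += len(predicted_spans) - len(aligned)
        pred = {(span.start, span.end) for span in aligned}
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"p": precision, "r": recall, "f": f_score}

On the two constructed examples above, this should report p = 0.5 and r = 0.5.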
Or maybe there are other ways that would be easier! 😄
Thanks for maintaining spaCy! 🎉
Your Environment
- spaCy version: 3.6.0
- Platform: macOS-13.4.1-arm64-arm-64bit
- Python version: 3.11.2
- Environment Information: Poetry 1.5.1 virtual environment
Hey @connorbrinton, thanks for the detailed report.
I definitely understand the issue. And yeah, an easy solution could be to count the unaligned spans as false positives (or at least make that an option). We'll have a closer look at this 🙏
I can see why this was confusing/unexpected, and I agree that there isn't really one single right thing to do here. If the tokenization doesn't line up, you just don't know whether the prediction could have been correct or not. A general example would be something like nested NER spans where your reference data only contains the outermost span because of its tokenization, so you can't tell whether a nested predicted span is correct or not. Take a made-up German hyphenated [ORG [PER Name]-Stiftung] ("foundation"): if your predicted tokenization splits on "-" and you predict only the PER span, is that a false positive?
I don't think I'd want to change the current default behavior here. I could possibly see adding an option to do this, but I wouldn't be hugely in favor of it because I think it gets hard to interpret and could be difficult to explain to users. I might suggest using a custom scorer instead if you know that all of these are false positives on your end? Or oversplitting a bit in your tokenizer to have the data align better if you want to use the built-in scorers? (In general I might have suggested retokenizing before scoring so you don't have to change your models, but be aware that doc.spans isn't currently handled during retokenization; the open issue is #12024.)
I can definitely understand wanting to preserve the current behavior. Changing this could potentially affect any evaluations where the tokenization is different between the reference and predicted documents.
Would it make sense to print a warning when an unaligned predicted span is ignored during evaluation? This would allow the user to recognize that the evaluation may not be measuring exactly what they expected.
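For example, the check itself could look something like this (purely a sketch of the idea, not spaCy code; the helper name is made up):

import warnings
from spacy.training import Example

def warn_on_dropped_spans(example: Example, spans_key: str = "sc") -> None:
    # Compare how many predicted spans go in vs. how many survive alignment
    predicted_spans = list(example.predicted.spans.get(spans_key, []))
    aligned_spans = example.get_aligned_spans_x2y(predicted_spans)
    dropped = len(predicted_spans) - len(aligned_spans)
    if dropped:
        warnings.warn(
            f"{dropped} predicted span(s) in spans['{spans_key}'] could not be aligned "
            "to the reference tokenization and were ignored during scoring."
        )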
Yeah, I think there needs to be some way to highlight when the tokenization may cause problems for further evaluation steps. For token-level features like .tag, any misaligned tokens are counted as wrong, so the tokenizer performance is an upper limit on the tagger performance. But for spans, I can't think of a good way to do the same thing in the general case.
If you run Language.evaluate you always get a tokenization evaluation, but it's also possible that the tokenizer performance is only problematic in places that don't matter at all for your particular spans, or vice versa.
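For instance, reusing the aligned_example / unaligned_example objects from the reproduction script above (a small sketch), the tokenization agreement for the same examples can be checked with Scorer.score_tokenization:

from spacy.scorer import Scorer

# Low token precision/recall here hints that span alignment may be lossy for these examples
print(Scorer.score_tokenization([aligned_example, unaligned_example]))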
Let us think about this a bit more, or let us know if you have any additional ideas!
As a note, I think warnings get tricky because they'd show up in spacy train output.