ValueError: [E102] Can't merge non-disjoint spans. - Dutch

Open kilzone opened this issue 3 years ago • 3 comments

Error message

ValueError: [E102] Can't merge non-disjoint spans. 'opvlamming' is already part of tokens to merge. If you want to find the longest non-overlapping spans, you can use the util.filter_spans helper: https://spacy.io/api/top-level#util.filter_spans
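
The helper named in the error message keeps the longest span whenever two candidates share a token. As a rough illustration of that selection rule, the sketch below works on plain (start, end) token offsets so that it runs without spaCy or a model. `longest_non_overlapping` is a hypothetical name, not part of spaCy, and the tie-breaking (the earlier span wins) is my reading of the documented behaviour rather than a copy of the library source.

```python
from typing import Iterable, List, Tuple

Offsets = Tuple[int, int]  # (start, end) token indices, end exclusive


def longest_non_overlapping(spans: Iterable[Offsets]) -> List[Offsets]:
    """Keep the longest spans that share no token (hypothetical helper)."""
    # Longest first; among equal lengths, prefer the span that starts earlier.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept: List[Offsets] = []
    seen = set()  # token indices already claimed by a kept span
    for start, end in ordered:
        if end <= start:
            continue  # ignore empty or inverted spans
        tokens = range(start, end)
        if not any(i in seen for i in tokens):
            kept.append((start, end))
            seen.update(tokens)
    # Return the survivors in document order.
    return sorted(kept)


# (2, 4) shares token 2 with the longer (0, 3), so it is dropped.
candidates = [(0, 3), (2, 4), (5, 7)]
print(longest_non_overlapping(candidates))
```

In spaCy itself, the equivalent step would be to pass the `Span` objects through `spacy.util.filter_spans` before handing them to `retokenizer.merge`.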

Code (use case)

We're currently using spaCy 3.3 to parse Dutch texts with nl_core_news_lg. I would like the noun chunks to be merged, but the error message above comes up. I've seen several reports of this problem for German, but I'm not sure whether it has also been fixed for Dutch.

Code:

import spacy

parser = spacy.load('nl_core_news_lg')
parser.add_pipe("merge_noun_chunks")
parser('okt opvlamming panuveitiss Oculo sinistra > Oculo dextra en alghele malaise moe, tattoos opgezet. 17-03-2020 Ozurdex Oculo dextra et sinistra , opvlamming onder humir luchtweg ondanks wekelijks ada vele corpus vitreum (glasvocht) troebelingen beschreven')

Info about spaCy

  • spaCy version: 3.3.0
  • Platform: Windows-10-10.0.25120-SP0
  • Python version: 3.7.9
  • Pipelines: nl_core_news_lg (3.3.0)

kilzone avatar May 25 '22 08:05 kilzone

I'm not sure if you are still having this problem, but it works if you use nl_core_news_sm instead. I'm unsure how to solve the underlying issue because I'm unfamiliar with spaCy's code base.

The issue is here: spacy/pipeline/functions.py

    with doc.retokenize() as retokenizer:
        for np in doc.noun_chunks:
            attrs = {"tag": np.root.tag, "dep": np.root.dep}
            retokenizer.merge(np, attrs=attrs)  # type: ignore[arg-type]
    return doc

For this text, one np in doc.noun_chunks is a Span of 9 words, "Oculo dextra en alghele malaise moe, tattoos opgezet.", and this breaks retokenizer.merge(np, attrs=attrs) because it is expecting one Span. If someone has suggestions on how to solve it or where to investigate further, I'd be happy to take a look at it.

abdulrahimq avatar Aug 02 '22 04:08 abdulrahimq

Most of the noun chunk iterators check for overlapping spans, but this check seems to be missing for Dutch. A PR that adds it would be welcome. In general, the check(s) could look similar to this check for English:

https://github.com/explosion/spaCy/blob/2d89dd9db898e66058bf965e1b483b0019ce1b35/spacy/lang/en/syntax_iterators.py#L34-L36
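
The general shape of such a check can be sketched without spaCy: remember where the last emitted chunk ended and skip any candidate whose left edge falls at or before that point. The code below is only an illustration under that assumption, not a copy of the linked lines. `disjoint_chunks` and its (left_edge, root) input pairs are hypothetical stand-ins for what a real iterator would read from the parsed tokens.

```python
from typing import Iterable, Iterator, Tuple


def disjoint_chunks(
    candidates: Iterable[Tuple[int, int]]
) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) chunks, skipping any that overlap the previous one.

    Each candidate is a hypothetical (left_edge, root) pair of token
    indices, given in document order.
    """
    prev_end = -1  # index of the last token covered by an emitted chunk
    for left_edge, root in candidates:
        # A chunk that starts inside the previous one would overlap it.
        if left_edge <= prev_end:
            continue
        prev_end = root
        yield left_edge, root + 1  # end is exclusive


# The second candidate starts at token 1, inside the first chunk, so it is skipped.
print(list(disjoint_chunks([(0, 2), (1, 4), (5, 6)])))
```

Because overlapping candidates are dropped before they are yielded, a later retokenizer.merge over the resulting chunks would only ever see disjoint spans.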

adrianeboyd avatar Aug 02 '22 14:08 adrianeboyd

I think I have fixed it. I'm not fully sure if it makes sense because the code is very new to me. I'm trying to figure out the compilation and whatnot before I do a PR to fix this.

abdulrahimq avatar Aug 02 '22 20:08 abdulrahimq

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

github-actions[bot] avatar Sep 10 '22 00:09 github-actions[bot]