Surprising behavior: Word and Chunk objects do not implement value comparison
While computing deltas between sequences of Word or Chunk objects, I noticed that these objects do not implement value comparison (equality testing) in their __eq__() methods.
Here's an example of what I mean by 'equality testing', using the built-in container type list:
>>> l1, l2 = ["hello"], ["hello"]
>>> l1 is l2 # identity testing
False
>>> l1 == l2 # equality testing
True
And here's Pattern's Sentence object behaving as expected under value and identity comparison:
>>> from copy import copy
>>> from pattern.en import parse, Sentence
>>> sent = Sentence(parse("The elephant sits on the chair"))
>>> sent
Sentence('The/DT/B-NP/O elephant/NN/I-NP/O ... chair/NN/I-NP/I-PNP')
>>> sent is copy(sent) # object identity testing
False
>>> sent == copy(sent) # object value testing
True
Contrast the above with the comparison behavior of Pattern's Word and Chunk objects:
>>> sent # Reusing `sent` from the example above
Sentence('The/DT/B-NP/O elephant/NN/I-NP/O ... chair/NN/I-NP/I-PNP')
>>> word = sent.words[1] # Looking at Word object
>>> word
Word('elephant/NN')
>>> word is copy(word) # identity testing
False # good
>>> word == copy(word) # value testing
False # !!!!! unexpected
>>> chunk = sent.chunks[0]
>>> chunk
Chunk('The elephant/NP')
>>> chunk is copy(chunk) # identity testing
False # good
>>> chunk == copy(chunk) # value testing
False # !!!!! unexpected
This comparison behavior is highly surprising. In both the Word and the Chunk example, the original and its copy hold the same values, and that is what Python's == operator is expected to reflect. Identity is a separate question, answered by the is operator.
I can see that the __eq__() method of both Word and Chunk implements value comparison as identity comparison. Here's the code:
    def __eq__(self, <word/chunk>):
        return id(self) == id(<word/chunk>)
By contrast, Sentence does this as:
    def __eq__(self, other):
        if not isinstance(other, Sentence):
            return False
        return len(self) == len(other) and repr(self) == repr(other)
I'm a big fan of the Pattern object model. However, it might be worth considering extending Sentence's value-based comparison to Word and Chunk.
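To make the suggestion concrete, here is a minimal sketch of value-based equality and why it matters for computing deltas. Token is a hypothetical stand-in class written for this example, not Pattern's Word, and the choice of (string, tag) as the compared values is my assumption. The same pattern could plausibly be applied to Chunk by comparing its tag and its words.

```python
from copy import copy
from difflib import SequenceMatcher


class Token(object):
    """Hypothetical stand-in for a Word-like object (not Pattern's class)."""

    def __init__(self, string, tag=None):
        self.string = string
        self.tag = tag

    def _key(self):
        # Assumption: these two values define what "equal" means here.
        return (self.string, self.tag)

    def __eq__(self, other):
        if not isinstance(other, Token):
            return NotImplemented
        return self._key() == other._key()

    def __ne__(self, other):
        # Python 2 does not derive != from ==, so define it explicitly.
        result = self.__eq__(other)
        return result if result is NotImplemented else not result

    def __hash__(self):
        # Keep hashing consistent with __eq__ (difflib hashes the elements).
        return hash(self._key())

    def __repr__(self):
        return "Token(%r)" % ("%s/%s" % (self.string, self.tag))


token = Token("elephant", "NN")
same_identity = token is copy(token)   # False: two distinct objects
same_value = token == copy(token)      # True: same string and tag

# Delta between two token sequences that share copied (not identical) items.
a = [Token("The", "DT"), Token("elephant", "NN"), Token("sits", "VBZ")]
b = [copy(t) for t in a[:2]] + [Token("stands", "VBZ")]
opcodes = SequenceMatcher(None, a, b).get_opcodes()
# opcodes == [('equal', 0, 2, 0, 2), ('replace', 2, 3, 2, 3)]
```

With identity-based __eq__, the copied tokens in b would not match anything in a, and the whole sequence would be reported as replaced. One caveat with this sketch: the hash depends on mutable attributes, so tokens should not be modified while they are stored in a set or used as dict keys.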
I would heartily upvote this! Hear, hear!