Silent deduplication of entities

Open aricohen93 opened this issue 5 months ago • 0 comments

Hi @percevalw, the default behaviour of get_spans causes entities to be lost when documents are written to disk. I suggest adding a deduplicate argument to the converters, with a default value of False.

For example, here the get_spans function deduplicates values, so fewer entities than expected are written to disk:

https://github.com/aphp/edsnlp/blob/879e34034cebc77ab8d58dd00981f61a3a00e838/edsnlp/data/converters.py#L612
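To illustrate the proposed argument, here is a minimal sketch using plain tuples rather than real spaCy spans. The `collect_spans` helper and the `(start, end, label)` tuples are hypothetical stand-ins, not the actual edsnlp API; only the `deduplicate` flag with a default of False reflects the suggestion above.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a span: (start, end, label).
Span = Tuple[int, int, str]


def collect_spans(
    groups: Dict[str, List[Span]], deduplicate: bool = False
) -> List[Span]:
    """Flatten span groups into one list, optionally dropping exact duplicates."""
    spans = [span for group in groups.values() for span in group]
    if deduplicate:
        # dict.fromkeys keeps the first occurrence and preserves order.
        spans = list(dict.fromkeys(spans))
    return spans


groups = {
    "ents": [(0, 5, "drug"), (10, 18, "dose")],
    "drugs": [(0, 5, "drug")],  # same span as the first one in "ents"
}

print(len(collect_spans(groups)))                    # 3
print(len(collect_spans(groups, deduplicate=True)))  # 2
```

With the default of False, every span that was requested is written; deduplication becomes an explicit choice instead of a silent one.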

Additionally, this line also drops duplicate spans:
https://github.com/aphp/edsnlp/blob/879e34034cebc77ab8d58dd00981f61a3a00e838/edsnlp/data/converters.py#L645

I suggest replacing it with:

for i, ent in enumerate(sorted(spans)):
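A small sketch of the difference, again with hypothetical `(start, end, label)` tuples instead of real spans. It assumes the current line removes duplicates with something like `dict.fromkeys` before sorting; the exact expression at the linked line may differ.

```python
# Hypothetical spans as (start, end, label) tuples; one of them is duplicated.
spans = [(10, 18, "dose"), (0, 5, "drug"), (0, 5, "drug")]

# Assumed current behaviour: duplicates are removed before sorting.
deduplicated = [(i, ent) for i, ent in enumerate(sorted(dict.fromkeys(spans)))]

# Suggested behaviour: sort only, so every span is kept and numbered.
kept = [(i, ent) for i, ent in enumerate(sorted(spans))]

print(len(deduplicated))  # 2
print(len(kept))          # 3
```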

aricohen93 · Nov 09 '25 12:11