ads/bibcode bug

Open tmorrell opened this issue 3 years ago • 1 comment

The example bibcode 1992ApJ...400L...1W from https://ui.adsabs.harvard.edu/help/actions/bibcode fails the validation check.

import idutils
print(idutils.is_ads('1992ApJ…400L…1W'))

Jan 20 '23 18:01 tmorrell

This is an unfortunate example from their docs: each run of three dots in the identifier was rendered as the single ellipsis character … (U+2026). On our side, we might consider normalizing strings to be more forgiving:

import unicodedata
unicodedata.normalize("NFKD", "1992ApJ…400L…1W")
# '1992ApJ...400L...1W'

For Bibcode, we can apply this during both parsing and normalizing as a fix (I can't imagine any strange characters for which this would change what the string actually resolves to).
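
A minimal sketch of that idea, using only the standard library. The helper name normalize_bibcode is hypothetical and not part of idutils; the length check relies on ADS bibcodes being 19 characters long.

```python
import unicodedata

BIBCODE_LENGTH = 19  # ADS bibcodes are fixed-width


def normalize_bibcode(value):
    """Hypothetical helper: fold compatibility characters such as U+2026 into ASCII."""
    return unicodedata.normalize("NFKD", value.strip())


pasted = "1992ApJ\u2026400L\u20261W"  # as copied from the help page, with ellipsis characters
fixed = normalize_bibcode(pasted)

print(len(pasted), len(fixed))
print(len(fixed) == BIBCODE_LENGTH)
```

The same call could run at the start of both the validation and the normalization step, so the two always see the same string.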

On a side note, we should consider applying Unicode normalization to all types of identifiers. Users often copy-paste identifiers from PDF papers or other formatted sources, and these frequently carry all sorts of weird characters with them (em-dashes in DOIs, zero-width spaces, etc.)
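
A rough sketch of what such a generic cleanup could look like. Note that NFKD on its own leaves em dashes and zero-width spaces untouched, so this version adds an explicit translation table. The helper name clean_identifier and the character lists are assumptions for illustration, not existing idutils code, and the lists are not exhaustive.

```python
import unicodedata

# Assumed, non-exhaustive set of dash-like characters mapped to ASCII "-".
_DASHES = dict.fromkeys(map(ord, "\u2010\u2011\u2012\u2013\u2014\u2212"), "-")
# Zero-width characters and the BOM are deleted (mapped to None).
_INVISIBLE = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"), None)
_TABLE = {**_DASHES, **_INVISIBLE}


def clean_identifier(value):
    """Hypothetical helper: compatibility-normalize, then fix dashes and invisibles."""
    normalized = unicodedata.normalize("NFKD", value)
    return normalized.translate(_TABLE).strip()


# Made-up DOI with an em dash and a trailing zero-width space.
print(clean_identifier("10.1000/abc\u2014123\u200b"))
# Bibcode with ellipsis characters, as in the report above.
print(clean_identifier("1992ApJ\u2026400L\u20261W"))
```

Whether dash folding is safe for every identifier scheme would need checking per scheme; for some it may be better to apply only the NFKD step.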

Sep 16 '23 20:09 slint