Handling white spaces in journal names.

Open bbernicker opened this issue 3 years ago • 3 comments

While testing Eyecite today, I noticed that some law review citations in my dataset are missing a space inside "L.Rev." and/or between the name of the law review and "L.Rev." or "L. Rev." For example, Strickland v. Washington cites 58 N.Y.U.L.Rev. 299; 83 Colum.L.Rev. 1544; 93 Harv.L.Rev. 752; and 50 U.Chi.L.Rev. 138.

I was curious whether ignoring white space in the names of journals (and maybe reporters and laws, for that matter) would help improve detection, especially with OCR'd files. Alternatively, does it make sense to specify alternative versions of journal names with some or all of their white space removed? Or else to change "L.Rev." to "L. Rev." in Eyecite's clean module?
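To make the last option concrete, here is a minimal sketch of what such a cleaning step might look like. It is a standalone function, not code from Eyecite; the name `normalize_law_review_spacing` is hypothetical, and it only handles the "L. Rev." case (it leaves "U.Chi." as-is).

```python
import re

# Hypothetical helper, not part of Eyecite: a sketch of a text-cleaning step.
# Step 1 pattern: "L." followed by optional white space and "Rev."
_L_REV = re.compile(r"\bL\.\s*Rev\.")
# Step 2 pattern: the zero-width spot between a period and "L. Rev."
_MISSING_SPACE = re.compile(r"(?<=\.)(?=L\. Rev\.)")


def normalize_law_review_spacing(text: str) -> str:
    """Rewrite "L.Rev." as "L. Rev." and put a space before it if missing."""
    # Normalize the spacing between "L." and "Rev." to a single space.
    text = _L_REV.sub("L. Rev.", text)
    # Insert a space when the journal name runs straight into "L. Rev."
    return _MISSING_SPACE.sub(" ", text)


if __name__ == "__main__":
    for cite in ("58 N.Y.U.L.Rev. 299", "83 Colum.L.Rev. 1544"):
        print(normalize_law_review_spacing(cite))
```

Text that already has the expected spacing passes through unchanged, so a step like this could run before citation extraction without affecting well-formed citations.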

bbernicker avatar Aug 09 '22 22:08 bbernicker

I haven't looked at the code for this specifically, but yeah, some sort of solution is needed. I forget how journal names are identified (I think a regex?). In general, it's easier to tweak our journal/statute/citation-specific regex than it is to do things like whitespace stripping (which tends to be less granular).

mlissner avatar Aug 10 '22 21:08 mlissner

Maybe I could go through the regex and replace " L. Rev." with "\s?L.\s*Rev." This would allow a match whether or not there is a space before the "L." and whether "L." and "Rev." are adjacent or separated only by white space. It would not help with missing spaces in journal names that do not contain "L. Rev." (e.g., "Admin. L.J. Am. U." would still match, but "Admin.L.J.Am.U." would not), but it would at least be a step in the right direction.
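Roughly, the idea could be sketched like this. This is only an illustration under assumptions: `relaxed_journal_pattern` is a hypothetical helper, it does not reflect how Eyecite actually builds its journal regexes, and the periods are escaped here so that "." matches only a literal period.

```python
import re

L_REV_SUFFIX = " L. Rev."
# Optional single space before "L.", any amount of white space before "Rev."
RELAXED_SUFFIX = r"\s?L\.\s*Rev\."


def relaxed_journal_pattern(journal_name: str) -> re.Pattern:
    """Build a regex for a journal name, relaxing only a trailing " L. Rev."."""
    if not journal_name.endswith(L_REV_SUFFIX):
        # Names without the suffix keep their exact spacing.
        return re.compile(re.escape(journal_name))
    prefix = journal_name[: -len(L_REV_SUFFIX)]
    return re.compile(re.escape(prefix) + RELAXED_SUFFIX)


if __name__ == "__main__":
    harv = relaxed_journal_pattern("Harv. L. Rev.")
    print(bool(harv.search("93 Harv.L.Rev. 752")))    # missing spaces
    print(bool(harv.search("93 Harv. L. Rev. 752")))  # standard spacing

    admin = relaxed_journal_pattern("Admin. L.J. Am. U.")
    print(bool(admin.search("Admin.L.J.Am.U.")))      # not covered
```

The trade-off is the one described above: only names ending in " L. Rev." become tolerant of missing spaces, and everything else still requires the exact spacing.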

bbernicker avatar Aug 12 '22 16:08 bbernicker

@flooie Can you take over review on this one, please? (Sorry @bbernicker I just know he'll have better opinions on this codebase.)

mlissner avatar Aug 12 '22 16:08 mlissner