NotesReview icon indicating copy to clipboard operation
NotesReview copied to clipboard

Cleanup linkify regex patterns (efficiency)

Open Lee-Carre opened this issue 3 years ago โ€ข 5 comments

Inspired by ENT8R/NotesReview#65:

  • Avoid groups being capture-groups (memory-usage)
  • more compact file-extension matching (especially similar extensions)
  • replace .+ with more specific URL-only set
  • match more possible variations (robustness & future-proofing)

Lee-Carre avatar Feb 24 '22 14:02 Lee-Carre

Having read & thought a little more, I now suspect that my making all groups non-capturing will sabotage the next block of code for applying special treatment to some URLs.

I'll undo this and see if I can figure out (on a phone) how to add those commits to this PR (else I'll close this and open anew).

Edit: done.

Lee-Carre avatar Feb 24 '22 14:02 Lee-Carre

@ENT8R

I'm working on refining the patterns further.

To make the patterns more specific (matching only valid URLs (avoiding false-positives), and doing so efficiently (since each pattern is run against each queried (OSM-)note)), it would be helpful to use backreferences.

  • Which regex parser / interpreter does your backend use? I would guess JavaScript, given that's the language of linkify.js.
  • If you already know; does it implement backreferences, and if so via what syntax? If you don't happen to recall, that's fine; I'll gladly read the documentation for the answer to the first question ๐Ÿ™‚.

Lee-Carre avatar Feb 24 '22 17:02 Lee-Carre

Apologies for the multiple related commits; I'm doing all this on a phone via GH's Web UI.

Lee-Carre avatar Feb 24 '22 20:02 Lee-Carre

Thank you for your work so far, I did not look into the changes in detail yet, as I wanted to answer your questions before:

Having read & thought a little more, I now suspect that my making all groups non-capturing will sabotage the next block of code for applying special treatment to some URLs.

Correct, in some cases, the URLs are transformed to a different URL to provide a smaller image/thumbnail or to map a webpage to the correct image (needed for some Imgur links). As you already noticed, this is done via capture groups.

Which regex parser / interpreter does your backend use? I would guess JavaScript, given that's the language of linkify.js.

If you mean the language the user interface is implemented in and where the image linking takes place (linkify.js), JavaScript is the language being used.

If you already know; does it implement backreferences, and if so via what syntax?

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions and especially https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Groups_and_Ranges are useful if you are interested in further documentation. I also found another page which offers a quick overview of the possibilities: https://javascript.info/regexp-backreferences

ENT8R avatar Feb 27 '22 00:02 ENT8R

Thank you for your work so far

Welcome; contributing what I'm able to ๐Ÿ™‚. Opportunity to learn.

I did not look into the changes in detail yet

Don't worry about review, yet. I'll change the status of the PR (from draft to open; I presume that'll notify you(?), since you're likely โ€˜watchingโ€™ โ€˜all activityโ€™) when I feel it's ready for scrutiny & consideration for merging.

Correct, in some cases, the URLs are transformed to a different URL to provide a smaller image/thumbnail or to map a webpage to the correct image (needed for some Imgur links). As you already noticed, this is done via capture groups.

๐Ÿ‘ an elegant approach.

The necessary groups are now capturing, again. Embarrassment averted.

If you mean the language the user interface is implemented in and where the image linking takes place (linkify.js), JavaScript is the language being used.

Ah, it happens client-side; OK. Either way, thanks for the confirmation of which flavour of regex to use. I don't have the means to test changes (otherwise it would also mean that I could ensure that all the changes would be in a single commit), so I want to be extra-careful that what I submit is at least valid syntax.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions and especially https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Groups_and_Ranges are useful if you are interested in further documentation. I also found another page which offers a quick overview of the possibilities: https://javascript.info/regexp-backreferences

Thankyou muchly. Mozilla does a decent job of documentation ๐Ÿ™‚.

Besides backreferences for deduplicating code (some parts of e.g. wiki URLs repeat), I want to use named backreferences (or, rather, backreferences to named groups) for easier readability (I'm thinking of when future changes are necessary). Almost like comments in code, or well-named variables & functions.

I'll try to do that with fewer commits, though ๐Ÿ˜‹.

Lee-Carre avatar Feb 27 '22 21:02 Lee-Carre