hledger icon indicating copy to clipboard operation
hledger copied to clipboard

CSV: match group references get mixed up when multiple ifs match a record

Open amoralesch opened this issue 2 years ago • 7 comments

When reading CSV and rules files and using REGEX expressions, if more than one if block matches the same record, the REGEX group index gets mixup.

Example:

  • CSV file:
date,desc
2019-01-01,SUFFIX First
2019-02-01,SUFFIX Text 1 - Text 2
  • Rules file:
skip          1

fields        date,desc
separator     ,
date-format   %Y-%m-%d
description   Test

if %desc SUFFIX (.*)
  account1      account:\1
  amount1       1
  account2      equity

if %desc SUFFIX (.*) - (.*)
  account1      account:\1:\2
  amount1       1
  account2      equity

Expected Result

2019-01-01 Test
    account:First               1
    equity

2019-02-01 Test
    account:Text 1:Text 2               1
    equity

Current Result

2019-01-01 Test
    account:First               1
    equity

2019-02-01 Test
    account:Text 1 - Text 2:Text 1               1
    equity

Note

If the rules file is modified with the following:

if %desc SUFFIX (.*) - (.*)
  account1      account:\2:\3
  amount1       1
  account2      equity

(note the index for matching groups (\2 and \3), then the result works as expected.

More info

HLedger version: hledger 1.32.2, mac-aarch64

[Background: wish #2009, pr #2087. Workaround: don't allow multiple ifs with capture groups to match the same CSV record. So if you use capture groups, make those regexps specific enough that at most one of them will match any given CSV record.]

amoralesch avatar Jan 21 '24 21:01 amoralesch

Related? https://github.com/simonmichael/hledger/issues/2009

amoralesch avatar Jan 21 '24 21:01 amoralesch

FYI: @jmtd

amoralesch avatar Jan 21 '24 21:01 amoralesch

Interesting: I'll take a look, thank you.

jmtd avatar Jan 22 '24 10:01 jmtd

I'd like to get this one in today's release, and have time to take a look if needed.

simonmichael avatar Jan 27 '24 17:01 simonmichael

I left some comments on the original PR #2087.

the workaround for now is when you use capture groups, make those regexps specific enough that at most one of them will match any given CSV record.

simonmichael avatar Jan 28 '24 04:01 simonmichael

I'd like to get this one in today's release, and have time to take a look if needed.

Sorry -- I've just seen your comment -- but I'm not able to take a look at the moment, I've got too much work on. I might have a chance to start looking at the end of this weekend (FOSDEM/travel back from FOSDEM), but I don't know how long it will take to fully address.

jmtd avatar Jan 31 '24 10:01 jmtd

No worries @jmtd; I went ahead with the release and noted a workaround on the issue.

simonmichael avatar Jan 31 '24 16:01 simonmichael

I just bumped into what I think in the same issue myself and came here to file an issue, only to find this. The "workaround" however doesn't quite apply. My use case is I have multiple rules files, some with very generic stuff (e.g. lots of if %description matches to normalize vendor names from credit card statements). These generic ones are included in other bank specific rules files with more specialized handling. If a rule from the bank file matches for whatever reason as well as one from the vendors mapping, it's possible the capture groups get all scrambled. If a match in vendors was the only match, the capture group might be \1, but if something in the bank rules matched then it might be \2 or \3.

Ideally the back references would be numbered just inside of a single if block, not combined with all the (possibly variable) number of matches from all blocks.

alerque avatar Feb 24 '24 21:02 alerque

Sorry you were bitten by this.

Resolving the semantics and implementation are both rather tricky here. We might need to think about removing this feature until things can be clarified.

simonmichael avatar Feb 24 '24 22:02 simonmichael

Ideally the back references would be numbered just inside of a single if block, not combined with all the (possibly variable) number of matches from all blocks.

For what it’s worth this is how I want it to work too….

Resolving the semantics and implementation are both rather tricky here. We might need to think about removing this feature until things can be clarified.

Agreed; getting the scoping the feature wants might be quite an invasive change and take a lot of time to get right. I would support backing the current implementation out in the meantime.

jmtd avatar Feb 25 '24 12:02 jmtd

Backing it out would leave me in the rather awkward position of having to keep using an old version ;-) Or potentially having to run both new and old versions in the same project depending on what new stuff lands. I have quite a bit of filters relying on this already before I ran into the limitation and would be loath to refactor it all. Not impossible of course, just awkward.

I wander if instead of gutting it, some prominent warning could be thrown if back references are used at all or if the highest number used back-ref is lower than the capture group count or something like that so so people wouldn't be stumped by the bug and would be warned the back-refs behavior will change? Is their any heuristic where a condition meriting a warning was detected that could be easily implemented without refactoring the whole thing?

alerque avatar Feb 25 '24 18:02 alerque

I see, so it's useful enough to keep even as it stands. Then yes, we should figure out some way to warn loudly when it's triggered so users aren't baffled.

simonmichael avatar Feb 25 '24 20:02 simonmichael

I'm glad that people (other than me) are finding this feature useful, even in its current state, that's motivating. Thank you!

It's non-trivial to back this out (cannot clean revert the patch series from the PR). Rather than spend time working on resolving revert conflicts, I was about t suggest we "unhook" the feature, but given your comment working on both a) a warning and b) ultimately fixing the problem is a better use of time IMHO.

jmtd avatar Feb 27 '24 08:02 jmtd

slightly more concise test case

data: 2019-02-01,PREFIX Text 1 - Text 2 rules:

fields        date,description
account2      equity
amount1       1
if %description PREFIX (.*)
 account1      account:\1
if %description PREFIX (.*) - (.*)
 account1      account:\1:\2

expected:

2019-02-01 PREFIX Text 1 - Text 2
    account:Text 1:Text 2               1
    equity                                      -1

Actual:

2019-02-01 PREFIX Text 1 - Text 2
    account:Text 1 - Text 2:Text 1               1
    equity                                      -1

jmtd avatar Feb 27 '24 21:02 jmtd

I've tried to make a start on fixing this, by first capturing the problem in a unit test. I could have sworn I wrote some tests for this feature, but none are in the main branch (I'll check if I lost them in a rebase of a dev branch...). Anyway, my WIP is #2179.

jmtd avatar Feb 27 '24 21:02 jmtd

I could have sworn I wrote some tests for this feature

Ah I wrote functests, not unit tests. I've extendd the functest to catch this.

jmtd avatar Feb 28 '24 13:02 jmtd

OK I think I've found the solution. The point in the callstack where we have the necessary context is getEffectiveAssignment, so it's from there we need to resolve the match groups. I've got a hacky patch that hooks renderTemplate in at this stage which fixes this problem, but breaks a bunch of other tests (because renderTemplate also resolves the CSV field references, and doing so earlier on breaks test expectations).

I'm now looking at whether to a) split out the matchgroup-handling bits of renderTemplate and use those from getEffectiveAssignment, or b) determine whether it's OK to resolve the field references at this point in the call stack and if so, update the newly-failing tests.

jmtd avatar Feb 29 '24 09:02 jmtd

For the record: this was fixed in hledger 1.33.

simonmichael avatar May 12 '24 03:05 simonmichael

Yes thank you I saw it in release notes!

alerque avatar May 12 '24 10:05 alerque