CSV: match group references get mixed up when multiple ifs match a record
When reading CSV and rules files and using REGEX expressions, if more than one if block matches the same record, the REGEX group index gets mixup.
Example:
- CSV file:
date,desc
2019-01-01,SUFFIX First
2019-02-01,SUFFIX Text 1 - Text 2
- Rules file:
skip 1
fields date,desc
separator ,
date-format %Y-%m-%d
description Test
if %desc SUFFIX (.*)
account1 account:\1
amount1 1
account2 equity
if %desc SUFFIX (.*) - (.*)
account1 account:\1:\2
amount1 1
account2 equity
Expected Result
2019-01-01 Test
account:First 1
equity
2019-02-01 Test
account:Text 1:Text 2 1
equity
Current Result
2019-01-01 Test
account:First 1
equity
2019-02-01 Test
account:Text 1 - Text 2:Text 1 1
equity
Note
If the rules file is modified with the following:
if %desc SUFFIX (.*) - (.*)
account1 account:\2:\3
amount1 1
account2 equity
(note the index for matching groups (\2 and \3), then the result works as expected.
More info
HLedger version: hledger 1.32.2, mac-aarch64
[Background: wish #2009, pr #2087. Workaround: don't allow multiple ifs with capture groups to match the same CSV record. So if you use capture groups, make those regexps specific enough that at most one of them will match any given CSV record.]
Related? https://github.com/simonmichael/hledger/issues/2009
FYI: @jmtd
Interesting: I'll take a look, thank you.
I'd like to get this one in today's release, and have time to take a look if needed.
I left some comments on the original PR #2087.
the workaround for now is when you use capture groups, make those regexps specific enough that at most one of them will match any given CSV record.
I'd like to get this one in today's release, and have time to take a look if needed.
Sorry -- I've just seen your comment -- but I'm not able to take a look at the moment, I've got too much work on. I might have a chance to start looking at the end of this weekend (FOSDEM/travel back from FOSDEM), but I don't know how long it will take to fully address.
No worries @jmtd; I went ahead with the release and noted a workaround on the issue.
I just bumped into what I think in the same issue myself and came here to file an issue, only to find this. The "workaround" however doesn't quite apply. My use case is I have multiple rules files, some with very generic stuff (e.g. lots of if %description matches to normalize vendor names from credit card statements). These generic ones are included in other bank specific rules files with more specialized handling. If a rule from the bank file matches for whatever reason as well as one from the vendors mapping, it's possible the capture groups get all scrambled. If a match in vendors was the only match, the capture group might be \1, but if something in the bank rules matched then it might be \2 or \3.
Ideally the back references would be numbered just inside of a single if block, not combined with all the (possibly variable) number of matches from all blocks.
Sorry you were bitten by this.
Resolving the semantics and implementation are both rather tricky here. We might need to think about removing this feature until things can be clarified.
Ideally the back references would be numbered just inside of a single if block, not combined with all the (possibly variable) number of matches from all blocks.
For what it’s worth this is how I want it to work too….
Resolving the semantics and implementation are both rather tricky here. We might need to think about removing this feature until things can be clarified.
Agreed; getting the scoping the feature wants might be quite an invasive change and take a lot of time to get right. I would support backing the current implementation out in the meantime.
Backing it out would leave me in the rather awkward position of having to keep using an old version ;-) Or potentially having to run both new and old versions in the same project depending on what new stuff lands. I have quite a bit of filters relying on this already before I ran into the limitation and would be loath to refactor it all. Not impossible of course, just awkward.
I wander if instead of gutting it, some prominent warning could be thrown if back references are used at all or if the highest number used back-ref is lower than the capture group count or something like that so so people wouldn't be stumped by the bug and would be warned the back-refs behavior will change? Is their any heuristic where a condition meriting a warning was detected that could be easily implemented without refactoring the whole thing?
I see, so it's useful enough to keep even as it stands. Then yes, we should figure out some way to warn loudly when it's triggered so users aren't baffled.
I'm glad that people (other than me) are finding this feature useful, even in its current state, that's motivating. Thank you!
It's non-trivial to back this out (cannot clean revert the patch series from the PR). Rather than spend time working on resolving revert conflicts, I was about t suggest we "unhook" the feature, but given your comment working on both a) a warning and b) ultimately fixing the problem is a better use of time IMHO.
slightly more concise test case
data: 2019-02-01,PREFIX Text 1 - Text 2
rules:
fields date,description
account2 equity
amount1 1
if %description PREFIX (.*)
account1 account:\1
if %description PREFIX (.*) - (.*)
account1 account:\1:\2
expected:
2019-02-01 PREFIX Text 1 - Text 2
account:Text 1:Text 2 1
equity -1
Actual:
2019-02-01 PREFIX Text 1 - Text 2
account:Text 1 - Text 2:Text 1 1
equity -1
I've tried to make a start on fixing this, by first capturing the problem in a unit test. I could have sworn I wrote some tests for this feature, but none are in the main branch (I'll check if I lost them in a rebase of a dev branch...). Anyway, my WIP is #2179.
I could have sworn I wrote some tests for this feature
Ah I wrote functests, not unit tests. I've extendd the functest to catch this.
OK I think I've found the solution. The point in the callstack where we have the necessary context is getEffectiveAssignment, so it's from there we need to resolve the match groups. I've got a hacky patch that hooks renderTemplate in at this stage which fixes this problem, but breaks a bunch of other tests (because renderTemplate also resolves the CSV field references, and doing so earlier on breaks test expectations).
I'm now looking at whether to a) split out the matchgroup-handling bits of renderTemplate and use those from getEffectiveAssignment, or b) determine whether it's OK to resolve the field references at this point in the call stack and if so, update the newly-failing tests.
For the record: this was fixed in hledger 1.33.
Yes thank you I saw it in release notes!