
Proposal: Import retroactively added transactions from CSV files

Open lfos opened this issue 5 months ago • 14 comments

This is a concrete proposal to fix #1943. Without knowing much about how the import functionality is currently implemented (and knowing little about Haskell in general), I'd expect this to be relatively straightforward to add -- even though the description below is a bit lengthy. I grouped the proposed changes into three logical steps below but it might be easier to implement them all at once.

I'm looking forward to feedback!

Goals

  1. Fix import of CSV files that are only "mostly append-only", i.e., where existing entries are immutable and new entries can be inserted anywhere within the last X days.
  2. Maintain backwards-compatibility and keep the import framework as simple as possible.

Why? It is common for some types of accounts (e.g., credit cards) to have transactions be posted with a delay of a few days. The delay may vary for different transactions within the same account.

Change 1 - Keep more context in the .latest state

A new keep-extra-state rule is added to the set of allowed CSV rules. It controls how many extra days are stored in the state file. The default is 0, i.e., the current behavior. The resulting file with three extra days might look as follows:

2025-09-23
2025-09-23
2025-09-22
2025-09-21
2025-09-21
2025-09-21
2025-09-20

When importing a CSV file:

  1. Iterate over CSV transactions and state file records in parallel to ensure they match. (May need to skip over some import transactions first, or skip over some state file records first, depending on which earliest entry lies further in the past).
  2. Error out if there's a mismatch and provide details in the error message (e.g., unexpected new transaction on 2025-09-21 in bank.csv).
  3. Otherwise, once all state file entries have a match, continue importing as usual and update the state file accordingly.
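The parallel iteration in steps 1-3 could be sketched roughly as follows. This is an illustrative Python sketch, not hledger's actual implementation; dates are ISO strings sorted ascending (oldest first), so the on-disk state file would simply be read in reverse:

```python
def match_state(csv_dates, state_dates):
    """Check a downloaded CSV against the dates kept in the .latest state.

    Both lists are ISO date strings sorted ascending (oldest first).
    Returns the dates of genuinely new records, or raises on a mismatch.
    (Sketch only; names do not correspond to hledger internals.)
    """
    if not state_dates:
        return list(csv_dates)
    if not csv_dates:
        return []
    i = j = 0
    # Skip CSV records older than anything the state file remembers.
    while i < len(csv_dates) and csv_dates[i] < state_dates[0]:
        i += 1
    # Skip state records older than the CSV window (short download).
    while j < len(state_dates) and state_dates[j] < csv_dates[0]:
        j += 1
    # The overlapping region must match one-to-one.
    while j < len(state_dates):
        if i >= len(csv_dates) or csv_dates[i] != state_dates[j]:
            raise ValueError(f"unexpected transaction near {state_dates[j]} in bank.csv")
        i += 1
        j += 1
    # Whatever remains in the CSV is new and gets imported.
    return csv_dates[i:]
```

With dates only, a retroactive insert on an older kept date shows up as a count mismatch (an error), while an insert on the most recent date slips through as "new", which is exactly the remaining gap Change 2 addresses.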

Why? This change results in detection (but not import) of most retroactively added entries. Entries retroactively inserted on the most recent date in the state file are not yet detected. They still result in duplicate/incorrect imports.

Alternatives:

  • Keep all imported transaction dates, and make the CSV rule an on/off toggle.
  • Support both "number of extra days" (keep-extra-state 3d) and "number of extra transactions" (keep-extra-state 50) in the syntax.
  • Store state dates in ascending order in the state file instead.

Change 2 - Add optional transaction identifiers to the state file

A new state-descriptor hledger field is added. It can be set in a rule file to store additional information with transactions. This is up to the user, and how best to do this depends on the information available in the CSV file. E.g., if the bank provides a stable and unique transaction ID, state-descriptor %transaction_id may be a good approach. For other banks, state-descriptor %description %amount may be good enough.

It is not important that the descriptor is unique; what matters is that two entries on the same date with the same descriptor are interchangeable when importing. For example, if exactly the same charge is made twice on a day, there will be two identical lines in the state file, and that's okay.

The field is then stored in the state file, together with dates, e.g.,

2025-09-23 88ce31a
2025-09-23 0882689
2025-09-22 dc460da
2025-09-21 4ad72c4
2025-09-21 82231e2
2025-09-21 8e688e0
2025-09-20 1f2778a
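Short descriptors like those above could, for instance, be derived by hashing the user-chosen fields. A minimal sketch; the 7-hex-digit form merely mimics the example above and is not an actual hledger behavior:

```python
import hashlib

def state_descriptor(*fields):
    """Derive a short, stable descriptor from user-chosen CSV fields.

    Uniqueness is not required: two identical charges on one day simply
    produce two identical state-file lines, which is fine for matching.
    """
    # Join with a unit separator so ("a", "b") and ("ab", "") differ.
    raw = "\x1f".join(fields)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:7]

# A state-file line might then be built as:
# line = f"{csv_date} {state_descriptor(description, amount)}"
```

The same input fields always yield the same descriptor, so repeated downloads of unchanged records keep matching their state-file entries.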

When importing a CSV file, the pattern matching described in the previous section now also compares the descriptors. The error messages can be more detailed, e.g., referring to the exact transaction with a mismatch instead of only the date.

Why? This further improves validation of the "append-only" nature of CSV files, and detects violations in all cases, including entries retroactively inserted on the most recent date in the state file.

Change 3 - Import retroactively inserted entries

Instead of matching the state file against the CSV file as a consecutive block, perform "scattered" pattern matching:

  • Iterate over the CSV records and state file in parallel (skipping a prefix as needed) but allow additional non-matching entries in the CSV file. Those non-matching entries are exactly those that end up getting imported.
  • The only requirement is that if a new date is reached in the CSV file, all state file records for the same date must have been processed. Otherwise, an error is displayed (e.g., unexpected transaction in bank.csv: <details>).
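The scattered matching could be sketched like this (Python; entries are (date, descriptor) pairs sorted ascending by date, and the prefix-skipping from Change 1 is omitted for brevity; names are illustrative only):

```python
def scattered_match(csv_entries, state_entries):
    """Match state entries as a subsequence of the CSV records.

    Both lists are (iso_date, descriptor) pairs sorted ascending by date.
    CSV entries with no state counterpart are new and returned for import;
    a state entry left behind once the CSV moves past its date signals an
    unexpected change/deletion.  (Sketch only, not hledger internals.)
    """
    new = []
    j = 0
    for entry in csv_entries:
        d, _desc = entry
        # Before processing a later date, every state record on earlier
        # dates must already have found its match in the CSV.
        if j < len(state_entries) and state_entries[j][0] < d:
            raise ValueError(f"unexpected transaction in bank.csv: state entry {state_entries[j]} has no match")
        if j < len(state_entries) and state_entries[j] == entry:
            j += 1             # already imported, skip
        else:
            new.append(entry)  # retroactively inserted or genuinely new
    if j < len(state_entries):
        raise ValueError(f"unexpected transaction in bank.csv: state entry {state_entries[j]} has no match")
    return new
```

Note that without descriptors every same-day entry compares equal, so non-matching entries accumulate at the end of each day, reproducing the "append at the end of each day" behavior mentioned in the remark below.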

Why? Retroactively inserted transactions are now imported properly. Moreover, any unexpected changes/deletions in already imported transactions are detected.

Remark: When no transaction identifiers (as proposed in the previous section) are used, this logic automatically behaves as "append at the end of each day". This is consistent with current behavior.

lfos avatar Sep 23 '25 17:09 lfos

Note that rule/field names and other implementation details are just placeholders. I'm very much open to naming things differently or implementing things with a slightly different approach.

lfos avatar Sep 23 '25 17:09 lfos

First note: the overlap detection of import and print --new is most often used with CSV files, but it's not specific to CSV; I'll assume "input file" instead of CSV file.

"retroactively added transactions" - is "non-date-ordered transactions" an equivalent way of saying it ?

I haven't understood all in full detail, but these are interesting ideas - thanks for writing them up!

I have the impression it'll be tricky to make them correct and robust for all cases.

I understand this feature would be useful for people with this problem, but how widespread do you think the problem is ? I haven't seen it discussed often. Is there any cheaper workaround, like preprocessing such csv files in some way ?

I won't be able to work on it any time soon myself, but would be happy to see someone explore it.

simonmichael avatar Sep 23 '25 18:09 simonmichael

"retroactively added transactions" - is "non-date-ordered transactions" an equivalent way of saying it ?

PS another problem case is records which are date-ordered but not stably-ordered, ie the relative positions of records on the same day might change from one download to the next. I'm not sure if that's being addressed here.

simonmichael avatar Sep 23 '25 18:09 simonmichael

First note: the overlap detection of import and print --new is most often used with CSV files, but it's not specific to CSV; I'll assume "input file" instead of CSV file.

That's fair. I wasn't aware other formats were supported too.

"retroactively added transactions" - is "non-date-ordered transactions" an equivalent way of saying it ?

If "non-date-ordered transactions" means "a set of transactions where new transactions may be inserted over time at any position but the relative order and content of already imported transactions must never change", yes.

It's hard to fully and accurately capture that concept in just 3-4 words.

I haven't understood all in full detail, but these are interesting ideas - thanks for writing them up!

I have the impression it'll be tricky to make them correct and robust for all cases.

I put a fair bit of thought into making things robust and flexible for any real-world cases I've seen so far, but if you (or anyone else) can spot any issues/gaps, please let me know!

I understand this feature would be useful for people with this problem, but how widespread do you think the problem is ? I haven't seen it discussed often. Is there any cheaper workaround, like preprocessing such csv files in some way ?

I think it is very common for credit cards due to the varying delays in posting/processing time, at least in North America and Europe. For example, I believe that all AmEx cards/exports behave this way. (I'll be happy to be proven wrong, my sample size is only 3). I also have imports for Visa cards that behave this way.

I don't know how common it is to use hledger to import credit card transactions in general.

In terms of workarounds, it should be possible to replicate this with a custom state file outside hledger. I might give it a shot. It's a bit unfortunate that deduplication and other import aspects would have to be handled by different tools, likely with different config files, making the overall import process more complex.

The workaround I'm currently using is adding an artificial delay of a few days to affected accounts (see this comment). But that (1) introduces an unnecessary delay in processing/tracking, (2) only works for most - not all - cases as posting can sometimes take up to 30 days and (3) makes reconciliation (i.e., comparing to real-world account balance) quite difficult.

PS another problem case is records which are date-ordered but not stably-ordered, ie the relative positions of records on the same day might change from one download to the next. I'm not sure if that's being addressed here.

I haven't seen this across the 20+ accounts I currently sync with hledger and I'd expect this to be much less common. Are you aware of any real-world examples of this?

lfos avatar Sep 23 '25 19:09 lfos

My two cents as a long-time user and occasional contributor: I feel that any use-case that requires repeated imports of the same input file would be much more easily solved in a way that does not involve import at all.

For example, you could hledger print --rules ... the input CSV file into a journal file, and then include the resulting journal into your main journal. This easily lends itself to automation, does not require any duplicate detection, and will work across a wide range of changes/reordering of records in the input file as well.

Anecdotal evidence is just this -- anecdotal, but this approach works very well for me in practice.

adept avatar Sep 23 '25 19:09 adept

My two cents as a long-time user and occasional contributor: I feel that any use-case that requires repeated imports of the same input file would be much more easily solved in a way that does not involve import at all.

For example, you could hledger print --rules ... the input CSV file into a journal file, and then include the resulting journal into your main journal. This easily lends itself to automation, does not require any duplicate detection, and will work across a wide range of changes/reordering of records in the input file as well.

Anecdotal evidence is just this -- anecdotal, but this approach works very well for me in practice.

The problem is that many banks only allow you to export X days of recent transaction history. You're not importing "the same" input file repeatedly, but many incomplete CSV files with partial overlaps that together make up the full list of transactions.

lfos avatar Sep 23 '25 19:09 lfos

The problem is that many banks only allow you to export X days of recent transaction history. You're not importing "the same" input file repeatedly, but many incomplete CSV files with partial overlaps that together make up the full list of transactions.

I suppose I can see how this could be a problem when X is quite small. However, I've yet to find a case when this is a problem in practice. For instance, for AmEx it is possible to export the last 3 (or is it 4?) years of transactions, so exporting the "current year to date" throughout the year, and then at the start of the year having to export the previous year once to make sure that nothing was missed is enough.

For credit cards that provide monthly statements (either PDF or CSV), I often don't bother to import data at my own pace; I just import the statements that the card issuer provides - and know that there would be no overlap/duplication between them.

More generally, I don't think I ever saw a credit card issuer that would not allow me to export at least the last year of transactions, but I suppose there is no reason to assume that they all would be at least this reasonable :)

adept avatar Sep 23 '25 20:09 adept

More generally, I don't think I ever saw a credit card issuer that would not allow me to export at least the last year of transactions, but I suppose there is no reason to assume that they all would be at least this reasonable :)

Interesting. I just checked, and while not all accounts allow me to do that, at least those with the "retroactively added transactions"/"non-date-ordered transactions" issue do.

How do you deal with classifying expenses (and other accounts)? I have most transactions auto-categorized via CSV rules but sometimes need to make edits or add account names where auto-categorization failed. Do you have automation to transfer that information from the previous import? (If yes, how does it work?)

Thanks for the suggestion!

lfos avatar Sep 23 '25 22:09 lfos

Thanks for the comments. Some more:

"retroactively added transactions" - is "non-date-ordered transactions" an equivalent way of saying it ?

If "non-date-ordered transactions" means "a set of transactions where new transactions may be inserted over time at any position but the relative order and content of already imported transactions must never change", yes.

It's hard to fully and accurately capture that concept in just 3-4 words.

Yes it is!

A simple list of CSV records downloaded repeatedly has quite a few possible "problematic" behaviours.

I understand this feature would be useful for people with this problem, but how widespread do you think the problem is ? I haven't seen it discussed often. Is there any cheaper workaround, like preprocessing such csv files in some way ?

I think it is very common for credit cards due to the varying delays in posting/processing time, at least in North America and Europe. For example, I believe that all AmEx cards/exports behave this way. (I'll be happy to be proven wrong, my sample size is only 3). I also have imports for Visa cards that behave this way.

I haven't been downloading from credit cards much. My sample size is only 1, a chase card, where I haven't seen this problem.

I don't know how common it is to use hledger to import credit card transactions in general.

I would guess it's very common.

In terms of workarounds, it should be possible to replicate this with a custom state file outside hledger. I might give it a shot. It's a bit unfortunate that deduplication and other import aspects would have to be handled by different tools, likely with different config files, making the overall import process more complex.

Preprocessing is quite easy to do from hledger rules now. But even if cumbersome scripts are required, prototyping and validating the approach outside hledger first could be useful.

The workaround I'm currently using is adding an artificial delay of a few days to affected accounts (see this comment). But that (1) introduces an unnecessary delay in processing/tracking, (2) only works for most - not all - cases as posting can sometimes take up to 30 days and (3) makes reconciliation (i.e., comparing to real-world account balance) quite difficult.

Yes I can see that delay would be annoying and limiting.

PS another problem case is records which are date-ordered but not stably-ordered, ie the relative positions of records on the same day might change from one download to the next. I'm not sure if that's being addressed here.

I haven't seen this across the 20+ accounts I currently sync with hledger and I'd expect this to be much less common. Are you aware of any real-world examples of this?

Yes I have seen this occasionally with certain banks/institutions (I forget which - Bank of Ireland might be one). Usually it's rare, but it's something to consider when evaluating strategies.

My two cents as a long-time user and occasional contributor: I feel that any use-case that requires repeated imports of the same input file would be much easily solved in a way that does not involve import at all.

The csv first approach, vs journal first. It's good for some people, but not a magic bullet:

  • As @lfos says, not all download sources (and they're not always banks) provide long term or reliable storage of your history as CSV. I have seen cases where it's ~3 months or less. Also the policy can change at any time without warning; if you didn't download before that you'll be out of luck.

  • So I think the csv-first PTA user must take responsibility for downloading and storing the full history, taking care not to overlap or to leave gaps each time they download. This can be error-prone and tedious.

  • As @adept says, downloading only at statement boundaries (if that's possible / reliably synced with the CSV records) could be one way to reduce the effort. But that sacrifices timely updates and day-to-day awareness.

  • A journal file is strictly more expressive and more direct than CSV files plus rules and conversion scripts and software. It handles special cases more easily, and it's more human-readable, version-control-readable, and future-proof.

(Insert advantages of csv-first here) (Editing this comment to keep them in one place:)

Potential advantages of csv-first:

  • A journal's greater expressiveness and flexibility can be good or bad. More self-discipline and/or occasional cleanups can be needed to avoid creating inconsistent/messy entries. Limiting yourself to CSV records and conversion rules enforces some extra constraint and consistency.

simonmichael avatar Sep 24 '25 02:09 simonmichael

Preprocessing is quite easy to do from hledger rules now. But even if cumbersome scripts are required, prototyping and validating the approach outside hledger first could be useful.

I implemented a PoC in Python. The core logic is about 40 lines of code -- the total script around 240 lines of code, most of it replicating hledger rules parsing, as I wanted to avoid introducing another config file. I ran a few tests and it's working great so far. I'll use it a bit more and might polish/share it in a few days if there's interest.

lfos avatar Sep 24 '25 03:09 lfos

Hi Lukas! Thanks for working on this!

PS another problem case is records which are date-ordered but not stably-ordered, ie the relative positions of records on the same day might change from one download to the next. I'm not sure if that's being addressed here.

Another edge case to consider is that banks may also randomly rename the description for old transactions between CSV exports. I suppose it's up to the user to decide how to generate the state-descriptor field, but if the bank description isn't stable, what would you use? It's not safe to only use the transaction amount and expect no two transactions to have the same amount on any given day.

I don't quite understand Change 3 - Import retroactively inserted entries. Perhaps it will be clear to me when I see your POC.

In case you @lfos haven't seen it, check out @adept's https://github.com/adept/full-fledged-hledger. It solves this problem in an elegant and functionally declarative way. After several years of using full-fledged-hledger, I went back to plain hledger only because I wanted to be able to manually edit the generated journal files.

josephmturner avatar Sep 24 '25 06:09 josephmturner

Another edge case to consider is that banks may also randomly rename the description for old transactions between CSV exports. I suppose it's up to the user to decide how to generate the state-descriptor field, but if the bank description isn't stable, what would you use? It's not safe to only use the transaction amount and expect no two transactions to have the same amount on any given day.

How would you (as a human) manually distinguish those transactions? The answer will likely determine how to choose the field.

Note that using only transaction amounts wouldn't be totally unsafe either -- it'd actually be fine to do that if the CSV didn't have any information other than amounts.

In your example, using only amounts would only mean that if

  1. a transaction is imported and
  2. in a second import, another transaction appears on the same date AND before the already imported transaction AND with the same amount,

that new transaction may have an incorrect description in the import. (The first imported transaction would essentially be imported twice.) The imported amounts and account balances would still be correct.

Of course, with an incorrect description, CSV import rules may do the wrong thing, but that brings me back to my original question of how those would distinguish the transactions in the import rules; the answer would likely explain how you need to change your state-descriptor field.

Do you have a concrete real-world example of the scenario you described?

In case you @lfos haven't seen it, check out @adept's https://github.com/adept/full-fledged-hledger. It solves this problem in an elegant and functionally declarative way. After several years of using full-fledged-hledger, I went back to plain hledger only because I wanted to be able to manually edit the generated journal files.

Thanks, I hadn't seen it. It seems to rely heavily on CSV rules. I currently can't envision that being feasible for my setup but might give it a shot at some point to see if it's just a matter of adapting workflows and tooling.

lfos avatar Sep 24 '25 11:09 lfos

Thanks, I hadn't seen it. It seems to rely heavily on CSV rules.

It does! Most of the rules are written in pipe-separated-values (psv) files, which are then imported into "main" CSV rules. This allows the bulk of CSV rules to live in a nice machine-parseable format. I've also made a simple script that aids in going through expenses:unknown and income:unknown transactions, classifying them, and adding new rules to the .psv files.

I've checked, and overall, I have ~2800 CSV rules accumulated this way so far.

adept avatar Sep 24 '25 19:09 adept

How would you (as a human) manually distinguish those transactions? The answer will likely determine how to choose the field.

Note that using only transaction amounts wouldn't be totally unsafe either -- it'd actually be fine to do that if the CSV didn't have any information other than amounts.

In your example, using only amounts would only mean that if

1. a transaction is imported and

2. in a second import, another transaction appears on the same date AND before the already imported transaction AND with the same amount,

that new transaction may have an incorrect description in the import. (The first imported transaction would essentially be imported twice.) The imported amounts and account balances would still be correct.

Of course, with an incorrect description, CSV import rules may do the wrong thing, but that brings me back to my original question of how those would distinguish the transactions in the import rules; the answer would likely explain how you need to change your state-descriptor field.

Do you have a concrete real-world example of the scenario you described?

IIUC, the cases were usually minor changes, like abbreviations/expansions of common words or addition of asterisks, or changes in capitalization. The changes often apply through the entire exported CSV history, whereas the retroactively added transactions were never inserted more than a week after the actual transaction date. It also doesn't happen regularly, perhaps once or twice a year.

While the changes in transaction descriptions broke my use of full-fledged-hledger, I think the approach you propose in OP would work if we don't "keep all imported transaction dates, and make the CSV rule a on/off toggle." If I set keep-extra-state 7d then I'd be mostly unaffected if the descriptions change.

Also, perhaps we could allow users to keep the entire history with a special case like keep-extra-state -1.

Thanks, I hadn't seen it. It seems to rely heavily on CSV rules. I currently can't envision that being feasible for my setup but might give it a shot at some point to see if it's just a matter of adapting workflows and tooling.

I am a fan of the functional approach, but when the inputs are inconsistent it becomes less reliable. When I migrated back to hledger import, I kept a number of @adept's other good ideas, like the import/{acct1,acct2,acct3}/{dirty-csv,clean-csv,rules,journal} folder structure as well as the export folder which contains a number of automatically generated exports each time I import new data.

josephmturner avatar Sep 27 '25 01:09 josephmturner