Import will miss transactions that show up on a date earlier than the date stored in .latest file

Open mtbar131 opened this issue 3 years ago • 10 comments

Hi,

First of all, thanks a lot for hledger and all of its documentation, it's great!

I started using hledger and noticed a possible bug: some transactions are not imported if they are added on a date that is the same as the date stored in the .latest file of the corresponding import file. Many credit card companies keep a transaction in a pending state for anywhere from 2 to 5 days, but when such a transaction completes it still shows up on its original date (i.e., the date on which it was posted for the very first time) in the statement. This can cause hledger to think that the transaction has already been imported, so it incorrectly skips it.

Here is a concrete example: I buy something using my credit card on 1st Jan 2022. This transaction shows up as a pending transaction on my credit card on the same day. On 1st Jan 2022 I download a CSV of my credit card transactions and import it into hledger. Note that pending transactions aren't part of the CSV, so this transaction doesn't show up in the CSV. hledger imports all the transactions and stores 1st Jan 2022 in the .latest file for future reference. Now this transaction moves to the completed/confirmed state on 5th Jan, but in the credit card statement it still shows up with 1st Jan 2022 as the transaction date. At this point, if I download the CSV from my credit card vendor and import it into hledger, hledger has a .latest file that says all transactions up to 1st Jan 2022 are imported, and so this newly added transaction will never be imported. The same thing happens even if I download and import the CSV on any day between 1st Jan and 5th Jan.
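The failure mode can be sketched in a few lines of Python. This is a deliberately simplified model, not hledger's actual code: it assumes the importer remembers only a single latest date and skips every record dated on or before it. The function name `import_new` and the sample records are hypothetical.

```python
from datetime import date

def import_new(records, latest):
    """Simplified date-cutoff import (an assumption, not hledger's real logic).

    records: list of (date, description, amount) tuples from one CSV download.
    latest:  the remembered latest date, or None on the first import.
    Returns (records to import, updated latest date).
    """
    new = [r for r in records if latest is None or r[0] > latest]
    candidates = [r[0] for r in records]
    if latest is not None:
        candidates.append(latest)
    new_latest = max(candidates) if candidates else None
    return new, new_latest

# CSV downloaded on 3 Jan: the 1 Jan purchase is still pending, so it is absent.
csv_jan3 = [
    (date(2022, 1, 2), "Groceries", "-20.00"),
    (date(2022, 1, 3), "Fuel", "-35.00"),
]
imported, latest = import_new(csv_jan3, None)

# CSV downloaded on 5 Jan: the purchase has cleared and is backdated to 1 Jan.
csv_jan5 = [(date(2022, 1, 1), "Bookshop", "-12.50")] + csv_jan3
imported_again, latest = import_new(csv_jan5, latest)

print(imported_again)  # → []  (the backdated purchase is skipped)
```

The second import returns nothing because the only new record is dated before the remembered cutoff, which is the situation described above.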

I agree that this is a problem with the credit card vendor generated data and not an hledger issue but I think we might be able to make a change to effectively handle this.

Would it be possible to assign a unique transaction ID to each transaction that is imported? This ID would be a hash of the transaction date, description and amount, so it should never collide with another transaction. That way we could get rid of the .latest file and always be sure that no transactions are missed during the import.

Having said that, I understand that maybe such a change is currently not possible. Maybe there is a simpler solution already available. In that case please redirect me to that.

mtbar131 avatar Nov 02 '22 05:11 mtbar131

Thanks!

Yes, it's mentioned in the docs that the built-in deduplication system doesn't handle this case.

I agree your approach would be great in some ways. It's not the one we use because it's actually quite possible for identical-looking transactions, or at least identical CSV records, to occur in practice. A simple example is when a vendor accidentally double-charges your card. But more practical experiments with these alternate approaches are very welcome.
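A minimal sketch of the hash-based idea and of the double-charge problem, for illustration only. The helpers `txn_id` and `import_by_id`, the choice of SHA-256, and the sample records are all assumptions; nothing here is an existing hledger feature.

```python
import hashlib

def txn_id(txn_date, description, amount):
    """Hypothetical transaction ID: a truncated SHA-256 of date, description, amount."""
    key = "\x1f".join([txn_date, description, amount])  # unit separator avoids ambiguous joins
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]

def import_by_id(records, seen_ids):
    """Import records whose ID has not been seen; seen_ids is updated in place."""
    imported = []
    for rec in records:
        rid = txn_id(*rec)
        if rid not in seen_ids:
            seen_ids.add(rid)
            imported.append(rec)
    return imported

seen = set()
first = import_by_id([("2022-01-02", "Groceries", "-20.00")], seen)

# A backdated record is picked up, because dates play no role in skipping.
second = import_by_id(
    [("2022-01-01", "Bookshop", "-12.50"), ("2022-01-02", "Groceries", "-20.00")],
    seen,
)

# A genuine double charge yields two identical records, and one is dropped.
third = import_by_id(
    [("2022-01-06", "Cafe", "-4.00"), ("2022-01-06", "Cafe", "-4.00")],
    seen,
)
print(len(second), len(third))  # → 1 1
```

The second import shows the benefit (the late 1 Jan record is imported), while the third shows the drawback raised here: two real but identical charges hash to the same ID, so only one survives.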

simonmichael avatar Nov 02 '22 17:11 simonmichael

#1960 has a script which could handle this under certain conditions.

simonmichael avatar Dec 13 '22 15:12 simonmichael

The script in question: https://git.sr.ht/~breatheoutbreathein/hledger-import-new-xacts

josephmturner avatar Dec 14 '22 00:12 josephmturner

Thanks @josephmturner for the script. I like the idea of this script. I haven't tried it yet because I don't have separate journals for each account. But if I am unable to find any other way around this problem I will try this.

@simonmichael, for the scenarios where using the above script is not feasible, would it be possible to add a flag to the import command that forces it to generate a unique ID for each transaction, like I mentioned in the issue description above? If a duplicate ID is found, it can stop and ask the user whether that transaction should be imported (maybe also show what the duplicate transactions are).

Btw, do we know how other budgeting systems like GnuCash or ledger solve this problem?

mtbar131 avatar Jan 21 '23 21:01 mtbar131

@mtbar131 I'd be happy to test any experiment like that, or review any written functional specification, if someone wants to work on it.

simonmichael avatar Jan 22 '23 00:01 simonmichael

Ledger generates and saves a UUID for each transaction, from the CSV record, and skips those transactions when it sees them again, as you are wishing for. Testing this with Ledger in various situations (similar transactions, identical repeated transactions, consecutively or far apart, real world data...) could help validate the approach and save some time. Or we can do that with our own PR. It'll need some thought and UX around when each approach is enabled and if/how they should interact.

https://www.ledger-cli.org/3.0/doc/ledger3.html#The-convert-command

simonmichael avatar Jan 22 '23 00:01 simonmichael

> Ledger generates and saves a UUID for each transaction, from the CSV record, and skips those transactions when it sees them again

Interesting! The manual says it compares checksums of each CSV line. I wonder if Ledger does something additional in order to handle dupes in the CSV input.

Either way, IIUC this approach would not work for me, since my bank often retroactively changes certain CSV fields. Some of those fields can be excluded by a preprocessor, but sometimes the description is changed.
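To illustrate the concern, here is a small sketch under assumed field names (`date`, `description`, `amount`, `status` are hypothetical, as is the helper `record_checksum`). Excluding a volatile field from the checksum keeps it stable, but a retroactive change to the description still produces a different checksum, so the record would look new.

```python
import hashlib

def record_checksum(record, fields):
    """Checksum over the selected fields only (hypothetical helper)."""
    key = "\x1f".join(record[f] for f in fields)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

original = {
    "date": "2022-01-01",
    "description": "BOOKSHOP 1234",
    "amount": "-12.50",
    "status": "pending",
}
# The bank later flips the status, and in another case rewrites the description.
revised_status = dict(original, status="posted")
revised_desc = dict(original, description="BOOKSHOP LONDON GB")

stable_fields = ["date", "description", "amount"]  # "status" excluded by the preprocessor

status_change_ignored = (
    record_checksum(original, stable_fields)
    == record_checksum(revised_status, stable_fields)
)
description_change_detected = (
    record_checksum(original, stable_fields)
    != record_checksum(revised_desc, stable_fields)
)
print(status_change_ignored, description_change_detected)  # → True True
```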

josephmturner avatar Jan 22 '23 08:01 josephmturner

I am facing exactly the same issue. The workaround I'm using so far is preprocessing the CSV files to add an import "delay"; e.g.,

awk -F, -v cutoff_date="$(date -d '5 days ago' '+%Y-%m-%d')" 'NR==1 || $1 < cutoff_date'

This means some transactions are imported only after ~a week, even if already posted, but that's good enough for me.
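For anyone who prefers not to rely on awk and GNU `date`, the same delay filter could be sketched in Python roughly as follows. It makes the same assumptions as the one-liner: a header row, and an ISO-formatted (YYYY-MM-DD) date in the first column. The function name `drop_recent` and the sample data are hypothetical.

```python
import csv
import io
from datetime import date, timedelta

def drop_recent(csv_text, today, delay_days=5):
    """Keep the header plus rows dated strictly before (today - delay_days)."""
    cutoff = (today - timedelta(days=delay_days)).isoformat()
    rows = list(csv.reader(io.StringIO(csv_text)))
    # ISO dates compare correctly as plain strings.
    kept = rows[:1] + [r for r in rows[1:] if r and r[0] < cutoff]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(kept)
    return out.getvalue()

sample = (
    "date,description,amount\n"
    "2025-09-10,Groceries,-20.00\n"
    "2025-09-21,Fuel,-35.00\n"
)
filtered = drop_recent(sample, date(2025, 9, 23))
print(filtered)
```

With a cutoff of 2025-09-18, only the header and the 2025-09-10 row remain; the 2025-09-21 row would be picked up by a later import once it is more than five days old.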

If a "date posted" or "date processed" field is available in your CSV, another option is to use that column (instead of the regular transaction date field) as the date field when importing. Two caveats:

  1. This only works if the "stable chronological order" holds for this column in your CSV, e.g., if all pending transactions to be posted on a given day are processed overnight.
  2. This means that the transactions also appear with the posting date (instead of the transaction date) in your journal.

@simonmichael - As far as I can tell there's currently no way to work around limitation (2) above; i.e., there's no way to use a transaction date different from what's put in the .latest file for deduplication; is that correct?

lfos avatar Sep 23 '25 13:09 lfos

I also wrote up a concrete proposal for how this could be improved in hledger with a few relatively small backwards-compatible changes. I put it in a separate ticket, #2464, as it's quite lengthy and I didn't want to mix discussion of the general request with comments on the proposal.

lfos avatar Sep 23 '25 17:09 lfos

@lfos It's hard to think through that one in detail right now, but I think you're right. What's saved in the .latest file is an actual transaction date extracted from a parsed/generated journal.

simonmichael avatar Sep 23 '25 18:09 simonmichael