
Don't merge deduped entries into existing_entries.

Open hoostus opened this issue 11 months ago • 8 comments

Merging pollutes existing_entries for hooks that run later. Instead, we should track the new entries separately but consider the union of existing_entries and new_entries when running the deduplication logic, since we also want to remove duplicates produced by other importers in this same run.
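A minimal sketch of the flow this suggests (hypothetical driver loop and names, not the actual beangulp API): the union is consulted only while deduplicating, and existing_entries is left untouched for the hooks.

```python
def extract_all(importers, existing_entries):
    """Hypothetical driver: dedupe each importer's output against the
    existing entries plus everything extracted so far, without ever
    mutating existing_entries itself."""
    seen = set(existing_entries)  # union view, used only for dedup
    results = []
    for name, entries in importers:
        deduped = [e for e in entries if e not in seen]
        seen.update(deduped)
        results.append((name, deduped))
    # hooks later receive the pristine existing_entries, not `seen`
    return results
```

Entries are represented here as plain hashable values for brevity; the real directives would need a comparison key instead of set membership.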

hoostus avatar Mar 06 '25 07:03 hoostus

Agreed -- I came to report this issue as well because of an effect I saw with smart_importer.

When new entries are merged with existing ones, the combined set is passed to smart_importer as training data. This includes single-leg entries, which shouldn’t be considered valid training examples. As a result, smart_importer is producing single-leg entries in my journal, since it learns from these incomplete examples.
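For what it's worth, this can be worked around on the training side by filtering incomplete entries before they reach the predictor. A sketch with stand-in types (`filter_training_data` is a hypothetical helper; real beancount directives carry more fields):

```python
from collections import namedtuple

# Stand-ins for beancount directives; the real types have more fields.
Posting = namedtuple("Posting", "account units")
Transaction = namedtuple("Transaction", "date narration postings")

def filter_training_data(entries):
    """Keep only complete-looking transactions (two or more postings) so a
    classifier never learns from single-leg examples."""
    return [e for e in entries
            if isinstance(e, Transaction) and len(e.postings) >= 2]
```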

hunner avatar Jun 12 '25 15:06 hunner

I see the problem. I'm not 100% sure this should be the desired behavior, though; the prior behavior seems reasonable to me. But there's an order-dependence issue this creates if you wanted it that way: depending on the order in which you process the importers, you would get different results.

blais avatar Jun 12 '25 16:06 blais

> there's an order-dependence issue this creates if you wanted it that way. Depending on the order in which you process the importers, you would get different results.

I'm not sure what you mean exactly, but entries are sorted before being deduplicated (https://github.com/beancount/beangulp/blob/35fea180ed65dbe9113e1566c88ba34b38d1fcb5/beangulp/extract.py#L64-L114), so as long as you process the same files, the output is deterministic.
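The determinism argument in a nutshell: once entries are sorted by a stable key, a scan-and-drop dedup no longer depends on the order the inputs arrived in. A toy illustration (the real sorting key in beangulp/extract.py is more elaborate; this only shows the idea):

```python
import datetime
from collections import namedtuple

Entry = namedtuple("Entry", "date narration")

def dedup_sorted(entries):
    """Sort by a stable key, then drop repeats; the output is identical
    for any permutation of the input."""
    out, seen = [], set()
    for e in sorted(entries, key=lambda e: (e.date, e.narration)):
        key = (e.date, e.narration)
        if key not in seen:
            seen.add(key)
            out.append(e)
    return out
```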

dnicolodi avatar Jun 12 '25 17:06 dnicolodi

> Depending on the order in which you process the importers, you would get different results.

> I'm not sure what you mean exactly, but entries are sorted before being deduplicated. So, as long as you process the same files, the output is deterministic.

I guess if the same input files are kept but the order of importers is changed, then that affects the existing_entries that each successive importer.deduplicate() call in the deduplication for-loop sees.

> I'm not 100% sure this should be the desired behavior, though; the prior behavior seems reasonable to me

I would definitely defer to your experience (and thank you soooo much for all your time and effort in creating beancount!) but would like to understand more. It looks like the original idea for feeding the already-evaluated entries into the existing_entries came from #9 to allow bank transfer transactions to only have one side marked as a duplicate during multi-file imports. It was fixed in #64 and updated to deduplicate in-place in #114.

There are two areas that this code impacts: importers and hooks.

For hooks, it does seem like adding the new entries to the existing_entries list before calling the hooks misrepresents what the user has already touched vs. what the hooks should be processing. It would be nice to keep the distinction clear.

For importer.deduplicate(), the distinction between existing entries and new entries doesn't matter so much, since it is only concerned with deduplicating its own entries against all other entries, new or not.

The current diff proposes keeping a record of all observed entries during deduplication for each importer's deduplicate method to evaluate, then throwing that record away instead of reusing it as existing_entries for hooks. Because extract.mark_duplicate_entries() now marks the entries as duplicate in-place, the hooks will receive a list of new entries with duplicates already recorded. I don't think any info is lost, and if a hook wants to have the union of existing entries and new entries then it can iterate over the new entries and extend existing_entries and be back to the current functionality.
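Concretely, a hook that still wants the old combined view can rebuild it in one pass. This assumes the extracted list holds (filename, entries, account, importer) tuples, which matches what beangulp hooks currently receive; treat the exact shape as an assumption:

```python
def my_hook(extracted, existing_entries):
    """Hypothetical hook that reconstructs the pre-change 'union' view
    on its own, without beangulp mutating existing_entries for it."""
    combined = list(existing_entries)
    for filename, entries, account, importer in extracted:
        combined.extend(entries)
    # ... operate on `combined` as the old behavior effectively allowed ...
    return extracted
```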

The original intention of multi-file import duplicate detection still works with this proposed change, and the arguments passed to hooks are correctly named.

hunner avatar Jun 12 '25 20:06 hunner

> I guess if the same input files are kept but the order of importers is changed, then that affects the existing_entries that each successive importer.deduplicate() call in the deduplication for-loop sees.

I guess you didn't read the large comment in the code that I linked that describes the sorting key used.

dnicolodi avatar Jun 12 '25 23:06 dnicolodi

> the prior behavior seems reasonable to me

I dunno, it means that hooks are being passed an invalid beancount file as "existing entries", which is at least slightly surprising. I don't think it is unreasonable that hooks would expect the existing entries to be something that would pass bean-check. If hooks are expected to deal with malformed beancount data then we should probably clarify what variants of well-formed beancount syntax they might be expected to deal with in the "existing entries" so they can write appropriately defensive code.

I'm not exactly averse to any of that -- importing stuff is kind of a weird liminal state instead of fully baked -- but, as someone who wrote a hook, I was caught out by beangulp breaking the (admittedly somewhat implicit) contract that "existing entries" were well-formed from a bean-check kind of point of view. I expected it in the imported data but not the existing data.
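If hooks do have to tolerate such input, the defensive code amounts to a pre-filter on the entries that carry a duplicate marker. A sketch with a stand-in entry type; the metadata key name shown is the one beangulp uses for marking duplicates, but treat it as an assumption:

```python
from collections import namedtuple

# Stand-in directive with a meta dict, like beancount entries carry.
Entry = namedtuple("Entry", "meta date narration")

DUPLICATE_META = "__duplicate__"  # assumed marker key for flagged duplicates

def drop_flagged_duplicates(entries):
    """Defensive pre-filter a hook can apply if duplicate-marked entries
    may appear in its 'existing entries' input."""
    return [e for e in entries if not e.meta.get(DUPLICATE_META, False)]
```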

hoostus avatar Jun 13 '25 00:06 hoostus

Thanks for spotting this issue and sorry for not replying on the PR earlier: I had to find a bit of time to check what the code does against what it is supposed to be doing and what it is sensible to do. I agree that the current behavior is a bug. Newly extracted entries should not be appended to the existing entries before these are passed to the hook functions.

I have only one issue with the proposed fix: it repeatedly concatenates possibly very long lists. I need to think a bit more about it, but I think that the deduplication process should be additive, namely deduplicate(A, B + C) should give the exact same result as deduplicate(A, B); deduplicate(A, C), where the first argument is the list of new entries and the second argument is the list of existing entries. Thus it may be more efficient to run consecutive dedup() calls than to concatenate and sort lists.
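The additivity property described here, stated on a toy dedup that simply drops new entries already present in the existing list (the real beangulp comparison is fuzzier; this only demonstrates the algebra):

```python
def deduplicate(new, existing):
    """Toy dedup: drop new entries that already appear in `existing`."""
    existing_set = set(existing)
    return [e for e in new if e not in existing_set]

# Additivity: deduplicating against B + C in one shot should equal
# deduplicating against B and then against C in consecutive calls,
# which avoids building and sorting one large concatenated list.
def deduplicate_consecutive(new, b, c):
    return deduplicate(deduplicate(new, b), c)
```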

dnicolodi avatar Jun 13 '25 12:06 dnicolodi

This really needs to be fixed. I independently came to the same realization: the current behavior makes smart_importer completely useless: https://github.com/beancount/smart_importer/issues/149.

sim642 avatar Jun 21 '25 17:06 sim642