invoice2data

templates: de: improve parsing QualityHosting lines

Open rmilecki opened this issue 3 years ago • 4 comments

A single QualityHosting invoice position spans multiple lines.
For that reason the template uses a very generic RegEx for middle lines:
line: '^\s+(?P<desc>.+)$'

That doesn't work well with multi-page invoices, because the above
RegEx also matches page footer lines. As a result, footer content gets
extracted as the invoice line "desc".

Improve the situation by adding a "last_line" RegEx that matches the last
line of a position. That prevents parsing the lines between one position's
last line and the next position's first line (e.g. footer content).
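
To make the effect concrete, here is a minimal sketch of the idea in Python. It is not invoice2data's parser: `group_positions` is a hypothetical, simplified loop, the sample page text is invented, and the date-range pattern used as `last_line` is an assumption based on the pattern discussed in this thread.

```python
import re

# Patterns taken from the thread; LAST_LINE is assumed to be the date-range line.
FIRST_LINE = re.compile(
    r'\s+(?P<pos>\d+)\s+(?P<qty>\d+)\s+(?P<desc>.{,70})\s+(?P<price>\d+,\d+)'
)
LINE = re.compile(r'^\s+(?P<desc>.+)$')
LAST_LINE = re.compile(r'^\s+(?P<desc>\d\d\.\d\d\.\d\d-\d\d\.\d\d\.\d\d)$')

# Invented sample text: two positions separated by a page footer.
PAGE = [
    "   1     2  Hosted Exchange Plan                 10,00",
    "             Mailbox for example.org",
    "             01.09.22-30.09.22",
    "",
    "   Example Hosting AG   Sample Street 1   Page 1 of 2",
    "   2     1  Domain example.org                     5,00",
    "             01.09.22-30.09.22",
]


def group_positions(text_lines, use_last_line):
    """Hypothetical grouping loop: collect the desc lines of each position."""
    positions = []
    current = None  # desc lines of the position being read, if any
    for text in text_lines:
        first = FIRST_LINE.search(text)
        if first:
            current = [first.group("desc").strip()]
            positions.append(current)
            continue
        if current is None:
            continue  # outside any position: ignore the line
        last = LAST_LINE.match(text) if use_last_line else None
        if last:
            current.append(last.group("desc"))
            current = None  # position closed; skip lines until the next first_line
            continue
        middle = LINE.match(text)
        if middle:
            current.append(middle.group("desc"))
    return positions


without_last = group_positions(PAGE, use_last_line=False)
with_last = group_positions(PAGE, use_last_line=True)
print(without_last[0])  # includes the footer text as a desc line
print(with_last[0])     # stops at the date-range line
```

In this sketch the footer only leaks into the first position when no closing pattern is used, which is the behaviour the pull request aims to avoid.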

rmilecki, Oct 08 '22 20:10

So what do we do about this pull request?

It received a minor cross-pull-request comment in #417:

Maybe better to leave suboptimal tests and examples in this library, just as a showcase. (Same goes for the OCR examples in this repo.) It is definitely helping us to find these corner cases.

As de.qualityhosting.yml is a template for actual (real-life) invoices I really think we should fix it. By that I mean accepting this pull request.

If we need to test some corner cases - that can be done using custom templates & tests. I added support for such in the just-merged #414.

rmilecki, Oct 15 '22 19:10

So what do we do about this pull request?

I'd suggest we let this one sit here for a moment. At least until we've sorted out the multiline parsing.

Went through the re lib docs. I need some time to test and verify things. Will report back in #417.

bosd, Oct 16 '22 13:10

As mentioned in #417, I agree to update this template, possibly to include only lines with 30 or more whitespace characters in front: ^\s{30,}

For this particular invoice we could make the last_line specific (or leave it very generic, like before this PR). It is just a question of what kind of example we would like to include.

As per your previous suggestion: '^\s+(?P<desc>\d\d\.\d\d\.\d\d-\d\d\.\d\d\.\d\d)$' I would like to propose changing it to '^\s{30,}(?P<desc>\d{2}[.]\d{2}[.]\d{2}-\d{2}[.]\d{2}[.]\d{2})$'. I rewrote each \. as [.], since avoiding a plain . is best practice, performance-wise.

(I quickly drafted this; it needs testing.)
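
As a starting point for that testing, a small check along these lines could compare the escaped-dot and character-class spellings. It is only a sketch: both patterns are written here with the date range in the form dd.dd.dd-dd.dd.dd, following the earlier suggestion, and the sample strings are invented because the real invoice text is not shown in this thread.

```python
import re

# Same pattern in two spellings: "\." versus "[.]" for a literal dot.
ESCAPED = re.compile(
    r'^\s{30,}(?P<desc>\d{2}\.\d{2}\.\d{2}-\d{2}\.\d{2}\.\d{2})$'
)
CHAR_CLASS = re.compile(
    r'^\s{30,}(?P<desc>\d{2}[.]\d{2}[.]\d{2}-\d{2}[.]\d{2}[.]\d{2})$'
)

# Invented sample lines.
SAMPLES = {
    "indented 30": " " * 30 + "01.09.22-30.09.22",
    "indented 10": " " * 10 + "01.09.22-30.09.22",
    "dot before hyphen": " " * 30 + "01.09.22.-30.09.22",
    "not literal dots": " " * 30 + "01x09x22-30x09x22",
}

results = {
    name: (bool(ESCAPED.match(text)), bool(CHAR_CLASS.match(text)))
    for name, text in SAMPLES.items()
}
for name, outcome in results.items():
    print(name, outcome)
# Only "indented 30" matches, and both spellings agree on every sample.
```

Both spellings accept and reject the same samples here, so the choice between them does not change what the template extracts; the `\s{30,}` prefix is what excludes the less-indented line.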

bosd, Oct 19 '22 12:10

I used this version of the template to check the code of #417. It is intentionally without last_line, because the purpose of the check was to see whether a matched line is added to the output without the last_line key.

issuer: QualityHosting AG
fields:
  amount: Total EUR\s+(\d+,\d+)
  amount_untaxed: Total EUR\s+(\d+,\d+)
  date:
    - \s{2,}(\d+\. .+ \d{4})\s{2,}
    - Rechnungsdatum\s+(\w+ \d+, \d{4})
  invoice_number: Rechnungsnr\.\s+(\d{8})
  vat: DE 232 446 240
lines:
  start: 'Contract No. \w+'
  end: 'Total EUR'
  first_line: '\s+(?P<pos>\d+)\s+(?P<qty>\d+)\s+(?P<desc>.{,70})\s+(?P<price>\d+,\d+)'
  line: '^\s{30}(?P<desc>.{5,30})$'
  types:
    qty: float
    price: float
keywords:
- QualityHosting
options:
  currency: EUR
  decimal_separator: ","
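
For reference, the `line` pattern in this version can be tried in isolation with Python's `re` module. This is a sketch on invented sample lines, not a test against the real invoice, and `desc_of` is a hypothetical helper.

```python
import re

# The `line` pattern from the template above.
LINE = re.compile(r'^\s{30}(?P<desc>.{5,30})$')


def desc_of(text):
    """Return the captured desc, or None when the line does not match."""
    match = LINE.match(text)
    return match.group("desc") if match else None


date_range = desc_of(" " * 30 + "01.09.22-30.09.22")
footer = desc_of("   Example Hosting AG   Sample Street 1   Page 1 of 2")
too_long = desc_of(" " * 30 + "x" * 31)
deeper_indent = desc_of(" " * 34 + "Mailbox")

print(repr(date_range))     # '01.09.22-30.09.22'
print(repr(footer))         # None: fewer than 30 leading whitespace characters
print(repr(too_long))       # None: more than 30 characters after the indent
print(repr(deeper_indent))  # '    Mailbox': extra indentation ends up in desc
```

Because `\s{30}` matches exactly 30 characters and `.` also matches spaces, lines indented by more than 30 characters still match, with the surplus whitespace captured in `desc`.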

@rmilecki Is it ok for you to use this version?

bosd, Oct 21 '22 19:10