hledger icon indicating copy to clipboard operation
hledger copied to clipboard

CSV files, variable number of fields

Open PSLLSP opened this issue 1 year ago • 4 comments

hledger 1.32.3

Current hledger can process only well formed CSV files. It can skip several lines at the start of the file. And a rule end can be used to force skip to the rest of the current file.

I would like to have an option to process CSV files with variable number of fields, like that when field is missing, hledger will not end with an error because field is missing.

  • there could be an option to define minimal number of mandatory fields
  • new variable will be defined, that will keep number of fields in the current record; awk has variable NF, something like that

There is rule skip in if block statement, this could be used to skip records that doesn't have correct number of fields or take an action on these records, like adding missing information.


I have CSV records, those have several header lines followed by CSV records those have data I want to process. I can design hledger rule file that process one such file but I cannot design rule file that process those files when joined with cat *.csv > import.csv.

Example:

$ cat import.csv
statement: 2023/01
account: 123456789/4321
owner: John Smith
start date: 1.1.2023
end date: 31.1.2023
opening balance:  100.00 EUR
closing balance:  100.50 EUR

"ID";"Date";"Volume";"Currency";"Description";"Note"
"54321";"31.01.2023";"0.5";"EUR";"Interest";""

statement: 2023/02
account: 123456789/4321
owner: John Smith
start date: 1.2.2023
end date: 28.2.2023
opening balance:  100.50 EUR
closing balance:  101.00 EUR

"ID";"Date";"Volume";"Currency";"Description";"Note"
"56324";"28.02.2023";"0.5";"EUR";"Interest";""

I would like to define that no fields are mandatory (to be able to skip empty lines) I would like to use if block rule to:

  • skip records/lines those do not have 6 fields
  • skip records where field1 (%1) has value ID - to ignore headers
  • process the rest like regular CSV records

I tried to design import rules for simplified case (all records have 6 fields) and it can be done. I only miss an option to compare number of CSV fields, so I have to use a workarround to find valid CSV records

CSV file for test:

$ cat test-import1.csv
"ID";"Date";"Volume";"Currency";"Description";"Note"
"54321";"31.01.2023";"0.5";"EUR";"Interest";""

"ID";"Date";"Volume";"Currency";"Description";"Note"
"56324";"28.02.2023";"0.6";"EUR";"Interest";""

"ID";"Date";"Volume";"Currency";"Description";"Note"
"56327";"31.03.2023";"0.7";"";"Interest";"CURRENCY is missing"

import rules:

$ cat test-import1.csv.rules
# hledger import rules

skip 0
separator ;

fields f1,f2,f3,f4,f5,f6

# debug; it prints only records those were accepted and included in the journal...
comment \n%f1;%f2;%f3;%f4;%f5;%f6

# define date format
date-format %d.%m.%Y

# skip header record
if %f1 ID
  skip

# skip records where "volume" field is empty.
# just a test that record can be ignored
if %f3 ^$
  skip

# other test for empty field.
#if ! %f3 .
#  skip

code  %f1
date  %f2
amount  %f3 %f4
description %f5 - %f6

# Currency is missing, define default value
# variable cannot be redefined :-( (%f4 USD)
# workarround - define currency or amount, like "amount %f3 USD"
if %f4 ^$
  amount %f3 USD
$ hledger -f test-import1.csv print
2023-01-31 (54321) Interest -
    ; 54321;31.01.2023;0.5;EUR;Interest;
    expenses:unknown         0.5 EUR
    income:unknown          -0.5 EUR

2023-02-28 (56324) Interest -
    ; 56324;28.02.2023;0.6;EUR;Interest;
    expenses:unknown         0.6 EUR
    income:unknown          -0.6 EUR

2023-03-31 (56327) Interest - CURRENCY is missing
    ; 56327;31.03.2023;0.7;;Interest;CURRENCY is missing
    expenses:unknown         0.7 USD
    income:unknown          -0.7 USD

It can even process file where some lines are not valid CSV records:

$ cat test-import2.csv
statement: 2023/01
account: 123456789/4321
owner: John Smith
start date: 1.1.2023
end date: 31.1.2023
opening balance:  100.00 EUR
closing balance:  100.50 EUR

"ID";"Date";"Volume";"Currency";"Description";"Note"
"54321";"31.01.2023";"0.5";"EUR";"Interest";""

statement: 2023/02
account: 123456789/4321
owner: John Smith
start date: 1.2.2023
end date: 28.2.2023
opening balance:  100.50 EUR
closing balance:  101.00 EUR

"ID";"Date";"Volume";"Currency";"Description";"Note"
"56324";"28.02.2023";"0.5";"EUR";"Interest";""
$ ln -sf test-import1.csv.rules test-import2.csv.rules
$ hledger -f test-import2.csv print
2023-01-31 (54321) Interest -
    ; 54321;31.01.2023;0.5;EUR;Interest;
    expenses:unknown         0.5 EUR
    income:unknown          -0.5 EUR

2023-02-28 (56324) Interest -
    ; 56324;28.02.2023;0.5;EUR;Interest;
    expenses:unknown         0.5 EUR
    income:unknown          -0.5 EUR

PSLLSP avatar Mar 29 '24 14:03 PSLLSP

Is it possible to check the amount of column's in a CSV file? At least for dealing with older/newer exports it could be useful.

Thaodan avatar Aug 22 '25 21:08 Thaodan

Yes, you can use a conditional's regexp to count the separator character and apply different rules based on the number of fields. Eg, for comma-separated data:

# Skip section 1's 6-field lines
if ^([^,]*,){6}$
 skip

I think you could also name all possible fields and then select based on whether rightmost fields contain a value.

simonmichael avatar Aug 23 '25 07:08 simonmichael

Does that also work with field assignments? I.e. if a newer format contains an extra field that wasn't present before.

Thaodan avatar Aug 23 '25 13:08 Thaodan

I'm not sure - try it and let us know ?

simonmichael avatar Aug 24 '25 10:08 simonmichael