CSV files, variable number of fields
hledger 1.32.3
Current hledger can process only well formed CSV files. It can skip several lines at the start of the file. And a rule end can be used to force skip to the rest of the current file.
I would like to have an option to process CSV files with variable number of fields, like that when field is missing, hledger will not end with an error because field is missing.
- there could be an option to define minimal number of mandatory fields
- new variable will be defined, that will keep number of fields in the current record;
awkhas variableNF, something like that
There is rule skip in if block statement, this could be used to skip records that doesn't have correct number of fields or take an action on these records, like adding missing information.
I have CSV records, those have several header lines followed by CSV records those have data I want to process. I can design hledger rule file that process one such file but I cannot design rule file that process those files when joined with cat *.csv > import.csv.
Example:
$ cat import.csv
statement: 2023/01
account: 123456789/4321
owner: John Smith
start date: 1.1.2023
end date: 31.1.2023
opening balance: 100.00 EUR
closing balance: 100.50 EUR
"ID";"Date";"Volume";"Currency";"Description";"Note"
"54321";"31.01.2023";"0.5";"EUR";"Interest";""
statement: 2023/02
account: 123456789/4321
owner: John Smith
start date: 1.2.2023
end date: 28.2.2023
opening balance: 100.50 EUR
closing balance: 101.00 EUR
"ID";"Date";"Volume";"Currency";"Description";"Note"
"56324";"28.02.2023";"0.5";"EUR";"Interest";""
I would like to define that no fields are mandatory (to be able to skip empty lines)
I would like to use if block rule to:
- skip records/lines those do not have 6 fields
- skip records where field1 (
%1) has valueID- to ignore headers - process the rest like regular CSV records
I tried to design import rules for simplified case (all records have 6 fields) and it can be done. I only miss an option to compare number of CSV fields, so I have to use a workarround to find valid CSV records
CSV file for test:
$ cat test-import1.csv
"ID";"Date";"Volume";"Currency";"Description";"Note"
"54321";"31.01.2023";"0.5";"EUR";"Interest";""
"ID";"Date";"Volume";"Currency";"Description";"Note"
"56324";"28.02.2023";"0.6";"EUR";"Interest";""
"ID";"Date";"Volume";"Currency";"Description";"Note"
"56327";"31.03.2023";"0.7";"";"Interest";"CURRENCY is missing"
import rules:
$ cat test-import1.csv.rules
# hledger import rules
skip 0
separator ;
fields f1,f2,f3,f4,f5,f6
# debug; it prints only records those were accepted and included in the journal...
comment \n%f1;%f2;%f3;%f4;%f5;%f6
# define date format
date-format %d.%m.%Y
# skip header record
if %f1 ID
skip
# skip records where "volume" field is empty.
# just a test that record can be ignored
if %f3 ^$
skip
# other test for empty field.
#if ! %f3 .
# skip
code %f1
date %f2
amount %f3 %f4
description %f5 - %f6
# Currency is missing, define default value
# variable cannot be redefined :-( (%f4 USD)
# workarround - define currency or amount, like "amount %f3 USD"
if %f4 ^$
amount %f3 USD
$ hledger -f test-import1.csv print
2023-01-31 (54321) Interest -
; 54321;31.01.2023;0.5;EUR;Interest;
expenses:unknown 0.5 EUR
income:unknown -0.5 EUR
2023-02-28 (56324) Interest -
; 56324;28.02.2023;0.6;EUR;Interest;
expenses:unknown 0.6 EUR
income:unknown -0.6 EUR
2023-03-31 (56327) Interest - CURRENCY is missing
; 56327;31.03.2023;0.7;;Interest;CURRENCY is missing
expenses:unknown 0.7 USD
income:unknown -0.7 USD
It can even process file where some lines are not valid CSV records:
$ cat test-import2.csv
statement: 2023/01
account: 123456789/4321
owner: John Smith
start date: 1.1.2023
end date: 31.1.2023
opening balance: 100.00 EUR
closing balance: 100.50 EUR
"ID";"Date";"Volume";"Currency";"Description";"Note"
"54321";"31.01.2023";"0.5";"EUR";"Interest";""
statement: 2023/02
account: 123456789/4321
owner: John Smith
start date: 1.2.2023
end date: 28.2.2023
opening balance: 100.50 EUR
closing balance: 101.00 EUR
"ID";"Date";"Volume";"Currency";"Description";"Note"
"56324";"28.02.2023";"0.5";"EUR";"Interest";""
$ ln -sf test-import1.csv.rules test-import2.csv.rules
$ hledger -f test-import2.csv print
2023-01-31 (54321) Interest -
; 54321;31.01.2023;0.5;EUR;Interest;
expenses:unknown 0.5 EUR
income:unknown -0.5 EUR
2023-02-28 (56324) Interest -
; 56324;28.02.2023;0.5;EUR;Interest;
expenses:unknown 0.5 EUR
income:unknown -0.5 EUR
Is it possible to check the amount of column's in a CSV file? At least for dealing with older/newer exports it could be useful.
Yes, you can use a conditional's regexp to count the separator character and apply different rules based on the number of fields. Eg, for comma-separated data:
# Skip section 1's 6-field lines
if ^([^,]*,){6}$
skip
I think you could also name all possible fields and then select based on whether rightmost fields contain a value.
Does that also work with field assignments? I.e. if a newer format contains an extra field that wasn't present before.
I'm not sure - try it and let us know ?