jayvee
[BUG] Cannot parse CSV with newlines
Steps to reproduce
- Create the file data_with_newline.csv with this content:
C1,C2,C3
2,"some
text",true
- Run this model with debug output, using jv pipeline.jv -d -dg exhaustive:
pipeline Pipeline {
  Extractor
    -> ToTextFile
    -> ToCSV
    -> ToTable
    -> Loader;

  block Extractor oftype LocalFileExtractor {
    filePath: "./data_with_newline.csv";
  }

  block ToTextFile oftype TextFileInterpreter { }

  block ToCSV oftype CSVInterpreter {
    enclosing: '"';
  }

  block ToTable oftype TableInterpreter {
    header: true;
    columns: [
      "C1" oftype integer,
      "C2" oftype text,
      "C3" oftype boolean,
    ];
  }

  block Loader oftype SQLiteLoader {
    table: "Data";
    file: "./Data.sqlite";
  }
}
Description
- Expected: The model parses the CSV correctly, without errors.
- Actual: The TextFileInterpreter splits the quoted value "some text", which contains a line break, into two distinct lines and passes them to the CSVInterpreter, which then fails with a parse error:
Found 1 pipelines to execute: Pipeline
[Pipeline] Overview:
Blocks (5 blocks with 1 pipes):
-> Extractor (LocalFileExtractor)
-> ToTextFile (TextFileInterpreter)
-> ToCSV (CSVInterpreter)
-> ToTable (TableInterpreter)
-> Loader (SQLiteLoader)
[Extractor] Successfully extraced file ./data_with_newline.csv
[Extractor] [Output] <hex> 43312C43322C43330A322C22736F6D650A74657874222C747275650A
[Extractor] Execution duration: 2 ms.
[ToTextFile] Decoding file content using encoding "utf-8"
[ToTextFile] Splitting lines using line break /\r?\n/
[ToTextFile] Lines were split successfully, the resulting text file has 3 lines
[ToTextFile] [Output] [Line 0] C1,C2,C3
[ToTextFile] [Output] [Line 1] 2,"some
[ToTextFile] [Output] [Line 2] text",true
[ToTextFile] Execution duration: 1 ms.
[ToCSV] Parsing raw data as CSV using delimiter ","
[ToCSV] Execution duration: 4 ms.
error: CSV parse failed in line 2: Parse Error: missing closing: '"' in line: at '"some'
$In /home/jonas/Code/uni/hiwi/jayvee/pipeline.jv:20:8
20 | block ToCSV oftype CSVInterpreter {
   |       ^^^^^
[ToCSV] Execution duration: 8 ms.
Additional Notes
The library we use for CSV parsing, fast-csv, could parse the newline correctly if it received the input data before it is split into lines.
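The effect can be reproduced outside of Jayvee. The following is a minimal sketch, not Jayvee's implementation: it uses Python's standard csv module as a stand-in for fast-csv (an assumption made only for illustration), and the function names parse_after_splitting and parse_whole_text are hypothetical. The input bytes are taken from the [Extractor] hex dump above.

```python
import csv
import io
import re

# Raw file bytes, copied from the [Extractor] hex dump in the debug output.
RAW_HEX = "43312C43322C43330A322C22736F6D650A74657874222C747275650A"
text = bytes.fromhex(RAW_HEX).decode("utf-8")


def parse_after_splitting(content: str) -> list[list[str]]:
    """Split on line breaks first, then parse each line as its own record."""
    lines = re.split(r"\r?\n", content)
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty entry after the final line break
    # strict=True makes the parser raise on an unterminated quoted field
    return [next(csv.reader([line], strict=True)) for line in lines]


def parse_whole_text(content: str) -> list[list[str]]:
    """Hand the unsplit text to the parser so quoted line breaks survive."""
    return list(csv.reader(io.StringIO(content, newline=""), strict=True))


try:
    parse_after_splitting(text)
except csv.Error as err:
    # The line 2,"some ends inside an open quote, so parsing it alone fails.
    print("split first:", err)

# The quoted field keeps its embedded line break as part of the value.
print("whole text:", parse_whole_text(text))
```

In this sketch the split-first variant fails on the second line, much like the CSVInterpreter error above, while the whole-text variant returns two records with the line break preserved inside the C2 value. This suggests the line splitting would have to be quote-aware, or be deferred until after CSV parsing.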