[BUG] Cannot parse CSV with newlines

Open jrentlez opened this issue 1 year ago • 0 comments

Steps to reproduce

  1. create file data_with_newline.csv:
C1,C2,C3
2,"some
text",true
  2. run this model with debug output (jv pipeline.jv -d -dg exhaustive):
pipeline Pipeline {

	Extractor
		-> ToTextFile
		-> ToCSV
		-> ToTable
		-> Loader;


	block Extractor oftype LocalFileExtractor {
		filePath: "./data_with_newline.csv";
	}

	block ToTextFile oftype TextFileInterpreter { }

	block ToCSV oftype CSVInterpreter {
		enclosing: '"';
	}

	block ToTable oftype TableInterpreter {
		header: true;
		columns: [
			"C1" oftype integer,
			"C2" oftype text,
			"C3" oftype boolean,
		];
	}

	block Loader oftype SQLiteLoader {
		table: "Data";
		file: "./Data.sqlite";
	}
}

Description

  • Expected: The model parses the CSV correctly, without errors.
  • Actual: The TextFileInterpreter splits the quoted field "some text" at its embedded newline into distinct lines and passes them to CSVInterpreter, which then fails with a missing closing quote error.
Found 1 pipelines to execute: Pipeline
[Pipeline] Overview:
        Blocks (5 blocks with 1 pipes):
         -> Extractor (LocalFileExtractor)
                 -> ToTextFile (TextFileInterpreter)
                         -> ToCSV (CSVInterpreter)
                                 -> ToTable (TableInterpreter)
                                         -> Loader (SQLiteLoader)

        [Extractor] Successfully extraced file ./data_with_newline.csv
        [Extractor] [Output] <hex> 43312C43322C43330A322C22736F6D650A74657874222C747275650A
        [Extractor] Execution duration: 2 ms.
        [ToTextFile] Decoding file content using encoding "utf-8"
        [ToTextFile] Splitting lines using line break /\r?\n/
        [ToTextFile] Lines were split successfully, the resulting text file has 3 lines
        [ToTextFile] [Output] [Line 0] C1,C2,C3
        [ToTextFile] [Output] [Line 1] 2,"some
        [ToTextFile] [Output] [Line 2] text",true
        [ToTextFile] Execution duration: 1 ms.
        [ToCSV] Parsing raw data as CSV using delimiter ","
        [ToCSV] Execution duration: 4 ms.
        error: CSV parse failed in line 2: Parse Error: missing closing: '"' in line: at '"some'
        $In /home/jonas/Code/uni/hiwi/jayvee/pipeline.jv:20:8
        20 |     block ToCSV oftype CSVInterpreter {
           |           ^^^^^

        [ToCSV] Execution duration: 8 ms.
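
The split seen in the log can be reproduced outside of Jayvee. The following is a minimal sketch, not Jayvee's actual implementation: it decodes the hex dump from the [Extractor] output and applies the same line-break pattern /\r?\n/. The helper name decodeHex is made up for this illustration, and dropping the trailing empty line is an assumption made to match the "3 lines" reported by ToTextFile.

```typescript
// Hex dump copied from the [Extractor] debug output above.
const hex = "43312C43322C43330A322C22736F6D650A74657874222C747275650A";

// Hypothetical helper: decode pairs of hex digits into characters.
// This is sufficient here because the payload is plain ASCII.
function decodeHex(input: string): string {
  let out = "";
  for (let i = 0; i < input.length; i += 2) {
    out += String.fromCharCode(parseInt(input.slice(i, i + 2), 16));
  }
  return out;
}

const content = decodeHex(hex);

// Same pattern as in the log. The empty string produced by the final
// newline is dropped (assumption, to match the reported line count).
const lines = content
  .split(/\r?\n/)
  .filter((line, i, all) => !(i === all.length - 1 && line === ""));

// The quoted field is cut in two: line 1 ends inside the quotes,
// line 2 starts with the rest of the field.
console.log(lines);
```

The second and third entries are `2,"some` and `text",true`, so a CSV parser that receives the second entry on its own sees an opening quote without a closing one.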

Additional Notes

The library we use for CSV parsing, fast-csv, could parse the newline correctly if it received the input data before it is split into lines.
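
To illustrate why the order of splitting and parsing matters, here is a minimal sketch of record splitting that respects the enclosing character. It is neither fast-csv's API nor a proposed Jayvee change; the function splitRecords and its parameters are hypothetical and only show the idea that a line break inside a quoted field must not end a record.

```typescript
// Hypothetical helper: split raw CSV text into records, treating line
// breaks inside a quoted field as part of that field.
function splitRecords(text: string, enclosing: string = '"'): string[] {
  const records: string[] = [];
  let current = "";
  let inQuotes = false;

  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    if (ch === enclosing) {
      // An escaped quote ("") toggles twice, so the state is unchanged.
      inQuotes = !inQuotes;
      current += ch;
    } else if (!inQuotes && (ch === "\n" || (ch === "\r" && text.charAt(i + 1) === "\n"))) {
      if (ch === "\r") i++; // skip the "\n" of a "\r\n" pair
      records.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  // Keep a last record that is not terminated by a line break.
  if (current !== "") records.push(current);
  return records;
}

// Content of the file from the steps to reproduce.
const raw = 'C1,C2,C3\n2,"some\ntext",true\n';
const records = splitRecords(raw);
console.log(records.length); // 2
```

With this input the second record is `2,"some\ntext",true` in one piece, so the quoted field stays intact for the CSV parser.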

jrentlez, Jul 30 '24 14:07