Add comprehensive CSV error handling with recovery strategies
# Comprehensive CSV Error Handling
## Problem
The current CSV parser lacks robust error handling capabilities, which causes processing to fail on malformed data commonly found in real-world scenarios.
## Solution
This PR adds a comprehensive error handling system with:
- A structured hierarchy of error types to provide detailed diagnostics
- Recovery strategies to continue processing despite errors
- Pattern detection to identify systematic issues in data
## Implementation Details
- Created new `CsvError` trait with specific error subtypes
- Added recovery-enabled parser with configurable strategies
- Implemented comprehensive test suite to verify functionality
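To make the shape of this concrete, here is a minimal sketch of what such an error hierarchy and strategy interface might look like. All names (`CsvError`, `RecoveryStrategy`, `PadWithNulls`, `Truncate`) follow the description above, but the code itself is an assumption for illustration, not the PR's actual implementation:

```scala
// Sketch only: names mirror the PR description, bodies are illustrative.
sealed trait CsvError:
  def line: Int
  def message: String

// One concrete diagnostic: a row with the wrong number of fields.
final case class FieldCountMismatch(line: Int, expected: Int, actual: Int) extends CsvError:
  def message: String = s"line $line: expected $expected fields, found $actual"

trait RecoveryStrategy:
  /** Attempt to repair a row to `expected` fields; None means "this strategy cannot help". */
  def recover(row: Vector[String], expected: Int): Option[Vector[String]]

// Pad short rows with empty fields.
object PadWithNulls extends RecoveryStrategy:
  def recover(row: Vector[String], expected: Int): Option[Vector[String]] =
    Option.when(row.size < expected)(row ++ Vector.fill(expected - row.size)(""))

// Drop surplus fields from long rows.
object Truncate extends RecoveryStrategy:
  def recover(row: Vector[String], expected: Int): Option[Vector[String]] =
    Option.when(row.size > expected)(row.take(expected))
```

Each strategy declines (returns `None`) when it does not apply, which is what makes them composable later.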
## Benefits
- Makes the library more resilient to real-world messy data
- Provides users with clear diagnostics about data quality
- Allows processing to continue despite partial errors
The key question for me here is how this could reasonably be integrated with the compile-time file-reading constraint.
I'd again love to see a test which uses this to parse a file. There are lots of examples waiting to be tested out. I'd probably merge this if we can demonstrate testing behaviour that is obviously more ergonomic than the status quo. For example, this:

```scala
test("reading data 3") {
  val csv = CSV.absolutePath(Generated.resourceDir0 + "/testFile/missing_values.csv")
  println(csv.toSeq.ptbln)
}
```
Works just fine. It leaves the user with the burden of deciding how to handle missing values... but I think maybe that is okay.
```scala
test("reading data 3") {
  val csv: CsvIterator[("Products", "Sales", "Market_Share")] = CSV.absolutePath(Generated.resourceDir0 + "/testFile/missing_values.csv")
  val csv2 = csv.mapColumn["Sales", Option[Int]](_.toIntOption)
  println(csv2.toSeq.ptbln)
}
```
Would be how I would deal with it. I'm not against this kind of error handling, but I don't know how necessary it actually is.
Hi Simon,
Thank you for your feedback on the PR! I've addressed your questions about integration with compile-time typing by implementing a two-phase approach that complements, rather than replaces, the existing Options-based handling.
The idea is:
- Phase 1: Use recovery strategies to fix structural issues before parsing
- Phase 2: Use compile-time typing with Options for clean type handling
I've added:
- A `combine` method on `RecoveryStrategy` that allows combining `PadWithNulls` and `Truncate`
- A `CsvPreprocessor` utility that demonstrates the preprocessing workflow
- A test in `testCsvRecovery.scala` that compares both approaches
The test shows how the current Options approach works well for missing values but fails with structural errors (too few/many fields), while the recovery strategies handle those cases elegantly.
This approach particularly helps with user-uploaded or external CSV files where structure can't be guaranteed at compile time.
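For reference, a `combine` along these lines could be sketched as follows. The signature and the single-abstract-method style are assumptions based on the description above, not the PR's actual code:

```scala
// Sketch of a composable recovery interface; names follow the PR description.
trait RecoveryStrategy:
  def recover(row: Vector[String], expected: Int): Option[Vector[String]]

object RecoveryStrategy:
  /** Try `first`; if it declines (returns None), fall back to `second`. */
  def combine(first: RecoveryStrategy, second: RecoveryStrategy): RecoveryStrategy =
    new RecoveryStrategy:
      def recover(row: Vector[String], expected: Int): Option[Vector[String]] =
        first.recover(row, expected).orElse(second.recover(row, expected))

// SAM conversion keeps the individual strategies short:
val pad: RecoveryStrategy = (row, n) =>
  Option.when(row.size < n)(row ++ Vector.fill(n - row.size)(""))
val truncate: RecoveryStrategy = (row, n) =>
  Option.when(row.size > n)(row.take(n))

// Pads short rows, truncates long ones; declines on already-valid rows.
val padOrTruncate = RecoveryStrategy.combine(pad, truncate)
```

The `orElse` chaining means the combined strategy handles both "too few" and "too many" fields with one value, which is the structural-error case the Options approach cannot reach.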
I'd appreciate your feedback on these changes before considering this PR ready for merge.
I looked through this in more detail, and I've realised that you are writing your own entry points for the CSV file.
I'd want to understand how this is compatible with the fundamental concept of the library being an Iterator / Iterable of NamedTuples. Given the constraints surrounding that design, I don't see how this concept can work.
As things stand I think this PR is irreconcilable with the goals of this library. I may be wrong and am happy for you to write up some test cases demonstrating how it could be integrated.
@Quafadas, I think there is a place for these recovery strategies.
This would be a nice feature, dare I say, to implement on the main entry points of the library. Instead of using the Either type, I would provide warnings, though. In the best case, the library would help the user work on the data right away. And in the worst case, it would provide a reasonable error message. This is the most ergonomic way I can think of.
Having a formal approach, such as processing the file on a separate function, has its advantages, but I don't think this is necessary here.[^1]
That said, I'm not sure whether this would be straightforward to implement on the main entry points. If you give the thumbs up, I'm available to assist in implementing this idea (if krishivseth wishes, that is).
[^1]: For example, Python has this nice feature where, every time you try to exit one of its shells by just typing `exit`, it tells you to type `exit()` or Ctrl-D instead. This is funny; Python knows exactly what you want to do, and it knows exactly how to do it (but doesn't). Python opts for a more idealistic approach that helps keep the language sane instead of a pragmatic one that would make the user's life easier.
@Dorodri - I'm not against the principle.
My point is that I don't see how this particular PR is compatible with the design of the library.
Perhaps I am wrong, and I am open to reading test cases or examples which convince me. Currently, I do not see that in this PR. If you publish locally or fire this branch up in a console, I would be interested to see whether you share that opinion.
> I'm not against the principle.
That's nice to know.
> My point is that I don't see how this particular PR is compatible with the design of the library.
> I would be interested to see if you share the opinion.
I agree: I don't think this PR, as is, is compatible with the library's concept.
My own point here is that the code is still valuable. So I'm raising the hypothesis; it may require some work, but maybe we can work out a solution to integrate these recovery strategies alongside the existing parser. What do you think?
Ideally, this error handling would be part of entry points like `CSV.absolutePath` (as it would work at the parser layer). That said, I'm still intrigued to see if this is actually an option. I'll try to put together a working solution to demonstrate.
Hi @Quafadas and @Dorodri,
Thanks again for the detailed feedback on this PR. I agree that the standalone CsvPreprocessor approach adds an extra step and doesn't quite fit the library's goal of a seamless compile-time to runtime flow.
Following up on Dorodri's suggestion, I think integrating the recovery strategies directly into the main `CsvIterator` makes the most sense. This way, one call to `CSV.absolutePath("data.csv")` (or the other entry points) would handle structural errors internally.
Here's the proposed plan:

1. **Modify `CsvIterator`:**
   - Internally, it will check if the number of parsed fields matches the expected count from the headers.
   - If there's a mismatch, it will apply a default recovery strategy (e.g., `RecoveryStrategy.combine(PadWithNulls, Truncate)`).
   - We can make the strategy configurable if needed later, but start with a sensible default.
2. **Log Warnings:**
   - If a row is successfully padded or truncated, log a `[WARNING]` message indicating recovery occurred on that line.
3. **Skip Unrecoverable Rows:**
   - If the strategy cannot fix the row (or returns `None`), log an `[INFO]` message that the row was skipped and proceed to the next line.
4. **Clean API:**
   - The `CsvIterator` returned by `CSV.absolutePath` etc. will only yield valid or successfully recovered rows as NamedTuples.
   - The `CsvPreprocessor` object can then be removed.
This approach keeps the existing API entry points, integrates the recovery logic directly into the iterator flow, and provides informative warnings without halting execution.
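The iterator flow described above could be sketched roughly like this. The class name, the `log` parameter, and the "empty row is unrecoverable" rule are all assumptions for illustration; the real integration would sit inside `CsvIterator` rather than wrap it:

```scala
// Sketch: an iterator that pads short rows, truncates long ones, logs a
// [WARNING] when it recovers a row, and skips ([INFO]) rows it cannot fix.
final class RecoveringRows(
    rows: Iterator[Vector[String]],
    expected: Int,
    log: String => Unit = System.err.println
) extends Iterator[Vector[String]]:
  private val repaired = rows.zipWithIndex.flatMap { (row, i) =>
    val line = i + 1
    if row.isEmpty then
      // Treat empty rows as unrecoverable: skip, but keep going.
      log(s"[INFO] line $line: skipped unrecoverable (empty) row")
      None
    else if row.size < expected then
      log(s"[WARNING] line $line: padded ${expected - row.size} missing field(s)")
      Some(row ++ Vector.fill(expected - row.size)(""))
    else if row.size > expected then
      log(s"[WARNING] line $line: truncated ${row.size - expected} extra field(s)")
      Some(row.take(expected))
    else Some(row) // already structurally valid
  }
  def hasNext: Boolean = repaired.hasNext
  def next(): Vector[String] = repaired.next()
```

Because only valid or recovered `Vector[String]` rows are ever yielded, the downstream NamedTuple conversion never sees a row of the wrong width, which is what keeps this compatible with the compile-time typing.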
Does this integration plan sound like a good path forward? Are there edge cases with the default "pad/truncate then skip" logic (like handling comment lines or footers) that we should consider more explicitly? Happy to discuss further; please feel free to suggest changes and help implement whichever approach works best!