Using select with explicit header names requires all the column names to be specified
Consider the following example.
csv = """1,2,3
1,2,3
1,2,3"""
CSV.File(IOBuffer(csv), select=[2, 3], header=0)
3-element CSV.File{false}:
CSV.Row: (Column2 = 2, Column3 = 3)
CSV.Row: (Column2 = 2, Column3 = 3)
CSV.Row: (Column2 = 2, Column3 = 3)
CSV.File("data.csv", select=[2, 3], header=["a", "b", "c"])
3-element CSV.File{false}:
CSV.Row: (b = 2, c = 3)
CSV.Row: (b = 2, c = 3)
CSV.Row: (b = 2, c = 3)
CSV.File("data.csv", select=[2, 3], header=["b", "c"])
thread = 1 warning: parsed expected 2 columns, but didn't reach end of line around data row: 1. Ignoring any extra columns on this row
thread = 1 warning: parsed expected 2 columns, but didn't reach end of line around data row: 2. Ignoring any extra columns on this row
thread = 1 warning: parsed expected 2 columns, but didn't reach end of line around data row: 3. Ignoring any extra columns on this row
3-element CSV.File{false}:
CSV.Row: (c = 2,)
CSV.Row: (c = 2,)
CSV.Row: (c = 2,)
I find this rather surprising. I can either specify all the column names, which is tedious for a file with a large number of columns, or go with header=0 and rename the columns afterwards, which feels like an unnecessary step.
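For reference, the header=0 workaround could look roughly like the sketch below. It assumes DataFrames.jl is available for the rename step (the thread does not say which sink is in use), and it reuses the inline csv string from the example rather than a file on disk.

```julia
using CSV, DataFrames

csv = """1,2,3
1,2,3
1,2,3"""

# Parse without a header row; the kept columns get the automatic
# names Column2 and Column3, as in the first example above.
df = DataFrame(CSV.File(IOBuffer(csv); header=0, select=[2, 3]))

# Rename only the selected columns after parsing.
rename!(df, [:b, :c])
```

This avoids listing every column name, at the cost of the extra rename step.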
Sorry for the slow response here. Yes, I can see how this is confusing, but when you provide the header, you are expected to provide all the column names. That is awkward with select when you only want a few columns, but I'm not quite sure what we could do that would be better here. In case it's not obvious: even when you're selecting a subset of columns, we still need to "skip" over the other columns while parsing each row.
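Given that constraint, one way to avoid typing every name by hand is to generate a full-width header and fill in only the names that matter. This is a minimal sketch, not an official recipe: it assumes the total column count (ncols) is known up front, and the wanted mapping and the skip placeholder names are made up for illustration.

```julia
using CSV, DataFrames

csv = """1,2,3
1,2,3
1,2,3"""

ncols = 3                          # assumption: total column count is known
wanted = Dict(2 => "b", 3 => "c")  # hypothetical: column index => desired name

# Build one name per column; unwanted columns get a throwaway placeholder.
header = [get(wanted, i, "skip$i") for i in 1:ncols]

# The header now covers every column, so select keeps only the named ones.
df = DataFrame(CSV.File(IOBuffer(csv); header=header, select=[2, 3]))
```

This only helps when every row has the same known number of columns.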
In general, I've come to think that we perhaps give too much weight to user-provided headers. If a header is provided, we basically take it as absolute truth for the number of columns. Perhaps we need to rethink the approach here and have CSV.jl do more of its own work to determine what's actually in the file, allowing header to "rename" columns after the fact. For example, we could allow passing header as a Dict to rename columns while parsing.
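Until something like that exists, the Dict idea can be approximated outside the parser. The sketch below is purely illustrative: read_renamed is a hypothetical helper, not part of CSV.jl, and it assumes DataFrames.jl plus a file with no header row, so that column i is auto-named Column plus its index.

```julia
using CSV, DataFrames

# Hypothetical helper (not a CSV.jl function): `renames` maps a column
# index to the name that column should have in the result.
function read_renamed(source, renames::Dict{Int,String})
    cols = sort(collect(keys(renames)))
    # Let CSV.jl work out the column count itself; keep only the mapped columns.
    df = DataFrame(CSV.File(source; header=0, select=cols))
    # With header=0, column i is auto-named "Column$i"; rename after the fact.
    rename!(df, [Symbol("Column$i") => Symbol(renames[i]) for i in cols])
    return df
end

csv = """1,2,3
1,2,3
1,2,3"""

df = read_renamed(IOBuffer(csv), Dict(2 => "b", 3 => "c"))
```

Unlike the proposed feature, this renames after parsing rather than during it, so it says nothing about how a Dict-valued header would behave inside CSV.jl.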
Renaming columns while parsing is a feature I would definitely appreciate.
I just encountered this issue.
The main problem is that the file I was reading used spaces both as delimiters and inside a text field, so the number of columns per row was variable. I only needed the first few columns, so I thought I could select them by index and then provide just their names. Nope!
I got warnings about ignoring extra columns, which seemed fine, but I'm confused about why it would drop rows.