[bug] Unexpected behavior when reading data with an uneven number of columns per row
I am not 100% sure whether this is a bug or I'm just doing something wrong, but based on the documentation, this behavior should not happen the way it does.
I am trying to use datatable to read DNS zone file data, which has a very uneven number of columns per row, but I ran into an issue that depends on whether the row with the maximum number of columns is among the first rows of the file. If it is, everything works fine; otherwise datatable throws an exception. I already use fill=True to fill the missing fields, but this doesn't help.
To reproduce this issue, I created a small repo with code examples and sample data: scattenlaeufer/datatable_column_error
It also contains brief instructions on how to reproduce the issue.
The expected behavior would be for datatable to read files with an uneven number of columns per row, regardless of where the line with the most columns is located.
My environment is:
- Arch Linux
- Python 3.9.2
- datatable 1.0.0 (current main branch)
Are the comments in the TSV files (works, error) part of the data?
Yes. They are what makes the number of columns uneven and causes the error.
Choosing error and works as names in the TSV files was done to make sure I had picked the right files. Maybe I should have chosen something more self-explanatory, sorry for that.
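To illustrate, here are some hypothetical zone-file-style TSV rows (not the actual sample data from the repo) where a trailing comment field makes one row wider than the others:

example.com.	3600	IN	A	192.0.2.1
example.com.	3600	IN	NS	ns1.example.com.
example.com.	3600	IN	A	192.0.2.2	; this comment widens the row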
Probably a bug that the maintainers can look into. I tried it out using R's data.table with the fill=TRUE option, and it worked fine for both the good and the bad data.tsv.
I found a workaround that makes datatable a viable option for my case: the data gets read correctly when the file contains a header line giving each column a name (see the sketch below). But this might not be an applicable solution for every use case, especially when the data is delivered in an archive.
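As an illustration, here is a minimal, untested sketch of that workaround, using the same synthetic data as the reproduction further down the thread (the column names C0..C3 are made up):

import datatable as dt

# Same shape as the reproduction below: the widest row sits deep
# inside the file, past the rows used for column detection.
lines = ["1,2,3"] * 1000000
lines[405874] = "1,2,3,4"

# Workaround: prepend a header line naming every column, so fread
# knows the full width up front; fill=True pads the short rows.
text = "C0,C1,C2,C3\n" + "\n".join(lines)
df = dt.fread(text, fill=True)
print(df.shape)  # expected: (1000000, 4)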
@scattenlaeufer Glad you found a fix for it ... Maybe you can write a blog about it and share your experience?
There's indeed an issue when extra columns are encountered mid-file. A trivial example that demonstrates the behavior looks like this:
>>> import datatable as dt
>>> lines = ["1,2,3"] * 1000000
>>> lines[405874] = "1,2,3,4"
>>> text = "\n".join(lines)
>>> dt.fread(text, fill=True)
IOError: Too many fields on line 405875: expected 3 but more are present. <<1,2,3,4>>
Note that the same problem exists in R data.table too, only instead of an error it will emit a warning and return partial data:
> lines = rep("1,2,3", 1000000)
> lines[405874] = "1,2,3,4"
> src = paste(lines, collapse="\n")
> data.table:::fread(src, fill=TRUE)
V1 V2 V3
1: 1 2 3
2: 1 2 3
3: 1 2 3
4: 1 2 3
5: 1 2 3
---
405869: 1 2 3
405870: 1 2 3
405871: 1 2 3
405872: 1 2 3
405873: 1 2 3
Warning message:
In data.table:::fread(src, fill = TRUE) :
Stopped early on line 405874. Expected 3 fields but found 4. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<1,2,3,4>>
This issue can be fixed, though it's not exactly easy.
@st-pasha Just to comment on your error message: ignoring line 405874 would be an ideal fix, right? I was in the process of migrating from pandas to datatable, and this was already available with pandas (see my issue #2964). Shouldn't the fix be pretty straightforward, considering that instead of just throwing an exception we could have an additional parameter to ignore this specific exception?
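For reference, a minimal sketch of the pandas behavior mentioned above, assuming a pandas version that supports the on_bad_lines option (1.3 or later):

import io
import pandas as pd

text = "1,2,3\n1,2,3,4\n1,2,3\n"

# pandas infers 3 columns from the first row and, with
# on_bad_lines="skip", silently drops the over-wide second row.
df = pd.read_csv(io.StringIO(text), header=None, on_bad_lines="skip")
print(df)  # two rows; the "1,2,3,4" line was skipped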