spout icon indicating copy to clipboard operation
spout copied to clipboard

Reader doesn't understand which is the last column (ODS file)

Open eugenio11 opened this issue 7 years ago • 10 comments

Hello, for some files (until now, it happened only with ODS files), it seems the reader wrongly see all the 1024 columns as part of the sheet, even if just three columns are used. Basically the reader, from the fourth column on, returns an empty string for every cell.

I am not sure if it's a spout issue and most of the time it doesn't happen, I cannot reproduce the issue starting from a new ODS file but I have an ODS file that triggers the issue, if you need it.

Thanks.

eugenio11 avatar Jun 08 '18 09:06 eugenio11

@eugenio11 Yes - please provide the file in an anonymized form if possible

madflow avatar Jun 08 '18 12:06 madflow

tet.ods.zip

Hello, here is the file

eugenio11 avatar Jun 09 '18 09:06 eugenio11

Hi @eugenio11 !

Here is what the content file of the ODS looks like:

<table:table-row table:style-name="ro1">
	<table:table-cell office:value-type="string">
		<text:p>nome</text:p>
	</table:table-cell>
	<table:table-cell office:value-type="string">
		<text:p>cognome</text:p>
	</table:table-cell>
	<table:table-cell office:value-type="string" table:style-name="ce3">
		<text:p>indirizzo</text:p>
	</table:table-cell>
	<table:table-cell table:number-columns-repeated="1021"/>
</table:table-row>

The last table-cell line is what causes the 1000+ columns to be added. The fix should be to ignore the line if it's the last one in the row, and if the value is empty.

This is the part that should be modified: https://github.com/box/spout/blob/048105461c79502db3af6b2ab5a4c89ec77909c2/src/Spout/Reader/ODS/RowIterator.php#L245. The current logic only took into account the specific Excel behavior but we should probably make it more generic to avoid adding empty trailing columns

adrilo avatar Jun 10 '18 16:06 adrilo

Hello and thanks for the reply. I am wondering how a file like the one I attached can be produced, I don't remember the steps I did to make it and I am not able to reproduce it. About your fix, what if all the 1000+ columns are actually used?

eugenio11 avatar Jun 11 '18 07:06 eugenio11

Some softwares do add this repeated columns cell for some reasons. Some versions of Excel for instance add this. I'm not sure why... I made the change but ended up breaking some tests, which reminded me that your point is good. Sometimes, it is the desired behavior to have empty cells at the end of the row.

I'm gonna look at the official specs to see what we should do in this case. Eventually, we could have something a bit magic: if the last cell is empty and repeated more than X times, then do not add it; X being a magic number to be determined. This approach should solve 99% of the use cases and be the cause of bugs in the 1% of cases where the last cell is actually repeated a lot on purpose.

adrilo avatar Jun 11 '18 07:06 adrilo

I am wondering how a file like the one I attached can be produced, I don't remember the steps I did to make it and I am not able to reproduce it.

You can reproduce such a file by applying a format to the entire row. Even though only the first cell contains data - LibreOffice at least - will consider the rest of the cells as not empty.

madflow avatar Jun 11 '18 07:06 madflow

Hello, thanks! Yes, the solution you mentioned is the same I wanted to use on my side; I think the best way to implement it could be leaving the correct one as the default behavior and add a parameter that allows to set the "magic number".

eugenio11 avatar Jun 11 '18 07:06 eugenio11

Eventually, we could have something a bit magic: if the last cell is empty and repeated more than X times, then do not add it; X being a magic number to be determined.

I would favor introducing the concept of a spreadsheet header or a reader range to Spout.

  • When the first line should be treated as the header - then the complete results will contain the same count of cells - empty or not - defined by the header.

  • Alternatively one could set a reader range (Options::setReaderRange(int $startColumn, int $endColumn)). This would be basically the same like the header concept - but more explicit.

This would probably not work in some use cases, where there is no header and the range is not known. For us - this would solve issues like this. We always use a headerline to define the spreadsheet.

I feel weird about introducing magic number setters to circumvent vendor quirks.

madflow avatar Jun 11 '18 08:06 madflow

So, I looked at the specs but there's nothing regarding what the behavior should be. This means that a table-cell with 1023 repeated columns actually represents 1024 cells! The way some libraries handle this is by reading the entire spreadsheet and keeping a count of the max num columns for each row. Unfortunately, we can't do that...

So @eugenio11, I'm afraid we can't fix this behavior for now. I guess if you know that your spreadsheet is very small, you can store all the data in memory and do the counting of max num columns yourself to find the actual dimensions of the spreadsheet.

@madflow I like your suggestion about using the range (and probably defaulting the start column to 0). If you can change that even after the reading started, one could read the header, count its number of columns and set the range. That way, there's no need to have specific code to handle header row.

adrilo avatar Jun 11 '18 21:06 adrilo

@adrilo thanks for the info; I guess we can implement the "magic number" feature on our side without reading the entire spreadsheet, just the first (or the second, if the first defines labels) row.

eugenio11 avatar Jun 19 '18 08:06 eugenio11