Don't parse all numbers using `if ( is_numeric( ... ) )`
Currently, every value gets checked if it's a number using is_numeric(). There are some types of values that pass this check depending on the concrete value. ZIP codes for instance: 12345 => number, but 01234 => string. Or color codes: 000000 => number, but ffffff => string.
This makes working with the data really difficult. If JS consumes this data using an API and I want to convert all color codes to lowercase it would look like this:
// assuming that `color` is of type `string | number`
const lowercaseColor = typeof color === 'string'
? color.toLowerCase()
: color.toString().toLowerCase()
If I knew every color was a string I could just do:
const lowercaseColor = color.toLowerCase()
I have a few ideas how to solve this:
1) Add an option whether numbers should be parsed or not
$result = SimpleXLSX::parse('file.xlsx', parseNumbers: false);
This syntax requires PHP 8.0 or later. Users of older versions would need to write:
$result = SimpleXLSX::parse('file.xlsx', false, false, false);
This could be made nicer using an options array because you can skip all options you don't want to change:
$result = SimpleXLSX::parse($data, ['debug' => true, 'is_data' => true, 'parse_numbers' => false]);
$result = SimpleXLSX::parse('file.xlsx', ['parse_numbers' => false]);
I know, the options array makes the library more complex and is only helpful for older PHP versions. So, this is totally optional IMO.
If the option is set to false numbers can still be parsed later using intval() or floatval(). So, it's still pretty flexible.
2) Store the original value in a separate field
$cell['value'] // -> 123
$cell['rawValue'] // -> "123"
Easier to implement I guess, but probably does unnecessary work and needs a bit more memory.
https://github.com/shuchkin/simplexlsx#using-rowsex-to-extract-cell-info
Yes, I know this. What I'm saying is: It is not great how this currently works. Converting every potential number is not a good idea. Please let us developers control what gets converted to a number and what doesn't.
There not only is_numeric, see default case:
case 's':
// Value is a shared string
if ((string)$cell->v !== '') {
$value = $this->sharedstrings[(int)$cell->v];
}
break;
case 'str': // formula?
if ((string)$cell->v !== '') {
$value = (string)$cell->v;
}
break;
case 'b':
// Value is boolean
$value = (string)$cell->v;
if ($value === '0') {
$value = false;
} elseif ($value === '1') {
$value = true;
} else {
$value = (bool)$cell->v;
}
break;
case 'inlineStr':
// Value is rich text inline
$value = $this->_parseRichText($cell->is);
break;
case 'e':
// Value is an error message
if ((string)$cell->v !== '') {
$value = (string)$cell->v;
}
break;
case 'D':
// Date as float
if (!empty($cell->v)) {
$value = $this->datetimeFormat ? gmdate($this->datetimeFormat, $this->unixstamp((float)$cell->v)) : (float)$cell->v;
}
break;
case 'd':
// Date as ISO YYYY-MM-DD
if ((string)$cell->v !== '') {
$value = (string)$cell->v;
}
break;
default:
// Value is a string
$value = (string)$cell->v;
// Check for numeric values
if (is_numeric($value)) {
/** @noinspection TypeUnsafeComparisonInspection */
if ($value == (int)$value) {
$value = (int)$value;
} /** @noinspection TypeUnsafeComparisonInspection */ elseif ($value == (float)$value) {
$value = (float)$value;
}
}
this code cover 99% common data, use 'format' from rowsEx meta data to detect formatting.
PS1: In XML there only numbers integers or floats. In start of developing this class, I decided to leave the formatting to the developers (parsed only dates)
PS2: I have correct xlsx, when colors as strings
public function testFFFFFF() {
$xlsx = SimpleXLSX::parseFile(__DIR__ . '/../xlsx/ffffff.xlsx', true);
print_r($xlsx->rows());
}
and I have correct results
Array
(
[0] => Array
(
[0] => 000000
)
[1] => Array
(
[0] => ffffff
)
)
PS3: Just comment
if (is_numeric($value)) {
/** @noinspection TypeUnsafeComparisonInspection */
if ($value == (int)$value) {
$value = (int)$value;
} /** @noinspection TypeUnsafeComparisonInspection */ elseif ($value == (float)$value) {
$value = (float)$value;
}
}
And deny updates from composer
Oh, yes, I'm sorry. The color 000000 gets treated as a string because of the 0s. My bad. But 111111 would get parsed into a number.
Anyway, I've looked a little more into the format property. In Excel I can change the format of a cell by right clicking it, select "Format Cells..." and choose another format in the dialog. There's a "Text" format. If I select that the value of $cell['format'] becomes "@". Would it make sense to check for this format in the default case? Like this:
default:
$value = (string) $cell->v;
/* I'm not sure how to really check for the format here. But something like this. */
if ( $cell['format'] !== '@' && is_numeric( $value ) ) {
/* ... do number conversion ... */
}
Would that be ok?
EDIT:
I've looked into the source code. I see it's not that simple. format is not yet available at this point. So, that's a problem. Hm...