simplexlsx icon indicating copy to clipboard operation
simplexlsx copied to clipboard

Don't parse all numbers using `if ( is_numeric( ... ) )`

Open alinnert opened this issue 2 years ago • 4 comments

Currently, every value gets checked if it's a number using is_numeric(). There are some types of values that pass this check depending on the concrete value. ZIP codes for instance: 12345 => number, but 01234 => string. Or color codes: 000000 => number, but ffffff => string.

This makes working with the data really difficult. If JS consumes this data using an API and I want to convert all color codes to lowercase it would look like this:

// assuming that `color` is of type `string | number`
const lowercaseColor = typeof color === 'string'
  ? color.toLowerCase()
  : color.toString().toLowerCase()

If I knew every color was a string I could just do:

const lowercaseColor = color.toLowerCase()

I have a few ideas how to solve this:

1) Add an option whether numbers should be parsed or not

$result = SimpleXLSX::parse('file.xlsx', parseNumbers: false);

This syntax requires PHP 8.0 or later. Users of older versions would need to write:

$result = SimpleXLSX::parse('file.xlsx', false, false, false);

This could be made nicer using an options array because you can skip all options you don't want to change:

$result = SimpleXLSX::parse($data, ['debug' => true, 'is_data' => true, 'parse_numbers' => false]);
$result = SimpleXLSX::parse('file.xlsx', ['parse_numbers' => false]);

I know, the options array makes the library more complex and is only helpful for older PHP versions. So, this is totally optional IMO.

If the option is set to false numbers can still be parsed later using intval() or floatval(). So, it's still pretty flexible.

2) Store the original value in a separate field

$cell['value'] // -> 123
$cell['rawValue'] // -> "123"

Easier to implement I guess, but probably does unnecessary work and needs a bit more memory.

alinnert avatar Jul 14 '23 12:07 alinnert

https://github.com/shuchkin/simplexlsx#using-rowsex-to-extract-cell-info

shuchkin avatar Jul 14 '23 16:07 shuchkin

Yes, I know this. What I'm saying is: It is not great how this currently works. Converting every potential number is not a good idea. Please let us developers control what gets converted to a number and what doesn't.

alinnert avatar Jul 14 '23 17:07 alinnert

There not only is_numeric, see default case:

case 's':
    // Value is a shared string
    if ((string)$cell->v !== '') {
        $value = $this->sharedstrings[(int)$cell->v];
    }
    break;

case 'str': // formula?
    if ((string)$cell->v !== '') {
        $value = (string)$cell->v;
    }
    break;

case 'b':
    // Value is boolean
    $value = (string)$cell->v;
    if ($value === '0') {
        $value = false;
    } elseif ($value === '1') {
        $value = true;
    } else {
        $value = (bool)$cell->v;
    }

    break;

case 'inlineStr':
    // Value is rich text inline
    $value = $this->_parseRichText($cell->is);

    break;

case 'e':
    // Value is an error message
    if ((string)$cell->v !== '') {
        $value = (string)$cell->v;
    }
    break;

case 'D':
    // Date as float
    if (!empty($cell->v)) {
        $value = $this->datetimeFormat ? gmdate($this->datetimeFormat, $this->unixstamp((float)$cell->v)) : (float)$cell->v;
    }
    break;

case 'd':
    // Date as ISO YYYY-MM-DD
    if ((string)$cell->v !== '') {
        $value = (string)$cell->v;
    }
    break;

default:
    // Value is a string
    $value = (string)$cell->v;

    // Check for numeric values
    if (is_numeric($value)) {
        /** @noinspection TypeUnsafeComparisonInspection */
        if ($value == (int)$value) {
            $value = (int)$value;
        } /** @noinspection TypeUnsafeComparisonInspection */ elseif ($value == (float)$value) {
            $value = (float)$value;
        }
    }

this code cover 99% common data, use 'format' from rowsEx meta data to detect formatting.

PS1: In XML there only numbers integers or floats. In start of developing this class, I decided to leave the formatting to the developers (parsed only dates)

PS2: I have correct xlsx, when colors as strings Screenshot_917

public function testFFFFFF() {
    $xlsx = SimpleXLSX::parseFile(__DIR__ . '/../xlsx/ffffff.xlsx', true);
    print_r($xlsx->rows());
}

and I have correct results

Array
(
    [0] => Array
        (
            [0] => 000000
        )

    [1] => Array
        (
            [0] => ffffff
        )

)

PS3: Just comment

    if (is_numeric($value)) {
        /** @noinspection TypeUnsafeComparisonInspection */
        if ($value == (int)$value) {
            $value = (int)$value;
        } /** @noinspection TypeUnsafeComparisonInspection */ elseif ($value == (float)$value) {
            $value = (float)$value;
        }
    }

And deny updates from composer

shuchkin avatar Jul 15 '23 04:07 shuchkin

Oh, yes, I'm sorry. The color 000000 gets treated as a string because of the 0s. My bad. But 111111 would get parsed into a number.

Anyway, I've looked a little more into the format property. In Excel I can change the format of a cell by right clicking it, select "Format Cells..." and choose another format in the dialog. There's a "Text" format. If I select that the value of $cell['format'] becomes "@". Would it make sense to check for this format in the default case? Like this:

default:
    $value = (string) $cell->v;

    /* I'm not sure how to really check for the format here. But something like this. */
    if ( $cell['format'] !== '@' && is_numeric( $value ) ) {
        /* ... do number conversion ... */
    }

Would that be ok?

EDIT:

I've looked into the source code. I see it's not that simple. format is not yet available at this point. So, that's a problem. Hm...

alinnert avatar Jul 15 '23 11:07 alinnert