ModelManager icon indicating copy to clipboard operation
ModelManager copied to clipboard

PgEntity convert null value into empty string

Open tlode opened this issue 8 years ago • 7 comments

In some cases, when using a composite field type and PgEntity converter, a nullable field will be converted to an empty string. This is the case when the nullable field is a string type like varchar, text, uuid, ...

The problem is that PgEntity uses str_getcsv to split the composite row value, essential like this

$data = '28c2cf5f-4df9-4e6f-a5d6-a41eaba35285,"test","",';    
$values = str_getcsv($data);

foreach ($values as $val) {
  var_dump($val) . PHP_EOL;
}

Unfortunately this will return an empty string, where it should not be.

string(36) "28c2cf5f-4df9-4e6f-a5d6-a41eaba35285"
string(4) "test"
string(0) ""
string(0) ""

Depending on the type declared in the RowStructure definition, an empty string may be interpreted as null value within the hydration plan (e.g. in case of boolean, numeric, int, ...), but it is not possible to determine the correct value for string types just from the empty string.

As PgComposite also uses str_getcsv I would assume the same behaviour for that converter. But I've not tested that.

Any idea how to avoid str_getcsv?

tlode avatar Feb 15 '17 16:02 tlode

Unfortunately, the actual implementation seems to be a best effort / most efficient solution. Here is the question I left on stack-overflow.

chanmix51 avatar Feb 20 '17 20:02 chanmix51

I've also experimented with the approach of using a pcre as a specific composite literal parser. Best I've got so far is the one I've found here https://regex101.com/library/oU3yV6

The idea was to keep the quotes in the result fields to distinguish between the empty string or a null value.

With the following modified PgEntity::transformData method, all tests pass, except the one with the escaped hstore value (34,,"",,"{4,3}","""pika"" => ""\\\\\\"chu, rechu""")

private function transformData($data, Projection $projection)
    {
        preg_match_all('/"(?:[^"]|"")*"|[^,\n]+|(?=,)(?<=,)|^|(?<=,)$/', $data, $values);

        $definition     = $projection->getFieldNames();
        $out_values     = [];
        $values_count   = count($values[0]);

        for ($index = 0; $index < $values_count; $index++) {
            $value = '' !== $values[0][$index]
                ? strtr(trim($values[0][$index], '"'), ['""' => '"'])
                : null;

            $out_values[$definition[$index]] = preg_match(':^{.*}$:', $value)
                ? stripcslashes($value)
                : $value;
        }

        return $out_values;
    }

tlode avatar Feb 20 '17 21:02 tlode

This reminds me the dark age of Pomm 1.x with the performance disaster associated with PHP’s preg_match (which simply segfault when the field gets over 8kb).

chanmix51 avatar Feb 21 '17 09:02 chanmix51

I agree. This was more of an experiment I wanted to share. So what could be a solution then?

It is obvious to me that str_getcsv isn't the right solution either, because we don't deal with csv in this context. It's close though. A composite literal must have a well defined format. Using a better, performant regular expression to find the outer most field boundaries, might not be completely off track.

tlode avatar Feb 21 '17 12:02 tlode

The idea raised by Daniel Vérité is to create a state machine for that case. It’s imho a better idea than using a regular expression (which I did in the previous version of Pomm).

chanmix51 avatar Feb 22 '17 10:02 chanmix51

Ok, I’ll sum up this thread of Stack-Overflow here about this bug.

There is a mind gap between PHP and Postgres community. To handle polymorphism properly, Pg does type NULL (no defined value) so NULL::text or NULL::int4 do mean no defined value for this type. For PHP, NULL is not of a type because it is undefined. As the CSV files return strings, theses pieces of string cannot be null because they are typed hence ("a","",,"b") means ['a', '', '', 'b'].

This behavior complies with the RFC4180. Even though this RFC is only informational and has been drew to make the mime-type text/csv compatible with Ms/Excel, it is the model for how PHP parses CSV lines.

There is (almost) no hope to patch the code of the underlying C function.

Choices are now:

  • either to code a state machine to parse CSV in PHP with the performance issues and bugs one can expect
  • or to let this bug as is and document it
  • or to remove this functionality with its bug from Pomm

Any thoughts ?

chanmix51 avatar Mar 20 '17 09:03 chanmix51

I try a short code to have that without regex.

Not perfect, but it is a starting point.

Examples at the end.

Not tested yet onPgEntity::transformData, I wrote just a proof of concept.


class Runner
{
    const CHAR_SEP = ',';
    const CHAR_DBQ = '"';

    protected $inCell      = false;
    protected $cellDbq     = false;
    protected $endCellDbq  = false;
    protected $cell        = null;
    protected $lastIdx     = 0;


    public function __construct(int $lastIdx)
    {
        $this->lastIdx = $lastIdx;
    }

    protected function reinit() {
        $this->cellDbq     = false;
        $this->inCell      = false;
        $this->endCellDbq  = false;
    }

    protected function isLastChar(int $idx): bool
    {
        return $idx === $this->lastIdx;
    }

    public function look(int $idx, string $char)
    {

        if ($this->inCell) {
            // Previous state IN CELL
            if ($this->cellDbq) {
                if (self::CHAR_DBQ === $char) {
                    $this->endCellDbq = true;
                } elseif ($this->endCellDbq && self::CHAR_SEP === $char) {
                    $this->reinit();
                } else {
                    $this->cell .= $char;
                }
            } else {
                if (self::CHAR_SEP !== $char) {
                    $this->cell .= $char;
                }
    
                if (self::CHAR_SEP === $char) {
                    $this->reinit();
                }
            }
        } else {
            // Previous state NOT IN CELL
            if (self::CHAR_SEP === $char) {
                $this->reinit();
            } else {
                $this->inCell = true;
                if (self::CHAR_DBQ === $char) {
                    $this->cellDbq = true;
                    $this->cell = '';
                } else {
                    $this->cellDbq = false;
                    $this->cell .= $char;
                }
            }
        }
    }

    public function canRelease(): bool
    {
        return !$this->inCell && !$this->cellDbq && !$this->endCellDbq;
    }

    public function getCell()
    {
        $out = $this->cell;
        $this->cell = null;

        return $out;
    }
}




class CompositeValueExtractor
{
    protected $encoding;

    public function __construct(string $forceEncoding = '')
    {
        if (!empty($forceEncoding)) {
            $this->encoding = $forceEncoding;
        } else {
            $this->encoding = mb_internal_encoding();
        }
    }

    public function __invoke(string $str): array
    {
        $out = [];

        $len = mb_strlen($str, $this->encoding);
        $end = $len - 1;
        $runner = new Runner($end);

        for ($i = 0; $i <= $end; $i++) {
            $char = mb_substr($str, $i, 1);

            $runner->look($i, $char);

            if ($runner->canRelease()) {
                $out[] = $runner->getCell();
            }
        }

        // end of string, so getting last cell content
        $out[] = $runner->getCell();
        

        return $out;
    }
}


$tests = [
    'cas"tordu,"in","",one,,"pika, chu",,"out"',
    ',"something",',
    ',',
    ',"",',
    '28c2cf5f-4df9-4e6f-a5d6-a41eaba35285,"test","",'
];

$cve = new CompositeValueExtractor();

foreach ($tests as $test) {
    $res = $cve($test);

    var_dump($res);
}

Results are:

array(8) {
  [0]=>
  string(9) "cas"tordu"
  [1]=>
  string(2) "in"
  [2]=>
  string(0) ""
  [3]=>
  string(3) "one"
  [4]=>
  NULL
  [5]=>
  string(9) "pika, chu"
  [6]=>
  NULL
  [7]=>
  string(3) "out"
}
array(3) {
  [0]=>
  NULL
  [1]=>
  string(9) "something"
  [2]=>
  NULL
}
array(2) {
  [0]=>
  NULL
  [1]=>
  NULL
}
array(3) {
  [0]=>
  NULL
  [1]=>
  string(0) ""
  [2]=>
  NULL
}
array(4) {
  [0]=>
  string(36) "28c2cf5f-4df9-4e6f-a5d6-a41eaba35285"
  [1]=>
  string(4) "test"
  [2]=>
  string(0) ""
  [3]=>
  NULL
}

What do you think about that?

malenkiki avatar Jul 04 '19 19:07 malenkiki