fgetcsv() incorrectly parses a specific UTF-8 BOM encoded CSV
Description
When the first value in a CSV file is empty and the values are enclosed in double quotation marks,
the PHP function fgetcsv parses the double quotation marks as part of the string value, but only when the file is encoded as UTF-8 with a BOM.
Use a CSV file saved as UTF-8 with a BOM:
"","key","key2"
"name","val","val2"
*The FIRST CELL is EMPTY
The following code:
<?php
$csvFile = fopen("utf8Bom.csv", "r");
while ($line = fgetcsv($csvFile)) {
    for ($idx = 0, $size = count($line); $idx < $size; $idx++) {
        echo "[" . $line[$idx] . "]";
    }
    echo "<br>";
}
fclose($csvFile);
?>
Resulted in this output:
[""][key][key2]
[name][val][val2]
The first cell has "" as a value.
But I expected this output instead:
[][key][key2]
[name][val][val2]
NOTE: a CSV file with the same values encoded as SJIS parses the first "" cell as an empty value, which is correct.
Environment: IIS on Windows Server 2022. Could someone who has PHP 8.x available please test there as well?
PHP Version
PHP 7.4.29
Operating System
Windows Server 2022
I tested with PHP 8.0.13 and got the same result.
Environment:
Windows10 Apache. PHP 8.0.13

$csvFile = fopen("utf8Bom.csv", "r");
fseek($csvFile, 3); //seek to after BOM
while($line = fgetcsv($csvFile)) {
...
Inserting the second line (the fseek() call) makes the parse correct.
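A minimal sketch of a more defensive variant of this workaround, which skips the three bytes only when they really are a UTF-8 BOM. The helper name openCsvSkippingBom() and the temporary demo file are assumptions made for illustration; they are not part of PHP or of the original report.

```php
<?php
// Hypothetical helper (not part of PHP): opens a CSV file and leaves the
// stream positioned just past a UTF-8 BOM, if one is present.
function openCsvSkippingBom(string $path)
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return false;
    }
    // The UTF-8 BOM is the three bytes EF BB BF.
    if (fread($handle, 3) !== "\xEF\xBB\xBF") {
        rewind($handle); // no BOM: start again from the first byte
    }
    return $handle;
}

// Demo data mimicking the utf8Bom.csv file from the report.
$path = tempnam(sys_get_temp_dir(), 'csv');
file_put_contents(
    $path,
    "\xEF\xBB\xBF\"\",\"key\",\"key2\"\r\n\"name\",\"val\",\"val2\"\r\n"
);

$rows = [];
$csvFile = openCsvSkippingBom($path);
// Arguments are passed explicitly; they match the long-standing defaults.
while (($line = fgetcsv($csvFile, null, ',', '"', '\\')) !== false) {
    $rows[] = $line;
}
fclose($csvFile);
unlink($path);

var_dump($rows[0][0] === ''); // the first cell is now an empty string
```

Unlike an unconditional fseek($csvFile, 3), this also works for files saved without a BOM, since the stream is rewound when the first three bytes are ordinary data.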
There is no special treatment for BOMs; instead they are treated just like any other characters, and since a field starting with arbitrary characters followed by a pair of double quotes is not really valid CSV, this is what you get. See https://3v4l.org/NZqhl for a better reproducer which shows what's happening.
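The same effect can be sketched without a file on disk. This is an illustration under assumptions, not the content of the linked reproducer: parseFirstRow() is a hypothetical helper that feeds a string to fgetcsv() through an in-memory stream.

```php
<?php
// Hypothetical helper for this illustration: parse the first CSV record of
// $data with fgetcsv(), using an in-memory stream instead of a real file.
function parseFirstRow(string $data): array
{
    $stream = fopen('php://memory', 'r+');
    fwrite($stream, $data);
    rewind($stream);
    // Arguments are passed explicitly; they match the long-standing defaults.
    $row = fgetcsv($stream, null, ',', '"', '\\');
    fclose($stream);
    return $row;
}

$plain   = parseFirstRow("\"\",\"key\",\"key2\"\r\n");
$withBom = parseFirstRow("\xEF\xBB\xBF\"\",\"key\",\"key2\"\r\n");

// Without a BOM the field starts with the enclosure, so it is a quoted
// empty string.
var_dump($plain[0] === '');

// With a BOM the field starts with other bytes, so it is read as an
// unquoted field and the two quote characters stay part of the value.
var_dump(bin2hex($withBom[0]));
```

Dumping the field as hex makes the otherwise invisible BOM bytes visible in front of the two quote characters.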
Unfortunately, CSV has never been standardized; the closest is likely the informational RFC 4180, but that doesn't clarify how such character sequences are supposed to be handled. That RFC does not even say how different character encodings are to be handled (it does mention the MIME type, but that does not apply when reading from a local file), let alone Unicode or BOMs.
Given that the Unicode standard does not recommend a BOM with UTF-8 encoding, and that PHP otherwise ignores BOMs, I think we should close this as WONTFIX (or, better, document the behavior).
@Girgias, any thoughts on BOMs in CSV files? How are these handled by PECL/csv?
I've never thought about that, so I would need to test with PECL/csv how it behaves, but I'm expecting it to throw an error due to the incorrect handling of the enclosure character. Indeed by the spec:
- Each field may or may not be enclosed in double quotes (however some programs, such as Microsoft Excel, do not use double quotes at all). If fields are not enclosed with double quotes, then double quotes may not appear inside the fields. For example:
"aaa","bbb","ccc" CRLF
zzz,yyy,xxx
- Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes. For example:
"aaa","b CRLF
bb","ccc" CRLF
zzz,yyy,xxx
- If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote. For example:
"aaa","b""bb","ccc"
So I'm not sure what should be the actual behaviour (but that's a great test case for PECL/csv)
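For comparison, a short sketch of how PHP's own parser handles the well-formed cases quoted from the spec above, using str_getcsv() on string literals. The variable names are only illustrative.

```php
<?php
// A doubled double quote inside an enclosed field is read as one literal
// quote. Arguments are passed explicitly; they match the long-standing defaults.
$escaped = str_getcsv('"aaa","b""bb","ccc"', ',', '"', '\\');

// A line break inside an enclosed field stays part of that field.
$multiline = str_getcsv("\"aaa\",\"b\r\nbb\",\"ccc\"", ',', '"', '\\');

var_dump($escaped);
var_dump($multiline);
```

Both inputs start each field with the enclosure character, which is the case the quoted rules cover. The BOM case differs because the first field starts with other bytes.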
I just tried on PECL/csv and the behaviour currently is:
array(3) {
[0]=>
string(3) ""
[1]=>
string(3) "key"
[2]=>
string(4) "key2"
}
for:
<?php
$string = "\xEF\xBB\xBF\"\",\"key\",\"key2\"\r\n";
var_dump(CSV::rowToArray($string));
Which is, something
I just tried on PECL/csv and the behaviour currently is:
Would you get the same result for some arbitrary bytes instead of a BOM?
Yes:
<?php
$string = "abcd\"\",\"key\",\"key2\"\r\n";
var_dump(CSV::rowToArray($string));
gives me:
array(3) {
[0]=>
string(4) "abcd"
[1]=>
string(3) "key"
[2]=>
string(4) "key2"
}
But I'm kinda considering this a bug as:
<?php
$string = "key1,ke\"y2,key3\r\n";
var_dump(CSV::rowToArray($string));
gives me:
array(2) {
[0]=>
string(4) "key1"
[1]=>
string(11) "key2,key3
"
}
I just found https://bugs.php.net/49350 and https://bugs.php.net/63433, and both have been closed as not-a-bug.
@cmb69 I would also tend to say that it's not a bug. There is a comment in the documentation talking about this behaviour and how to handle the case. Maybe we should just document it more officially?
Yeah, I noticed that user comment, and agree that we should properly document this, if we choose not to fix it.
The main problem is that Excel saves CSV files with a BOM by default, and MS Excel is one of the most widely used tools for working with CSV files and spreadsheets...