Line break escape sequence (\R) documentation is incomplete
Description
The line break escape sequence (\R) seems to break certain UTF-8 characters, in this example ą.
The following code:
<?php
$line = "Urządzenie Z-Wave nie odpowiedziało.";
echo preg_replace("/\R+/", "\n", $line);
echo preg_replace("/[\n\r]+/", "\n", $line);
Resulted in this output:
Urz�
dzenie Z-Wave nie odpowiedziało.
Urządzenie Z-Wave nie odpowiedziało.
But I expected this output instead:
Urządzenie Z-Wave nie odpowiedziało.
Urządzenie Z-Wave nie odpowiedziało.
PHP Version
8.3.17/8.4.4
Operating System
No response
That's interesting.
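ą is encoded as the bytes C4 85 in UTF-8. Without the /u modifier, PCRE works byte by byte and reads the 0x85 byte as U+0085, which is Next Line (NEL). Ref: https://www.compart.com/en/unicode/U+0085
So \R matches the 0x85 byte.
With the /u modifier, it works fine on UTF-8 input.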
Please see: https://3v4l.org/btQGs
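A minimal sketch of the difference, assuming PCRE's default \R setting. The subject spells ą as explicit byte escapes so the result does not depend on the encoding of the script file. The variable names are illustrative only.

```php
<?php
// "ą" is U+0105, which UTF-8 encodes as the two bytes 0xC4 0x85.
$line = "Urz\xC4\x85dzenie";

// 8-bit mode: \R sees the lone 0x85 byte as a line break and replaces it,
// leaving an orphaned 0xC4 byte behind.
$bytes = preg_replace("/\R+/", "\n", $line);

// UTF-8 mode (/u): the subject is read as code points, so 0xC4 0x85 is
// U+0105 and \R finds nothing to replace.
$utf8 = preg_replace("/\R+/u", "\n", $line);

var_dump($bytes === $line); // bool(false)
var_dump($utf8 === $line);  // bool(true)
echo bin2hex($bytes), "\n"; // 55727ac40a647a656e6965
```

The orphaned 0xC4 byte is what shows up as the replacement character in the reported output.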
Looks like it has behaved this way since PHP 5.2.2, which introduced PCRE 7.0.
Anyway, this is not a PHP bug (and it does not seem to be a PCRE bug either), so I will close this.
Looks like a bug to me.
That is documented behavior:
Outside a character class, by default, the escape sequence \R matches any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent to the following:
(?>\r\n|\n|\x0b|\f|\r|\x85)
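The equivalence quoted above can be checked empirically. The following is a rough sketch, assuming a PCRE build that uses the default \R behaviour (any Unicode newline sequence); it tests every possible single byte against \R without the /u modifier.

```php
<?php
// Collect every single byte that \R matches on its own in 8-bit (non-/u) mode.
$matched = [];
for ($byte = 0; $byte <= 0xFF; $byte++) {
    // The D modifier makes $ match only at the very end of the subject.
    if (preg_match('/^\R$/D', chr($byte)) === 1) {
        $matched[] = sprintf('0x%02X', $byte);
    }
}
echo implode(' ', $matched), "\n"; // 0x0A 0x0B 0x0C 0x0D 0x85

// The two-byte sequence CR LF is matched as a single unit as well.
var_dump(preg_match('/^\R$/D', "\r\n") === 1); // bool(true)
```

The 0x85 entry is the one that collides with the second byte of ą and of any other UTF-8 character whose encoding contains 0x85.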
Then maybe that should also be stated clearly at https://www.php.net/manual/en/regexp.reference.escape.php
I agree with @alecpl. The PHP documentation is somewhat limited here: there are effectively two modes (8-bit and UTF-8), and the current entry does not make that clear:
\R line break: matches \n, \r and \r\n
Agreed this is a documentation bug. Moving.