doc-en Line break escape sequence (\R) documentation is incomplete

Description

The line break escape sequence (\R) seems to break certain UTF-8 characters, in this example ą.

The following code:

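<?php
$line = "Urządzenie Z-Wave nie odpowiedziało.";
// \R without the /u modifier: the subject is treated as raw bytes
echo preg_replace("/\R+/", "\n", $line), "\n";
// Explicit character class: only \n and \r are replaced
echo preg_replace("/[\n\r]+/", "\n", $line), "\n";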

Resulted in this output:

Urz�
dzenie Z-Wave nie odpowiedziało.
Urządzenie Z-Wave nie odpowiedziało.

But I expected this output instead:

Urządzenie Z-Wave nie odpowiedziało.
Urządzenie Z-Wave nie odpowiedziało.

PHP Version

8.3.17/8.4.4

Operating System

No response

Feb 24 '25 08:02 bobvandevijver

That's interesting.

ą is encoded as the bytes C4 85 in UTF-8. Without UTF mode, PCRE reads the byte 0x85 as U+0085, which is Next Line (NEL). Ref: https://www.compart.com/en/unicode/U+0085 So \R matches that byte and splits the character.

With the /u modifier, it works fine on UTF-8 input. Please see: https://3v4l.org/btQGs
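
For anyone skimming, here is a minimal sketch of the difference, not an official test case. The subject string is written with explicit byte escapes so the result does not depend on how the file is saved; the variable names are only illustrative.

```php
<?php
// Sketch: how \R treats the byte 0x85 with and without the /u modifier.
// "Urz" + C4 85 (UTF-8 for U+0105, a with ogonek) + "dzenie"
$line = "Urz\xC4\x85dzenie";

// Without /u the subject is handled as single bytes, so 0x85 counts as a line break.
$bytes = preg_replace('/\R+/', "\n", $line);
echo bin2hex($bytes), "\n"; // 55727ac40a647a656e6965 (0x85 replaced by 0x0a)

// With /u the subject is decoded as UTF-8, so the two-byte character stays intact.
$utf8 = preg_replace('/\R+/u', "\n", $line);
var_dump($utf8 === $line); // bool(true)
```

The stray C4 byte left behind in the first result is what shows up as the replacement character in the reported output.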

Feb 24 '25 09:02 youkidearitai

Looks like it has been broken since PHP 5.2.2, which introduced PCRE 7.0.

Feb 24 '25 10:02 alecpl

Anyway, this is not a PHP bug (and it does not seem to be a PCRE bug either), so I will close this.

Feb 24 '25 10:02 youkidearitai

Looks like a bug to me.

Feb 24 '25 11:02 alecpl

That is documented behavior:

Outside a character class, by default, the escape sequence \R matches any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent to the following:

(?>\r\n|\n|\x0b|\f|\r|\x85)
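
A quick way to check that equivalence from PHP, as a sketch only: it splits a small sample around each sequence with both patterns and compares the results. `$equivalent`, `$samples`, and `$results` are names made up for this example.

```php
<?php
// Sketch: compare \R with the documented 8-bit equivalent (no /u modifier).
// Single quotes keep the backslash escapes for PCRE to interpret.
$equivalent = '/(?>\r\n|\n|\x0b|\f|\r|\x85)/';

// One sample per alternative in the documented pattern.
$samples = ["\r\n", "\n", "\x0b", "\f", "\r", "\x85"];

$results = [];
foreach ($samples as $sample) {
    $subject = "a{$sample}b";
    $viaR = preg_split('/\R/', $subject);
    $viaEquivalent = preg_split($equivalent, $subject);
    // True when both patterns split the subject into exactly "a" and "b".
    $results[bin2hex($sample)] = ($viaR === $viaEquivalent) && ($viaR === ['a', 'b']);
}

var_dump($results);
```

Every entry should come out true in byte mode, including the one for 0x85, which is the case that bites UTF-8 subjects.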

Feb 24 '25 11:02 cmb69

Then maybe it should also be stated clearly at https://www.php.net/manual/en/regexp.reference.escape.php

Feb 24 '25 11:02 alecpl

I agree with @alecpl. The PHP documentation is somewhat limited here: there are effectively two modes (with and without UTF-8), and neither is clear from the current entry:

\R line break: matches \n, \r and \r\n
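
To illustrate what that entry leaves out, here is a rough sketch for the /u case. It assumes PHP's bundled PCRE2 with its default \R setting (any Unicode newline sequence); the labels and the `$candidates` list are just for this example.

```php
<?php
// Sketch: single code points checked against \R in UTF mode (/u).
$candidates = [
    'VT U+000B'         => "\u{000B}",
    'FF U+000C'         => "\u{000C}",
    'NEL U+0085'        => "\u{0085}",
    'LS U+2028'         => "\u{2028}",
    'PS U+2029'         => "\u{2029}",
    'a-ogonek U+0105'   => "\u{0105}",
];

$matches = [];
foreach ($candidates as $label => $char) {
    // \A and \z anchor to the very start and end of the subject.
    $matches[$label] = preg_match('/\A\R\z/u', $char) === 1;
}

var_dump($matches);
```

Under that assumption, everything except the last entry should match, which is a longer list than \n, \r and \r\n.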

Feb 24 '25 11:02 bobvandevijver

Agreed this is a documentation bug. Moving.

Feb 24 '25 18:02 ndossche