fix unicode content parsing
fixed exception in parsing symbols like em dash or cyrillic letters. Tests are included
Your test compiles fine without applying the patch.
Sorry, I forgot explain when it happens. If you have string overloading with this settings: mbstring.internal_encoding = UTF-8 mbstring.func_overload = 2
then you have exception on "parse error: failed at `content: '—';". Specifically, this row mbstring.internal_encoding = UTF-8 causes parse error. With only mbstring.func_overload = 2 parsing goes well.
The mbstring.func_overload ini setting can't be changed at run-time; also, I've run into systems where the mbstring extension isn't enabled.
How about prefixing strlen() and substr() calls with '_', and then adding these helper functions?
// mb_orig_strlen() and mb_orig_substr() only exist when mbstring.func_overload & 2 is true and the mbstring extension is loaded
if (function_exists('mb_orig_strlen')) {
function _strlen($s) {
return mb_orig_strlen($s);
}
function _substr() {
return call_user_func_array('mb_orig_substr', func_get_args());
}
} else {
function _strlen($s) {
return strlen($s);
}
function _substr() {
return call_user_func_array('substr', func_get_args());
}
}
The language itself doesn't use unicode so the parser doesn't need to be aware of it (no unicode keywords or anything). The buffer can be treated as a string of 8 bit chars and still have unicode characters pass through it fine. I'm guessing the problem is that mbstring.func_overload causes the string functions to work differently than the $buffer[x] character accessor which causes a mismatch and the parser to fail?
Does changing the ini setting with ini_set just for the parse work? If that doesn't work then another option is to replace all instances of $this->buffer[] with substr($this->buffer, ...). There's probably a noticeable performance hit to this change though.
Does changing the ini setting with ini_set just for the parse work?
Yes, you are right.. Changing compile method on this
public function compile($string, $name = null) {
$locale = setlocale(LC_NUMERIC, 0);
setlocale(LC_NUMERIC, "C");
$internal_encoding = NULL;
// only if mbstring installed
if (function_exists('mb_orig_strlen')) {
$internal_encoding = ini_get('mbstring.internal_encoding');
if(!is_null($internal_encoding)){
ini_set('mbstring.internal_encoding', NULL);
}
}
$this->parser = $this->makeParser($name);
$root = $this->parser->parse($string);
$this->env = null;
$this->scope = null;
$this->formatter = $this->newFormatter();
if (!empty($this->registeredVars)) {
$this->injectVariables($this->registeredVars);
}
$this->sourceParser = $this->parser; // used for error messages
$this->compileBlock($root);
ob_start();
$this->formatter->block($this->scope);
$out = ob_get_clean();
setlocale(LC_NUMERIC, $locale);
if(!is_null($internal_encoding)){
ini_set('mbstring.internal_encoding', $internal_encoding);
}
return $out;
}
helps and parsing is going well even if mbstring.internal_encoding = UTF-8 is set in php.ini or .htaccess
Thanks for fix.