Attachment (file)names are not correctly decoded
Describe the bug
In some cases, attachment (file)names are not correctly decoded and contain invalid characters. This happens for names encoded like this: ISO-8859-1''caf%E9.txt. Note that this is not the encoded-words syntax (btw, I cannot find the name of this encoding, do you know it?). The ISO-8859-1 charset is simply ignored.
Used config
'options' => [
    'decoder' => [
        'message' => 'iconv',
        'attachment' => 'iconv',
    ],
],
Code to Reproduce
$clientManager = new \Webklex\PHPIMAP\ClientManager();
$clientManager->setConfig([
    'options' => [
        'decoder' => [
            'message' => 'iconv',
            'attachment' => 'iconv',
        ],
    ],
]);

// Parse the raw email and print the decoded name of each attachment.
$email = file_get_contents(__DIR__ . '/email.txt');
$message = \Webklex\PHPIMAP\Message::fromString($email);
foreach ($message->getAttachments() as $attachment) {
    $name = $attachment->getName();
    echo "Attachment: {$name}\n";
}
You can find an example of problematic email: email.txt (generated with Gnome Evolution).
Expected behavior
The attachment name should be café.txt, but it is caf�.txt.
Desktop / Server (please complete the following information):
- OS: Docker image php:8.1-fpm (Debian I guess?)
- PHP: 8.1
- Version: 5.5.0
- Provider: Gnome Evolution
Additional context
I was able to spot the issue.
In Attachment::decodeName, you test whether $name contains the string '' and extract the "real" name from it, but you drop the encoding. In my example, ISO-8859-1''caf%E9.txt becomes caf%E9.txt.
A few lines later, you urldecode() the name. Unfortunately, in my case, %E9 is the ISO-8859-1 byte for the character é, whereas it would be %C3%A9 in UTF-8. This means that we still need to convert the string from ISO-8859-1 to UTF-8 with EncodingAliases::convert($name, $encoding) ($encoding being $parts[0] extracted earlier).
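To illustrate the idea outside of the library, here is a minimal standalone sketch. It assumes the name uses the charset'language'value form shown above (the extended parameter syntax described in RFC 2231 / RFC 5987). The function decodeExtendedValue is a hypothetical helper, not part of PHP-IMAP, and it uses PHP's built-in iconv() rather than EncodingAliases::convert().

```php
<?php
/**
 * Hypothetical helper (not part of PHP-IMAP): decode an extended
 * parameter value such as "ISO-8859-1''caf%E9.txt" into UTF-8.
 */
function decodeExtendedValue(string $value): string
{
    // Expected layout: charset'language'percent-encoded-bytes.
    $parts = explode("'", $value, 3);
    if (count($parts) !== 3) {
        // Not in the extended form: return the value untouched.
        return $value;
    }
    [$charset, , $encoded] = $parts;

    // Turn %XX sequences back into raw bytes (still in $charset).
    $bytes = rawurldecode($encoded);
    if ($charset === '' || strcasecmp($charset, 'UTF-8') === 0) {
        return $bytes;
    }

    // Convert from the declared charset instead of discarding it.
    $converted = @iconv($charset, 'UTF-8', $bytes);

    // Fall back to the raw bytes if the charset is unknown to iconv.
    return $converted === false ? $bytes : $converted;
}

echo decodeExtendedValue("ISO-8859-1''caf%E9.txt"), "\n"; // café.txt
```

The sketch uses rawurldecode() so that a literal + in a file name is not turned into a space. Whether the library should switch from urldecode() is a separate question.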
I had the same problem but with another config.
$clientManager->setConfig([
    'options' => [
        'decoder' => [
            'message' => 'utf-8',
            'attachment' => 'utf-8',
        ],
    ],
]);
My solution is to convert the name of the attachment like this:
echo mb_convert_encoding($attachment->getName(), 'UTF-8', 'ISO-8859-1');
I did something similar too. The problem with this solution is that we don't know the encoding of the original string. If it's not ISO-8859-1, we end up with the same issue (with the unsupported characters � being replaced by question marks, which may look nicer). This has to be done at the PHP-IMAP level to work properly. Or can we access the raw name (e.g. ISO-8859-1''caf%E9.txt) to extract the encoding ourselves?
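Until the raw name is exposed by the library, one possible stopgap is to look for it in the raw message source. The sketch below is only an assumption about how that could be done: findRawExtendedFilenames is a hypothetical helper, it scans the whole text (so it could also match inside a message body), and it does not handle quoted values or parameters split into continuations (filename*0*=, filename*1*=, ...).

```php
<?php
/**
 * Hypothetical helper (not part of PHP-IMAP): collect raw
 * "filename*=" / "name*=" values in charset'language'value form
 * from the raw source of an email.
 *
 * @return string[] e.g. ["ISO-8859-1''caf%E9.txt"]
 */
function findRawExtendedFilenames(string $rawEmail): array
{
    $pattern = '/\b(?:file)?name\*=([^\';\s]*\'[^\';\s]*\'[^;\s]+)/i';
    preg_match_all($pattern, $rawEmail, $matches);

    return $matches[1];
}

$sample = "Content-Disposition: attachment;\r\n"
    . " filename*=ISO-8859-1''caf%E9.txt\r\n";

foreach (findRawExtendedFilenames($sample) as $raw) {
    // Everything before the first apostrophe is the declared charset.
    $charset = strstr($raw, "'", true);
    echo "{$charset} => {$raw}\n";
}
```

The charset extracted this way could then be passed to mb_convert_encoding() instead of a hard-coded 'ISO-8859-1'.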
Side note: the issue does indeed also happen with the UTF-8 decoder. I've gone back to this decoder: the issues that I had with it were fixed after installing the PHP ldap extension. It would be worth a separate issue on GitHub, but I don't have much time these days. Don't hesitate to get back to me on this subject after the holidays :)