
Folder looks empty when it contains files with accented letters

Open bfg9000d opened this issue 10 years ago • 5 comments

When a folder contains any file with an accented letter in its name (like á, õ, ã), the whole folder shows up empty. I'm running CM from the devel branch using Python 3.

Also, this only happens when pure database lookup is off. When it's on, only the single file with accented letters won't show.

This is the error I get when running an update:

[151011-03:38] ERROR   : wrong encoding for filename 'teste/com acento/Nenhum de N\udcf3s - Astronauta de Marmore.mp3' (UnicodeEncodeError)
--- Logging error ---
Traceback (most recent call last):
  File "/home/server/github/cherrymusic/cherrymusicserver/sqlitecache.py", line 307, in register_file_with_db
    self.add_to_file_table(fileobj)
  File "/home/server/github/cherrymusic/cherrymusicserver/sqlitecache.py", line 316, in add_to_file_table
    cursor = self.conn.execute('INSERT INTO files (parent, filename, filetype, isdir) VALUES (?,?,?,?)', (fileobj.parent.uid if fileobj.parent else -1, fileobj.name, fileobj.ext, 1 if fileobj.isdir else 0))
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf3' in position 11: surrogates not allowed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.4/logging/__init__.py", line 980, in emit
    stream.write(msg)
UnicodeEncodeError: 'ascii' codec can't encode character '\udcf3' in position 297: ordinal not in range(128)
Call stack:
  File "/usr/lib/python3.4/threading.py", line 888, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.4/threading.py", line 920, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.4/threading.py", line 868, in run
    self._target(*self._args, **self._kwargs)
  File "/home/server/github/cherrymusic/cherrymusicserver/util.py", line 49, in wrapper
    result = func(*args, **kwargs)
  File "/home/server/github/cherrymusic/cherrymusicserver/sqlitecache.py", line 471, in full_update
    self.update_db_recursive(cherry.config['media.basedir'], skipfirst=True)
  File "/home/server/github/cherrymusic/cherrymusicserver/sqlitecache.py", line 540, in update_db_recursive
    self.register_file_with_db(item.infs)
  File "/home/server/github/cherrymusic/cherrymusicserver/sqlitecache.py", line 312, in register_file_with_db
    log.e(_("wrong encoding for filename '%s' (%s)"), fileobj.relpath, e.__class__.__name__)
  File "/home/server/github/cherrymusic/cherrymusicserver/log.py", line 126, in error
    _get_logger().error(msg, *args, **kwargs)
Message: "wrong encoding for filename '%s' (%s)"
Arguments: ('teste/com acento/Nenhum de N\udcf3s - Astronauta de Marmore.mp3', 'UnicodeEncodeError')
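
For anyone trying to reproduce this without the original files, here is a minimal sketch of both failures in the log above. It assumes the "ó" was stored on disk as the single byte 0xF3 (the real bytes were never posted in this thread), and that Python decoded the name with its `surrogateescape` error handler, which is what Python 3 does for file names on POSIX systems. `encode_error` is a made-up helper for illustration.

```python
# Assumption: the on-disk name holds "ó" as the single byte 0xF3.
raw_name = b"Nenhum de N\xf3s"

# On POSIX, Python 3 decodes file names with "surrogateescape": a byte that is
# invalid in the filesystem encoding becomes a lone surrogate (0xF3 -> U+DCF3).
name = raw_name.decode("utf-8", errors="surrogateescape")


def encode_error(text, encoding):
    """Return the message of the UnicodeEncodeError, or None on success."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError as exc:
        return str(exc)
    return None


# Strict UTF-8 (as in the failing SQL INSERT in the traceback) rejects the name.
print(encode_error(name, "utf-8"))
# An ASCII output stream (as in the logging traceback) rejects it as well.
print(encode_error(name, "ascii"))
```

The string itself is fine inside Python; it only fails once something tries to encode it strictly, which is why the error appears in the database insert and again in the logger.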

bfg9000d avatar Oct 11 '15 03:10 bfg9000d

Hello @bfg9000d,

thanks for your report. Currently, we don't have much time to work on CherryMusic. Additionally, the main focus right now is working on a complete rewrite of CherryMusic to fix some long-standing issues. So please be patient if this issue will not be fixed right away.

> Also, this only happens when pure database lookup is off.

The pure database lookup feature currently has no effect. No matter which boolean you choose, a pure database lookup is not performed (#149) -- CherryMusic always accesses the filesystem.

6arms1leg avatar Oct 11 '15 13:10 6arms1leg

Thank you for your quick response.

> please be patient if this issue will not be fixed right away.

Don't worry about it, I just came here to document the issue, as it took me a while to figure out why half of my folders seemed empty. Also, I would try to fix this (well, I'm kind of already trying), but I have no experience in Python whatsoever, so I probably won't get anywhere.

> The pure database lookup feature currently has no effect.

That is... strange... Even though it still accesses the filesystem, it doesn't glitch out the entire folder, only single files.

bfg9000d avatar Oct 12 '15 06:10 bfg9000d

Hey @bfg9000d,

I did a bit of investigation and it seems that your filesystem is not using Unicode to encode the file names. The character that cannot be decoded is not a valid Unicode character.

The character that cannot be correctly encoded is ó, which is 0xF3 in Latin-1 (ISO-8859-1), but in your example it is preceded by 0xDC, which, as mentioned before, is not a valid UTF-8 sequence either.

I have found some evidence that programs written in Haskell could mangle file names in this way, or your files are encoded in a completely different format that I am not aware of.

Assuming the file name was encoded in Latin-1 instead of some multibyte encoding, it would become Nenhum de NÜós instead of Nenhum de Nós... so I'm really out of ideas how the file could ever have such a name...

Can you please tell me the locale you're using? But I'm quite sure that somehow the file names are broken. This does not show in other programs because they don't need to care, but CM needs the file names to index the music collection, which is only possible if the names have a standard format.
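
A note on reading `\udcf3`: in a Python 3 string repr this is one code point, U+DCF3, not the two bytes 0xDC 0xF3. Code points U+DC80 to U+DCFF are what the `surrogateescape` error handler (PEP 383) produces for undecodable bytes 0x80 to 0xFF. So one possible reading, not confirmed in this thread, is that the name contains a bare 0xF3 byte. A minimal sketch of mapping such a string back to bytes (`escaped_bytes` is a hypothetical helper, and Latin-1 is only a guess):

```python
def escaped_bytes(text):
    """Return the raw byte values hidden in surrogateescape code points."""
    return [ord(ch) - 0xDC00 for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]


name = "Nenhum de N\udcf3s"  # as printed in the log

# The lone surrogate U+DCF3 stands for the single byte 0xF3.
hidden = escaped_bytes(name)

# Round-trip to the original bytes, then try Latin-1 as a guess.
raw = name.encode("utf-8", errors="surrogateescape")
guess = raw.decode("latin-1")

print([hex(b) for b in hidden], raw)
```

Under that reading, `guess` comes out with a plain ó and no extra Ü in front of it.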

devsnd avatar Nov 08 '15 15:11 devsnd

This weird encoding is caused by creating these files on Windows 7 and transferring them through FTP (through FileZilla) or the PHP file browser Kloudspeaker. The file names show up fine everywhere on the Windows side (FTP client, SMB) and in the PHP file browser. The only issue I have on the system is SSH showing only a ? in place of the accented letter.
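
If the names really arrived in a Windows single-byte code page, the usual cure is to rename them to UTF-8 on the server (the `convmv` tool exists for this). Below is a rough dry-run sketch of the idea in Python. `is_utf8` and `plan_renames` are hypothetical helpers, and CP1252 as the source encoding is an assumption, since the thread never establishes what the transfer actually wrote to disk.

```python
def is_utf8(raw):
    """Return True if the raw name decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def plan_renames(raw_names, assumed_encoding="cp1252"):
    """Map each non-UTF-8 raw name to a UTF-8 replacement (dry run only).

    Raises UnicodeDecodeError if a name contains bytes that are undefined
    in the assumed encoding, which is a hint that the guess is wrong.
    """
    return {
        raw: raw.decode(assumed_encoding).encode("utf-8")
        for raw in raw_names
        if not is_utf8(raw)
    }


# Example names as raw bytes; the second one is the assumed broken form.
names = [b"Astronauta de Marmore.mp3", b"Nenhum de N\xf3s.mp3"]
for old, new in plan_renames(names).items():
    print(old, "->", new)
```

Nothing is renamed here. Applying the plan would mean one `os.rename(old, new)` per entry on bytes paths, after checking the guessed names by eye.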

I'm running all this on a Raspberry Pi 2 with OSMC, which is a Debian-based distro with Kodi preconfigured.

Initially, my locale settings looked like this:

LANG=C
LANGUAGE=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=C

But I tried other settings (like pt_BR.UTF-8) and the issue still happens, with the same output.
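
What matters here is the encoding Python itself picked up at startup, which may differ from what the `locale` command prints in an interactive shell if the server is started from a service or init script. The `'ascii' codec` in the logging traceback suggests the process saw an ASCII locale. A small sketch to run in the same environment as the server (`encoding_report` is a made-up name):

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python uses for file names and text streams."""
    return {
        "filesystem": sys.getfilesystemencoding(),
        "preferred": locale.getpreferredencoding(False),
        "stdout": getattr(sys.stdout, "encoding", None),
        "stderr": getattr(sys.stderr, "encoding", None),
    }


for key, value in sorted(encoding_report().items()):
    print(key, "=", value)
```

If `filesystem` reports an ASCII variant, every non-ASCII byte in a file name ends up as an escaped surrogate, whatever encoding the name was written in.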

Thanks for looking into it!

bfg9000d avatar Nov 18 '15 06:11 bfg9000d

@devsnd I think you're on the right track. I'm arriving here from a future ticket (#642), and there, three UTF-8 bytes get an added \xdc. As I said over there: That looks suspiciously like UTF-16 low surrogates.

Is something out there writing badly encoded file names to people's disks? /* This is Python 3! We should be safe here! :scream_cat: */ We can probably work out a limited fix if we get our hands on some actual bytes. I asked @hank in that other issue already, he knows Python.

edit: Could it be UCS-2, a prehistoric Unicode forerunner used by early Windows before UTF-16, and used internally for Py_UNICODE? Not-quite-UTF-16 UCS-2, which apparently enjoys being treated as cp437?
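
For collecting the actual bytes: passing a `bytes` path to `os.listdir` makes it return raw `bytes` names with no decoding involved. A small sketch (all three helper names are made up for illustration) that lists a directory and hex-dumps every name that is not valid UTF-8:

```python
import os


def raw_names(directory):
    """List directory entries as raw bytes, bypassing filename decoding."""
    return sorted(os.listdir(os.fsencode(directory)))


def hex_dump(raw):
    """Render a raw name as space-separated hex byte values."""
    return " ".join("%02x" % byte for byte in raw)


def suspicious_names(directory):
    """Return (raw name, hex dump) pairs for names that are not valid UTF-8."""
    found = []
    for raw in raw_names(directory):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            found.append((raw, hex_dump(raw)))
    return found


# Scan the current directory; point this at the affected media folder instead.
for raw, dump in suspicious_names("."):
    print(raw, dump)
```

The hex dump of an affected name would show directly whether the accented letter is a single byte, a UTF-8 sequence, or something else.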

tilboerner avatar Oct 24 '16 21:10 tilboerner