plexus-archiver Tests fail when locale is not UTF-8

I am trying to build plexus-archiver v3.7.0 and v4.2.2 on the ppc64le platform, but the following test fails:

[ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.102 s <<< FAILURE! - in org.codehaus.plexus.archiver.zip.ZipUnArchiverTest
[ERROR] testUnarchiveUtf8(org.codehaus.plexus.archiver.zip.ZipUnArchiverTest)  Time elapsed: 0.021 s  <<< FAILURE!
junit.framework.AssertionFailedError
        at org.codehaus.plexus.archiver.zip.ZipUnArchiverTest.testUnarchiveUtf8(ZipUnArchiverTest.java:86)

Apr 15 '20 12:04 sarveshtamba

This is a locale error. It assumes you use xx_YY.UTF-8. What is your locale?

Apr 15 '20 13:04 michael-o

sh-4.2# locale
LANG=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

Apr 15 '20 13:04 sarveshtamba

This will not work. We need to test for file.encoding/Charset#defaultCharset() and skip such tests.

Can you disable this test and see whether the rest works?
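A minimal sketch of such a guard, assuming the check is simply "can the platform charset encode the file names the test uses". The class `Utf8NameGuard` and the method `canEncodeAll` are hypothetical names for illustration and are not part of plexus-archiver; the file names are the ones under src/test/resources/miscUtf8. Note that on newer JDKs the default charset may no longer follow the locale, so this is only one possible heuristic.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of plexus-archiver.
final class Utf8NameGuard {

    // File names from src/test/resources/miscUtf8, written with Unicode escapes
    // so the source compiles the same way whatever the compiler's input encoding is.
    static final String[] TEST_NAMES = {
        "aFileWithA#.html", "aPi\u00f1ata.txt", "an\u00fcmlaut.txt", "\u20acuro.txt"
    };

    // Returns true only if every name can be encoded in the given charset.
    static boolean canEncodeAll(Charset charset, String... names) {
        for (String name : names) {
            // A fresh encoder per name avoids carrying state between checks.
            if (!charset.newEncoder().canEncode(name)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Charset platform = Charset.defaultCharset();
        if (canEncodeAll(platform, TEST_NAMES)) {
            System.out.println("Platform charset " + platform + " can encode the test names: run the test");
        } else {
            System.out.println("WARNING: platform charset " + platform
                    + " cannot encode the test names: skip the test");
        }
    }
}
```

In a real test the `else` branch would be replaced by whatever skip mechanism the test framework offers; the sketch only prints the decision.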

Apr 15 '20 13:04 michael-o

I can reproduce with LC_ALL=C mvn verify:

[ERROR] Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.104 s <<< FAILURE! - in org.codehaus.plexus.archiver.zip.ZipUnArchiverTest
[ERROR] testUnarchiveUtf8(org.codehaus.plexus.archiver.zip.ZipUnArchiverTest)  Time elapsed: 0.021 s  <<< FAILURE!
junit.framework.AssertionFailedError
        at org.codehaus.plexus.archiver.zip.ZipUnArchiverTest.testUnarchiveUtf8(ZipUnArchiverTest.java:86)

on

$ LC_ALL=C mvn -v
Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T20:33:14+02:00)
Maven home: /usr/local/apache-maven-3.5.4
Java version: 1.8.0_242, vendor: Oracle Corporation, runtime: /usr/local/openjdk8/jre
Default locale: en_US, platform encoding: US-ASCII
OS name: "freebsd", version: "12.1-stable", arch: "amd64", family: "unix"

Apr 15 '20 13:04 michael-o

These tests need to be reviewed to check whether they work as intended:

$ grep -ri -E -e utf-?8 src/test/java/
src/test/java/org/codehaus/plexus/archiver/jar/DirectoryArchiverUnpackJarTest.java:        archiver.addArchivedFileSet( afs, Charset.forName( "UTF-8" ) );
src/test/java/org/codehaus/plexus/archiver/tar/TarArchiverTest.java:        File tmpDir = getTestFile( "src/test/resources/utf8" );
src/test/java/org/codehaus/plexus/archiver/zip/ConcurrentJarCreatorTest.java:        zos.setEncoding( "UTF-8" );
src/test/java/org/codehaus/plexus/archiver/zip/PlexusIoZipFileResourceCollectionTest.java:                final String manifest = IOUtils.toString( contents1, "UTF-8" );
src/test/java/org/codehaus/plexus/archiver/zip/ZipArchiverTest.java:        zipArchiver.setEncoding( "UTF-8" );
src/test/java/org/codehaus/plexus/archiver/zip/ZipArchiverTest.java:        zipArchiver2.setEncoding( "UTF-8" );
src/test/java/org/codehaus/plexus/archiver/zip/ZipArchiverTest.java:        zipArchive.setEncoding( "UTF-8" );
src/test/java/org/codehaus/plexus/archiver/zip/ZipArchiverTest.java:    public void testDefaultUTF8()
src/test/java/org/codehaus/plexus/archiver/zip/ZipArchiverTest.java:        final ZipArchiver zipArchiver = getZipArchiver( new File( "target/output/utf8-default.zip" ) );
src/test/java/org/codehaus/plexus/archiver/zip/ZipArchiverTest.java:        zipArchiver.addDirectory( new File( "src/test/resources/miscUtf8" ) );
src/test/java/org/codehaus/plexus/archiver/zip/ZipArchiverTest.java:    public void testDefaultUTF8withUTF8()
src/test/java/org/codehaus/plexus/archiver/zip/ZipArchiverTest.java:        final ZipArchiver zipArchiver = getZipArchiver( new File( "target/output/utf8-with_utf.zip" ) );
src/test/java/org/codehaus/plexus/archiver/zip/ZipArchiverTest.java:        zipArchiver.setEncoding( "UTF-8" );
src/test/java/org/codehaus/plexus/archiver/zip/ZipArchiverTest.java:        zipArchiver.addDirectory( new File( "src/test/resources/miscUtf8" ) );
src/test/java/org/codehaus/plexus/archiver/zip/ZipUnArchiverTest.java:    public void testUnarchiveUtf8()
src/test/java/org/codehaus/plexus/archiver/zip/ZipUnArchiverTest.java:        File dest = new File( "target/output/unzip/utf8" );
src/test/java/org/codehaus/plexus/archiver/zip/ZipUnArchiverTest.java:        final File zipFile = new File( "target/output/unzip/utf8-default.zip" );
src/test/java/org/codehaus/plexus/archiver/zip/ZipUnArchiverTest.java:        zipArchiver.addDirectory( new File( "src/test/resources/miscUtf8" ) );

Apr 15 '20 14:04 michael-o

I set the locale using export LC_ALL=en_US.UTF-8, and after that the build was successful.

[INFO] --- maven-install-plugin:2.5.2:install (default-install) @ plexus-archiver ---
[INFO] Installing /root/plexus-archiver-master/target/plexus-archiver-4.2.3-SNAPSHOT.jar to /root/.m2/repository/org/codehaus/plexus/plexus-archiver/4.2.3-SNAPSHOT/plexus-archiver-4.2.3-SNAPSHOT.jar
[INFO] Installing /root/plexus-archiver-master/pom.xml to /root/.m2/repository/org/codehaus/plexus/plexus-archiver/4.2.3-SNAPSHOT/plexus-archiver-4.2.3-SNAPSHOT.pom
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  41.100 s
[INFO] Finished at: 2020-04-15T13:41:29Z
[INFO] ------------------------------------------------------------------------

Apr 16 '20 03:04 sarveshtamba

Hi @sarveshtamba, this is a very interesting issue. Thanks for reporting it. It seems that there is a bug in Plexus Archiver when working with files whose names contain Unicode characters while the system locale is not UTF-8.

I've set my locale to POSIX to reproduce your environment. Here is the behavior on my system; could you please confirm that you see the same on yours?

When I clone Plexus Archiver, the test files are checked out:

$ ls src/test/resources/miscUtf8/
aFileWithA#.html
'aPi'$'\303\261''ata.txt'
'an'$'\303\274''mlaut.txt'
''$'\342\202\254''uro.txt'

Although ls shows the file names with escape sequences, the files are there and their names are the same as in the repository. When I run the Maven build, the files with special characters are missing from the output directory:

$ mvn clean verify
$ ls target/output/unzip/utf8
aFileWithA#.html

The generated zip file also contains only a single file:

$ unzip -l target/output/unzip/utf8-default.zip
Archive:  target/output/unzip/utf8-default.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
       20  2020-04-17 10:06   aFileWithA#.html
---------                     -------
       20                     1 file

If I use zip and unzip, everything works as expected: all files are compressed and have the correct names. So this is not a limitation of the system or of ZIP itself. It is a defect in Plexus Archiver.

When I change the locale to en_US.UTF-8, Plexus Archiver also behaves as expected:

$ LC_ALL=en_US.UTF-8 mvn verify
$ ls target/output/unzip/utf8
aFileWithA#.html
'aPi'$'\303\261''ata.txt'
'an'$'\303\274''mlaut.txt'
''$'\342\202\254''uro.txt'
$ unzip -l target/output/unzip/utf8-default.zip
Archive:  target/output/unzip/utf8-default.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
       31  2020-04-17 10:06   €uro.txt
       20  2020-04-17 10:06   aFileWithA#.html
       39  2020-04-17 10:06   anümlaut.txt
       29  2020-04-17 10:06   aPiñata.txt
---------                     -------
      119                     4 files

P.S. I'm testing on Ubuntu with an ext4 file system and the locale set to POSIX, but I would expect the behavior to be the same on other POSIX/Unix-like systems.

Apr 17 '20 07:04 plamentotev

The problem is that Java cannot properly map bytes to characters when the encoding is wrong. Unix filesystems are not charset aware. They simply store bytes, not code points.
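The effect can be illustrated without touching the file system. The sketch below takes the UTF-8 bytes of a file name (what a Unix file system would store), decodes them with an assumed charset, and encodes the result back. The class `ByteNameRoundTrip` and its method are hypothetical names used only for this illustration, and the model is deliberately simplified compared to what the JVM actually does with file names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration, not code from plexus-archiver or the JDK internals.
final class ByteNameRoundTrip {

    // Simulates a file name stored on disk as UTF-8 bytes and read by a JVM
    // that assumes a different charset for file names.
    static boolean survivesRoundTrip(String name, Charset assumed) {
        byte[] onDisk = name.getBytes(StandardCharsets.UTF_8); // bytes the file system stores
        String seenByJava = new String(onDisk, assumed);       // undecodable bytes become U+FFFD
        byte[] writtenBack = seenByJava.getBytes(assumed);     // unmappable characters become '?'
        return Arrays.equals(onDisk, writtenBack);
    }

    public static void main(String[] args) {
        String name = "aPi\u00f1ata.txt"; // Unicode escape keeps the source ASCII-only
        System.out.println("UTF-8:    " + survivesRoundTrip(name, StandardCharsets.UTF_8));
        System.out.println("US-ASCII: " + survivesRoundTrip(name, StandardCharsets.US_ASCII));
    }
}
```

With US-ASCII as the assumed charset, the two bytes of the encoded character cannot be decoded, so the original byte sequence, and with it the original file name, is lost.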

Apr 17 '20 08:04 michael-o

Thanks for the input, @michael-o and @plamentotev. Do you still want me to verify this?

Apr 17 '20 08:04 sarveshtamba

@michael-o thanks for the tip. It really does look like the character encoding is the problem. Still, it looks like Java can work with such files as expected if Path and URI are used. The URI has the bytes properly escaped. Now that Java 7 is required, maybe we can look into those "new" APIs in order to better support use cases such as the one reported here.

@sarveshtamba thanks. I think I understood where the issue is, so no need to verify it.

Apr 17 '20 14:04 plamentotev

@plamentotev @michael-o thanks for the inputs.

Apr 17 '20 14:04 sarveshtamba

This is not an issue in plexus-archiver; it is how Java works. Java takes the locale from the operating system, so if the OS is configured with a non-UTF-8 locale, Java will use that, and not even the new Java 7 APIs will help here:

Accented or extended UTF-8 characters cause "Malformed input or input contains unmappable characters" error.

java.nio.file.InvalidPathException: Malformed input or input contains unmappable characters: target/piñata.txt
        at java.base/sun.nio.fs.UnixPath.encode(UnixPath.java:145)
        at java.base/sun.nio.fs.UnixPath.<init>(UnixPath.java:69)
        at java.base/sun.nio.fs.UnixFileSystem.getPath(UnixFileSystem.java:280)
        at java.base/java.io.File.toPath(File.java:2290)
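Since the failure surfaces as an InvalidPathException, one conceivable way to detect the situation up front is to probe whether the default file system accepts a given name. This is only a sketch: `PathProbe` and `canRepresent` are hypothetical names, and the result depends on the JVM version and the locale it was started with.

```java
import java.nio.file.InvalidPathException;
import java.nio.file.Paths;

// Hypothetical probe, not part of plexus-archiver.
final class PathProbe {

    // Returns true if the default file system can turn the name into a Path.
    // Under a non-UTF-8 locale this is expected to be false for names such as
    // the one in the stack trace above, but that depends on the runtime.
    static boolean canRepresent(String fileName) {
        try {
            Paths.get(fileName);
            return true;
        } catch (InvalidPathException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file ASCII-only.
        String[] names = { "aFileWithA#.html", "aPi\u00f1ata.txt", "\u20acuro.txt" };
        for (String name : names) {
            System.out.println(canRepresent(name) + "  " + name);
        }
    }
}
```

No file is created or read; the probe only exercises the conversion from a String to a Path.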

Java 11 won't support setting sun.jnu.encoding to UTF-8 via the command line to use UTF-8 for encoding file paths. It silently ignores the setting, which has no effect.

So the only real solution is to use a UTF-8 locale, at least C.UTF-8:

LANG=C.UTF-8 mvn verify

So, as this is a Java thing, even trying to clean the project with LANG=C mvn clean will fail if the target directory contains a UTF-8 encoded filename.

Sep 09 '21 17:09 jorsol

UnixPath operates on raw bytes, WindowsPath on wchar_t. I still think those tests need to be skipped with a warning.

Sep 09 '21 18:09 michael-o

Well, just showing a warning message that a UTF-8 locale is required works, but it only hides the real issue.

Sep 09 '21 18:09 jorsol

Correct, but unfortunately I don't see a better portable way on POSIX-like systems.

Sep 09 '21 18:09 michael-o