Tests fail when locale is not UTF-8
I am trying to build plexus-archiver v3.7.0 and v4.2.2 on the ppc64le platform, but I am hitting the following test failure:
[ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.102 s <<< FAILURE! - in org.codehaus.plexus.archiver.zip.ZipUnArchiverTest
[ERROR] testUnarchiveUtf8(org.codehaus.plexus.archiver.zip.ZipUnArchiverTest) Time elapsed: 0.021 s <<< FAILURE!
junit.framework.AssertionFailedError
at org.codehaus.plexus.archiver.zip.ZipUnArchiverTest.testUnarchiveUtf8(ZipUnArchiverTest.java:86)
This is a locale error: the test assumes an xx_YY.UTF-8 locale. What is your locale?
sh-4.2# locale
LANG=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
This will not work. We need to check file.encoding/Charset#defaultCharset() and skip such tests when the encoding is not UTF-8.
Can you disable this test and see whether the rest works?
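A minimal sketch of the kind of check suggested above, not code from Plexus Archiver. The class and method names (FileNameEncodingCheck, canEncodeAll, fileNameCharset) are hypothetical. It assumes that sun.jnu.encoding, an internal and undocumented JDK property, reflects the charset used for file names, and falls back to Charset.defaultCharset() otherwise. A test could call such a helper and skip itself when the result is false.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class FileNameEncodingCheck {

    /** Returns true if every name can be encoded in the given charset without loss. */
    static boolean canEncodeAll(Charset charset, String... names) {
        CharsetEncoder encoder = charset.newEncoder();
        for (String name : names) {
            if (!encoder.canEncode(name)) {
                return false;
            }
        }
        return true;
    }

    /**
     * Best guess at the charset the JVM uses for file names.
     * Assumption: sun.jnu.encoding is an internal property and may be absent.
     */
    static Charset fileNameCharset() {
        String jnu = System.getProperty("sun.jnu.encoding");
        if (jnu != null) {
            try {
                return Charset.forName(jnu);
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: fall through to the default.
            }
        }
        return Charset.defaultCharset();
    }

    public static void main(String[] args) {
        // File names from src/test/resources/miscUtf8, written with Unicode
        // escapes so the source compiles the same way under any locale.
        String[] names = {
            "aFileWithA#.html", "aPi\u00f1ata.txt", "an\u00fcmlaut.txt", "\u20acuro.txt"
        };
        Charset charset = fileNameCharset();
        if (canEncodeAll(charset, names)) {
            System.out.println("File name charset " + charset + " can represent the test files.");
        } else {
            System.out.println("WARNING: " + charset + " cannot represent the test files; skip UTF-8 tests.");
        }
        // US-ASCII is what a POSIX/C locale typically maps to.
        System.out.println("US-ASCII can encode all: " + canEncodeAll(StandardCharsets.US_ASCII, names));
    }
}
```

With a check like this the decision to skip is based on the actual file names used by the test rather than on the name of the locale.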
I can reproduce with LC_ALL=C mvn verify:
[ERROR] Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.104 s <<< FAILURE! - in org.codehaus.plexus.archiver.zip.ZipUnArchiverTest
[ERROR] testUnarchiveUtf8(org.codehaus.plexus.archiver.zip.ZipUnArchiverTest) Time elapsed: 0.021 s <<< FAILURE!
junit.framework.AssertionFailedError
at org.codehaus.plexus.archiver.zip.ZipUnArchiverTest.testUnarchiveUtf8(ZipUnArchiverTest.java:86)
on
$ LC_ALL=C mvn -v
Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T20:33:14+02:00)
Maven home: /usr/local/apache-maven-3.5.4
Java version: 1.8.0_242, vendor: Oracle Corporation, runtime: /usr/local/openjdk8/jre
Default locale: en_US, platform encoding: US-ASCII
OS name: "freebsd", version: "12.1-stable", arch: "amd64", family: "unix"
These tests need to be reviewed to check whether they work as intended:
$ grep -ri -E -e utf-?8 src/test/java/
src/test/java/org/codehaus/plexus/archiver/jar/DirectoryArchiverUnpackJarTest.java: archiver.addArchivedFileSet( afs, Charset.forName( "UTF-8" ) );
src/test/java/org/codehaus/plexus/archiver/tar/TarArchiverTest.java: File tmpDir = getTestFile( "src/test/resources/utf8" );
src/test/java/org/codehaus/plexus/archiver/zip/ConcurrentJarCreatorTest.java: zos.setEncoding( "UTF-8" );
src/test/java/org/codehaus/plexus/archiver/zip/PlexusIoZipFileResourceCollectionTest.java: final String manifest = IOUtils.toString( contents1, "UTF-8" );
src/test/java/org/codehaus/plexus/archiver/zip/ZipArchiverTest.java: zipArchiver.setEncoding( "UTF-8" );
src/test/java/org/codehaus/plexus/archiver/zip/ZipArchiverTest.java: zipArchiver2.setEncoding( "UTF-8" );
src/test/java/org/codehaus/plexus/archiver/zip/ZipArchiverTest.java: zipArchive.setEncoding( "UTF-8" );
src/test/java/org/codehaus/plexus/archiver/zip/ZipArchiverTest.java: public void testDefaultUTF8()
src/test/java/org/codehaus/plexus/archiver/zip/ZipArchiverTest.java: final ZipArchiver zipArchiver = getZipArchiver( new File( "target/output/utf8-default.zip" ) );
src/test/java/org/codehaus/plexus/archiver/zip/ZipArchiverTest.java: zipArchiver.addDirectory( new File( "src/test/resources/miscUtf8" ) );
src/test/java/org/codehaus/plexus/archiver/zip/ZipArchiverTest.java: public void testDefaultUTF8withUTF8()
src/test/java/org/codehaus/plexus/archiver/zip/ZipArchiverTest.java: final ZipArchiver zipArchiver = getZipArchiver( new File( "target/output/utf8-with_utf.zip" ) );
src/test/java/org/codehaus/plexus/archiver/zip/ZipArchiverTest.java: zipArchiver.setEncoding( "UTF-8" );
src/test/java/org/codehaus/plexus/archiver/zip/ZipArchiverTest.java: zipArchiver.addDirectory( new File( "src/test/resources/miscUtf8" ) );
src/test/java/org/codehaus/plexus/archiver/zip/ZipUnArchiverTest.java: public void testUnarchiveUtf8()
src/test/java/org/codehaus/plexus/archiver/zip/ZipUnArchiverTest.java: File dest = new File( "target/output/unzip/utf8" );
src/test/java/org/codehaus/plexus/archiver/zip/ZipUnArchiverTest.java: final File zipFile = new File( "target/output/unzip/utf8-default.zip" );
src/test/java/org/codehaus/plexus/archiver/zip/ZipUnArchiverTest.java: zipArchiver.addDirectory( new File( "src/test/resources/miscUtf8" ) );
I set the locale using export LC_ALL=en_US.UTF-8, and after that the build was successful.
[INFO] --- maven-install-plugin:2.5.2:install (default-install) @ plexus-archiver ---
[INFO] Installing /root/plexus-archiver-master/target/plexus-archiver-4.2.3-SNAPSHOT.jar to /root/.m2/repository/org/codehaus/plexus/plexus-archiver/4.2.3-SNAPSHOT/plexus-archiver-4.2.3-SNAPSHOT.jar
[INFO] Installing /root/plexus-archiver-master/pom.xml to /root/.m2/repository/org/codehaus/plexus/plexus-archiver/4.2.3-SNAPSHOT/plexus-archiver-4.2.3-SNAPSHOT.pom
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 41.100 s
[INFO] Finished at: 2020-04-15T13:41:29Z
[INFO] ------------------------------------------------------------------------
Hi @sarveshtamba, this is a very interesting issue. Thanks for reporting it. It seems that there is a bug in Plexus Archiver when it works with files whose names contain Unicode characters and the system locale is not UTF-8.
I've set my locale to POSIX to reproduce your environment. Here is the behavior on my system. Would you please confirm that you see the same on yours?
When I clone Plexus Archiver the files are cloned:
$ ls src/test/resources/miscUtf8/
aFileWithA#.html
'aPi'$'\303\261''ata.txt'
'an'$'\303\274''mlaut.txt'
''$'\342\202\254''uro.txt'
Although ls shows the file names with escape sequences, the files are there and their names are the same as in the repository. When I run the Maven build, the files with special characters are missing from the output directory:
$ mvn clean verify
$ ls target/output/unzip/utf8
aFileWithA#.html
The generated zip file also includes a single file:
$ unzip -l target/output/unzip/utf8-default.zip
Archive: target/output/unzip/utf8-default.zip
Length Date Time Name
--------- ---------- ----- ----
20 2020-04-17 10:06 aFileWithA#.html
--------- -------
20 1 file
If I use zip and unzip everything works as expected - all files are compressed and with the correct names. So this is not some limitation of the system or ZIP itself. It is a defect in Plexus Archiver.
When I change the locale to en_US.UTF-8 Plexus Archiver also behaves as expected:
$ LC_ALL=en_US.UTF-8 mvn verify
$ ls target/output/unzip/utf8
aFileWithA#.html
'aPi'$'\303\261''ata.txt'
'an'$'\303\274''mlaut.txt'
''$'\342\202\254''uro.txt'
$ unzip -l target/output/unzip/utf8-default.zip
Archive: target/output/unzip/utf8-default.zip
Length Date Time Name
--------- ---------- ----- ----
31 2020-04-17 10:06 €uro.txt
20 2020-04-17 10:06 aFileWithA#.html
39 2020-04-17 10:06 anümlaut.txt
29 2020-04-17 10:06 aPiñata.txt
--------- -------
119 4 files
P.S. I'm testing on Ubuntu with an ext4 file system and the locale set to POSIX, but I would expect the behavior to be the same on other POSIX/Unix-like systems.
The problem is that Java cannot properly map bytes to characters when the encoding is wrong. Unix filesystems are not charset aware. They simply store bytes, not codepoints.
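To illustrate the point about bytes versus characters, here is a small sketch. The class name LossyDecodeDemo and the method survivesRoundTrip are hypothetical. It takes the raw bytes of the file name aPiñata.txt as stored on disk (the \303\261 shown by ls is 0xC3 0xB1, the UTF-8 encoding of ñ) and checks whether decoding and re-encoding them with a given charset gives back the same bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, for illustration only.
final class LossyDecodeDemo {

    /** Decodes raw bytes with the charset, re-encodes, and compares with the input. */
    static boolean survivesRoundTrip(byte[] raw, Charset charset) {
        // Malformed input is replaced during decoding, so information can be lost here.
        String decoded = new String(raw, charset);
        return Arrays.equals(raw, decoded.getBytes(charset));
    }

    public static void main(String[] args) {
        // "aPi" + 0xC3 0xB1 + "ata.txt": the bytes a Unix filesystem stores for the name.
        byte[] raw = {
            'a', 'P', 'i', (byte) 0xC3, (byte) 0xB1, 'a', 't', 'a', '.', 't', 'x', 't'
        };
        System.out.println("UTF-8 round trip intact:    "
                + survivesRoundTrip(raw, StandardCharsets.UTF_8));
        System.out.println("US-ASCII round trip intact: "
                + survivesRoundTrip(raw, StandardCharsets.US_ASCII));
    }
}
```

Under US-ASCII the two non-ASCII bytes cannot be decoded, so the name Java ends up with no longer corresponds to the bytes on disk.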
Thanks for the input, @michael-o and @plamentotev. Do you still want me to verify this?
@michael-o thanks for the tip. It really looks like the character encoding is the problem. Still, it looks like Java can work with such files as expected if Path and URI are used, because the URI has the bytes properly escaped. Since Java 7 is now required, maybe we can look into those "new" APIs in order to better support use cases like the one reported here.
@sarveshtamba thanks. I think I understood where the issue is, so no need to verify it.
@plamentotev @michael-o thanks for the inputs.
This is not an issue in plexus-archiver; it is how Java works. Java uses the locale from the operating system. If the OS is configured with a non-UTF-8 locale, Java will use that, and not even the new Java 7 APIs will help here:
Accented or extended UTF-8 characters cause "Malformed input or input contains unmappable characters" error.
java.nio.file.InvalidPathException: Malformed input or input contains unmappable characters: target/piñata.txt
at java.base/sun.nio.fs.UnixPath.encode(UnixPath.java:145)
at java.base/sun.nio.fs.UnixPath.<init>(UnixPath.java:69)
at java.base/sun.nio.fs.UnixFileSystem.getPath(UnixFileSystem.java:280)
at java.base/java.io.File.toPath(File.java:2290)
Java 11 does not support setting sun.jnu.encoding to UTF-8 on the command line to make it use UTF-8 for encoding file paths. The setting is silently ignored and has no effect.
So the only possible and real solution is to use the correct locale, at least C.UTF-8.
LANG=C.UTF-8 mvn verify
So, as this is a Java thing, even trying to clean the project with LANG=C mvn clean will fail if the target directory contains a UTF-8 encoded filename.
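The InvalidPathException shown above can be probed for directly. The following is only a sketch, with a hypothetical class PathProbe and method isRepresentableAsPath. It asks the default file system to build a Path for a name and reports whether that succeeds. The result for the non-ASCII names depends on the locale the JVM was started with and on the platform, so no output is claimed for them here.

```java
import java.nio.file.InvalidPathException;
import java.nio.file.Paths;

// Hypothetical probe class, for illustration only.
final class PathProbe {

    /** Returns true if the default file system accepts the name as a path. */
    static boolean isRepresentableAsPath(String name) {
        try {
            Paths.get(name);
            return true;
        } catch (InvalidPathException e) {
            // Thrown, for example, when the name contains characters that
            // cannot be encoded for the file system under the current locale.
            return false;
        }
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source independent of the compiler's encoding.
        String[] names = {
            "aFileWithA#.html", "aPi\u00f1ata.txt", "an\u00fcmlaut.txt", "\u20acuro.txt"
        };
        for (String name : names) {
            // Only ASCII is printed, so the output is readable under a POSIX locale too.
            System.out.println(name.length() + " chars, representable: "
                    + isRepresentableAsPath(name));
        }
    }
}
```

Running it once with LANG=C and once with LANG=C.UTF-8 is one way to compare the two configurations on a given machine.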
UnixPath operates on raw bytes, WindowsPath on wchar_t. I still think those tests need to be skipped with a warning.
Well, showing a warning message that a UTF-8 locale is required works, but it just hides the real issue.
Correct, but unfortunately I don't see a better portable way on POSIX-like systems.