tikxml icon indicating copy to clipboard operation
tikxml copied to clipboard

Support for iso-8859-1

Open qLag opened this issue 5 years ago • 7 comments

I try to parse an XML that comes from an iso-8859-1 API (somes strings have french accents).

Unfortunately, tikXml seems only to work with UTF-8.

I tried to use a TypeConverter :

`class StringUT8Converter : TypeConverter<String> {

override fun read(value: String): String {
    return String(value.toByteArray(Charsets.ISO_8859_1), Charsets.UTF_8)
}

override fun write(value: String): String {
    return String(value.toByteArray(Charsets.UTF_8), Charsets.ISO_8859_1)

}

} ` but it doesn't work.

Do you think you can include other encodings than UTF-8 (for poor old webservices 😝 )?

Thx

qLag avatar Nov 25 '20 17:11 qLag

I didn't try but is it not possible to use buffer.read(_index_, this.charset) instead of buffer.readUtf!(_index_) in all the file XmlReader (with a default charset = Charsets.UTF_8)

And give the the possibility to define a custom Charset with TikXmlConfig that will be used in TikXml.java : XmlReader reader = XmlReader.of(source, config.charset);

qLag avatar Nov 25 '20 18:11 qLag

@qLag It is possible, Okio's API allows you to provide a charset with Buffer#readString(), Buffer#writeString(), and ByteString#encodeString()

I think the only issue is skipping the leading BOM for each charset. This is the current implementation.

private int nextNonWhitespace(boolean throwOnEof, boolean isDocumentBeginning) throws IOException {
  // Look for UTF-8 BOM sequence 0xEFBBBF and skip it
  if (isDocumentBeginning && source.rangeEquals(0, UTF8_BOM)) {
    source.skip(3);
  }
  ...
}

Not sure if this is the most optimal way to support skipping the BOM for each charset, but here's how OkHttp does it for several UTF charsets. https://github.com/square/okhttp/blob/3f946d0b13534bcd1662e58624b0fc5816d1cc14/okhttp/src/main/java/okhttp3/internal/Util.kt#L255-L265

Edit: FWIW, Moshi doesn't skip the BOM, you have to detect it and skip it yourself before handing the stream to Moshi. Perhaps that is another avenue of approach.

reline avatar Nov 25 '20 22:11 reline

I made a draft here #150, needs unit tests but I went ahead and started the leg work.

reline avatar Nov 26 '20 22:11 reline

Hi reline, Thank for your support :) I will please to test your feature when it gets ready 👍

qLag avatar Nov 30 '20 07:11 qLag

@qLag In the meantime you can always build a snapshot off of that branch if it's urgent and meets your needs. I'd like to get more feedback from the maintainers now.

reline avatar Dec 03 '20 00:12 reline

Hi reline,

I tried your draft using this line in Gradle : implementation 'com.github.reline:tikxml:iso-8859-1-SNAPSHOT'

And this in my code : val tikXml = TikXml.Builder() .charset(Charsets.ISO_8859_1) .exceptionOnUnreadXml(false) .build()

And... it works great ! 👍 😊 🎉 I needed to add these lines too in my build.gralde to make it work : packagingOptions { exclude 'META-INF/gradle/incremental.annotation.processors' }

Its a really good new. How can we proceed now to be included in Tickaroo/tikXML ? Thanks again :)

Qlag

qLag avatar Dec 16 '20 20:12 qLag

@qLag Glad that worked for you!

I updated the PR with some unit tests, only significant difference I made was fixing the XML declaration when writing in charsets other than UTF-8.

- XML_DECLARATION = ByteString.encodeUtf8("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
+ XML_DECLARATION = ByteString.encodeString("<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>", charset);

Is anyone available to review it? @sockeqwe @Bodo1981

reline avatar Dec 22 '20 21:12 reline