tibiadata-api-go

fix: character issues with umlauts

Open Skyliife opened this issue 9 months ago • 2 comments

link to issue: https://github.com/TibiaData/tibiadata-api-go/issues/470

  • Wrap incoming HTML in charset.NewReader before goquery parsing
  • Ensures ISO‑8859‑1 (and other legacy) input is normalized to UTF‑8
  • Prevents “mojibake” (e.g. “Ã¤” instead of “ä”)
  • Updated TestWorldAntica to simulate Latin‑1 input and verify correct Umlaut decoding
  • Added Antica.html for parsing character Näurin

Closes #470

Skyliife avatar Apr 18 '25 12:04 Skyliife

@tobiasehlert I’ve updated the HTML collector to use charset.NewReader with the real Content-Type header instead of our custom converter, so incoming pages should now be normalized to proper UTF‑8 and preserve Umlauts (e.g. “Näurin”). I’m not super familiar with all the Go idioms here, so I’d really appreciate it if someone could double-check my changes.
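
For anyone reviewing, the idea of letting the Content-Type header drive the decoding can be sketched with the standard library alone. This is only an approximation under stated assumptions: `decodeBody` is a hypothetical name, it supports just UTF‑8 and ISO‑8859‑1, and it assumes UTF‑8 when no charset parameter is present. charset.NewReader in golang.org/x/net/html/charset covers many more encodings and can also inspect the content itself.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// decodeBody is a hypothetical helper (not from the PR). It picks the
// decoding from the charset parameter of the Content-Type header.
func decodeBody(body []byte, contentType string) (string, error) {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return "", fmt.Errorf("parse Content-Type: %w", err)
	}
	switch strings.ToLower(params["charset"]) {
	case "", "utf-8":
		// Assumption: treat a missing charset as UTF-8 and pass it through.
		return string(body), nil
	case "iso-8859-1", "latin1":
		// Each Latin-1 byte maps to the code point of the same value.
		runes := make([]rune, len(body))
		for i, c := range body {
			runes[i] = rune(c)
		}
		return string(runes), nil
	default:
		return "", fmt.Errorf("unsupported charset %q", params["charset"])
	}
}

func main() {
	utf8Page, _ := decodeBody([]byte("Näurin"), "text/html; charset=utf-8")
	latin1Page, _ := decodeBody([]byte{'N', 0xE4, 'u', 'r', 'i', 'n'}, "text/html; charset=ISO-8859-1")
	fmt.Println(utf8Page, latin1Page) // Näurin Näurin
}
```

Because the decision follows the header, the same code keeps working if the server switches between encodings, which a hard-coded conversion does not.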

Skyliife avatar Apr 18 '25 14:04 Skyliife

List of some umlaut-characters:

  • Näurin
  • Hidofäs
  • König der Toten
  • Torbjörn
  • Sir Pösi
  • Wiliam Lundström
  • Der Nachtjäger
  • Stählerner Krieger
  • Nöber of Guards
  • Skalle pär
  • Höfix
  • Bürgy
  • Wächter der Hölle
  • Gordon Dödsmetal
  • Nöber

tobiasehlert avatar Sep 02 '25 12:09 tobiasehlert

Thanks for your PR @Skyliife, but I've created #506 to address only the umlaut issue itself.

Any particular reason why we should switch to charset.NewReader? I can see the benefit of using the Content-Type header, but maybe I'm missing something else.

tobiasehlert avatar Sep 17 '25 12:09 tobiasehlert

@Skyliife, I didn't notice that the encoding from tibia.com is UTF-8 now, so I should have given you credit in #511.

tobiasehlert avatar Sep 23 '25 09:09 tobiasehlert