Cannot parse UTF-16 xml

Open BlueGreenMagick opened this issue 4 years ago • 0 comments

UTF-16 xml parsing does not seem to work at all.

Current Behaviour

To test this, try changing this function to print tag names. https://github.com/tafia/quick-xml/blob/118c07b456e5a4bb730ca6d762a261ab525784da/tests/unit_tests.rs#L904-L921

#[test]
#[cfg(feature = "encoding")]
fn test_unescape_and_decode_without_bom_removes_utf16le_bom() {
    let mut reader = Reader::from_file("./tests/documents/utf16le.xml").unwrap();
    reader.trim_text(true);

    let mut txt = Vec::new();
    let mut buf = Vec::new();

    loop {
        match reader.read_event(&mut buf) {
            Ok(Event::Text(e)) => txt.push(e.unescape_and_decode_without_bom(&mut reader).unwrap()),
            Ok(Event::Eof) => break,
            Ok(Event::Start(e)) => txt.push(reader.decode(&e.local_name()).to_string()), // add this line
            _ => (),
        }
    }
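    // print the content of txt
    println!("{:?}", txt);
    panic!(); // fail on purpose so that `cargo test` shows the captured output
    assert_eq!(txt[0], ""); // unreachable while the panic above is in place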
}

You would expect to see something like ["", "project"]. But instead, this comes out: ["", "㼀砀洀氀�", "\u{a00}�", "瀀爀漀樀攀挀琀�", "\u{a00}�", "⼀瀀爀漀樀攀挀琀�", "\u{a00}�"]

Likely Reason

It looks like the parser compares a single byte to check whether the character is b'<' (0x3C) or another special character, then advances the position by 1 byte. But in UTF-16LE, the < character is actually two bytes: 0x3C 0x00.

So when the parser tries to parse <?..., which in UTF-16LE (with a BOM) is 0xFF 0xFE 0x3C 0x00 0x3F 0x00 ..., it consumes 0x3C (<) and thinks the element's raw name is 0x00 0x3F 0x00 .... (㼀 is U+3F00, i.e. the bytes 0x00 0x3F read as a single UTF-16LE code unit.)
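To illustrate the suspected misalignment without touching quick-xml itself, here is a minimal std-only sketch. `naive_tag_bytes` is a hypothetical stand-in for a byte-oriented scanner, and `decode_utf16le_lossy` is a simplified stand-in for the real decoder; neither is quick-xml's actual code.

```rust
/// Encode `s` as UTF-16LE bytes, optionally preceded by the BOM 0xFF 0xFE.
fn utf16le_bytes(s: &str, with_bom: bool) -> Vec<u8> {
    let mut out = Vec::new();
    if with_bom {
        out.extend_from_slice(&[0xFF, 0xFE]);
    }
    for unit in s.encode_utf16() {
        out.extend_from_slice(&unit.to_le_bytes());
    }
    out
}

/// Decode bytes as UTF-16LE; a dangling odd byte becomes U+FFFD.
/// (Simplified stand-in for a real decoder.)
fn decode_utf16le_lossy(bytes: &[u8]) -> String {
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|c| u16::from_le_bytes([c[0], c[1]]))
        .collect();
    let mut s = String::from_utf16_lossy(&units);
    if bytes.len() % 2 != 0 {
        s.push('\u{FFFD}');
    }
    s
}

/// Hypothetical byte-oriented scan: everything between the first
/// 0x3C byte and the next 0x3E byte, treating each as one-byte markup.
fn naive_tag_bytes(input: &[u8]) -> Option<&[u8]> {
    let start = input.iter().position(|&b| b == b'<')? + 1;
    let len = input[start..].iter().position(|&b| b == b'>')?;
    Some(&input[start..start + len])
}

fn main() {
    let doc = utf16le_bytes("<project>", true);
    let raw = naive_tag_bytes(&doc).expect("tag not found");

    // The slice starts with the high byte of '<' (0x00), so every
    // following code unit is shifted by one byte.
    println!("raw name bytes: {:02X?}", raw);
    println!("decoded as UTF-16LE: {:?}", decode_utf16le_lossy(raw));
}
```

Under these assumptions the "name" has an odd length and begins with 0x00, so decoding it pairs each low byte with the wrong high byte, which is consistent with the garbled names shown above.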

Changing the added line in the code above to txt.push(reader.decode(&e.local_name()[1..]).to_string()) makes the element names come out correct (even though Event::End and Event::Decl are still lumped in with Event::Start): ["", "?xml", "\u{a00}�", "project", "\u{a00}�", "/project", "\u{a00}�"].
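A small std-only sketch of why skipping one byte helps, and of one possible reason the end tag is reported as a start tag. `classify_by_first_byte` is a hypothetical classifier written for this example, not quick-xml's code, and the decoder is a simplified stand-in.

```rust
/// Encode `s` as UTF-16LE bytes (no BOM).
fn utf16le_bytes(s: &str) -> Vec<u8> {
    s.encode_utf16().flat_map(|u| u.to_le_bytes()).collect()
}

/// Decode bytes as UTF-16LE; a dangling odd byte becomes U+FFFD.
fn decode_utf16le_lossy(bytes: &[u8]) -> String {
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|c| u16::from_le_bytes([c[0], c[1]]))
        .collect();
    let mut s = String::from_utf16_lossy(&units);
    if bytes.len() % 2 != 0 {
        s.push('\u{FFFD}');
    }
    s
}

/// Hypothetical classifier that inspects only the single byte after 0x3C.
fn classify_by_first_byte(after_lt: &[u8]) -> &'static str {
    match after_lt.first().copied() {
        Some(b'/') => "End",
        Some(b'?') => "Decl",
        _ => "Start",
    }
}

/// Drop the stray 0x00 (high byte of '<') and decode the rest.
fn realigned_name(after_lt: &[u8]) -> String {
    decode_utf16le_lossy(after_lt.get(1..).unwrap_or(&[]))
}

fn main() {
    let doc = utf16le_bytes("</project>");
    // Bytes after the 0x3C of '<', up to but excluding the 0x3E of '>'.
    let after_lt = &doc[1..doc.len() - 2];

    // The first byte is 0x00 rather than b'/', so a one-byte check
    // cannot recognise the end tag.
    println!("classified as: {}", classify_by_first_byte(after_lt));
    println!("misaligned:    {:?}", decode_utf16le_lossy(after_lt));
    println!("realigned:     {:?}", realigned_name(after_lt));
}
```

Skipping the first byte only realigns the start of the name; it is a diagnostic aid, not a fix, since the scanner would still stop on the low byte of the closing character.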

Sep 27 '21 10:09 BlueGreenMagick