Cannot parse UTF-16 XML
UTF-16 XML parsing does not seem to work at all.
Current Behaviour
To reproduce it, try changing this test function so that it also prints tag names: https://github.com/tafia/quick-xml/blob/118c07b456e5a4bb730ca6d762a261ab525784da/tests/unit_tests.rs#L904-L921
#[test]
#[cfg(feature = "encoding")]
fn test_unescape_and_decode_without_bom_removes_utf16le_bom() {
    let mut reader = Reader::from_file("./tests/documents/utf16le.xml").unwrap();
    reader.trim_text(true);
    let mut txt = Vec::new();
    let mut buf = Vec::new();
    loop {
        match reader.read_event(&mut buf) {
            Ok(Event::Text(e)) => txt.push(e.unescape_and_decode_without_bom(&mut reader).unwrap()),
            Ok(Event::Eof) => break,
            Ok(Event::Start(e)) => txt.push(reader.decode(&e.local_name()).to_string()), // add this line
            _ => (),
        }
    }
    // print the content of txt
    println!("{:?}", txt);
    panic!(); // deliberately fail so the printed output is shown
    assert_eq!(txt[0], "");
}
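
(For reference, a document like tests/documents/utf16le.xml can be regenerated with a sketch along these lines. The exact content of the real test file is an assumption here; the point is just an ordinary XML document written as UTF-16LE with a BOM.)

use std::fs::File;
use std::io::Write;

fn write_utf16le(path: &str, xml: &str) -> std::io::Result<()> {
    let mut f = File::create(path)?;
    f.write_all(&[0xFF, 0xFE])?; // UTF-16LE BOM
    for unit in xml.encode_utf16() {
        f.write_all(&unit.to_le_bytes())?; // little-endian code units
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    write_utf16le(
        "utf16le.xml",
        "<?xml version=\"1.0\" encoding=\"UTF-16\"?>\n<project></project>\n",
    )
}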
You would expect to see something like ["", "project"].
But instead, this comes out: ["", "㼀砀洀氀�", "\u{a00}�", "瀀爀漀樀攀挀琀�", "\u{a00}�", "⼀瀀爀漀樀攀挀琀�", "\u{a00}�"]
Likely Reason
It looks like the parser compares a single byte to check whether the character is b'<' (0x3C) or another special character, then advances the position by one byte. But in UTF-16LE, the < character is actually the two bytes 0x3C 0x00.
So when the parser tries to parse <?..., which in UTF-16LE (with the BOM) is 0xFF 0xFE 0x3C 0x00 0x3F 0x00 ..., it consumes only 0x3C (<) and takes the element's raw name to be 0x00 0x3F 0x00 .... From then on every two-byte code unit is read one byte off, which is exactly where 㼀 comes from (㼀 is 0x00 0x3F).
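
The misalignment is easy to demonstrate outside the parser (a standalone sketch, not quick-xml code):

fn main() {
    // "<?xml" encoded as UTF-16LE with a BOM.
    let mut bytes = vec![0xFFu8, 0xFE];
    for unit in "<?xml".encode_utf16() {
        bytes.extend_from_slice(&unit.to_le_bytes());
    }
    println!("{:02X?}", bytes); // [FF, FE, 3C, 00, 3F, 00, 78, 00, 6D, 00, 6C, 00]

    // If only the single byte 0x3C is consumed for '<', the name starts at
    // the stray 0x00 and every code unit pairs up one byte off:
    // 00 3F -> U+3F00 (㼀), 00 78 -> U+7800 (砀), ...
    let shifted: Vec<u16> = bytes[3..]
        .chunks_exact(2)
        .map(|c| u16::from_le_bytes([c[0], c[1]]))
        .collect();
    println!("{}", String::from_utf16_lossy(&shifted)); // 㼀砀洀氀
}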
Changing the added line to txt.push(reader.decode(&e.local_name()[1..]).to_string()) makes the element names come out correctly, because skipping the stray leading 0x00 realigns the code units (even though Event::End and Event::Decl are lumped in with Event::Start): ["", "?xml", "\u{a00}�", "project", "\u{a00}�", "/project", "\u{a00}�"].
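
Until UTF-16 input is handled natively, one possible workaround is to transcode the whole document to UTF-8 before handing it to the reader. This is only a sketch, assuming the encoding_rs crate (which the encoding feature already pulls in) and hard-coding UTF-16LE:

use quick_xml::events::Event;
use quick_xml::Reader;

fn main() {
    let raw = std::fs::read("./tests/documents/utf16le.xml").unwrap();
    // Transcode to UTF-8 up front; decode() sniffs and strips the BOM.
    let (utf8, _, _) = encoding_rs::UTF_16LE.decode(&raw);
    let mut reader = Reader::from_str(&utf8);
    reader.trim_text(true);
    let mut buf = Vec::new();
    loop {
        match reader.read_event(&mut buf) {
            Ok(Event::Start(e)) => println!("{}", String::from_utf8_lossy(e.local_name())),
            Ok(Event::Eof) => break,
            Err(e) => panic!("{}", e),
            _ => (),
        }
        buf.clear();
    }
}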