Splitting on tags more than 1 level deep confuses xmlsplit
XML File
I have the following small XML file:
<?xml version="1.0" encoding="utf-8"?>
<Outer>
<Inner attr="xxx">
<A>1</A>
</Inner>
<Inner otherattr="yyy">
<A>2-0</A>
<A>2-1</A>
<A>2-2</A>
<A>2-3</A>
</Inner>
<Inner>
<A>
<B attr="AA"/>
<C>
<D Dattr="Value"/>
</C>
</A>
</Inner>
</Outer>
Program
And the following program:
import fs from 'fs';
const XmlSplit = require('xmlsplit');

const xmlsplit = new XmlSplit(1, 'A'); // Splitting on tag <A>
const CHUNK_SIZE = 200; // bytes read per stream chunk
const xmlfile = 'Test.xml';

async function start() {
  const stream = fs.createReadStream(xmlfile, { highWaterMark: CHUNK_SIZE });
  // Each 'data' event should carry one complete XML document
  stream.pipe(xmlsplit).on('data', function (data: any) {
    const xmlDocument = data.toString();
    console.log(xmlDocument);
    console.log('--------------------------------------');
  });
}

start();
Expected output
I would expect a separate, well-formed XML document for each A tag, either
<Outer>
<Inner>
<A>
...
</A>
</Inner>
</Outer>
or an XML document without the Inner tag.
Realized output
But XmlSplit returns the following:
<?xml version="1.0" encoding="utf-8"?>
<Outer>
<Inner attr="xxx">
<A>1</A></Outer>
--------------------------------------
<?xml version="1.0" encoding="utf-8"?>
<Outer>
<Inner attr="xxx">
</Inner>
<Inner otherattr="yyy">
<A>2-0</A></Outer>
--------------------------------------
<?xml version="1.0" encoding="utf-8"?>
<Outer>
<Inner attr="xxx">
<A>2-1</A></Outer>
--------------------------------------
<?xml version="1.0" encoding="utf-8"?>
<Outer>
<Inner attr="xxx">
<A>2-2</A></Outer>
--------------------------------------
<?xml version="1.0" encoding="utf-8"?>
<Outer>
<Inner attr="xxx">
<A>2-3</A></Outer>
--------------------------------------
<?xml version="1.0" encoding="utf-8"?>
<Outer>
<Inner attr="xxx">
</Inner>
<Inner>
<A>
<B attr="AA"/>
<C>
<D Dattr="Value"/>
</C>
</A></Outer>
--------------------------------------
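If you look at the returned output, you can see that the process gets confused in several places: every document leaves its last <Inner> element unclosed, and the documents for 2-1, 2-2 and 2-3 are placed under <Inner attr="xxx"> instead of <Inner otherattr="yyy">.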
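To make the problem easy to check, here is a minimal sketch of a tag-balance test. `firstUnclosedElement` is a hypothetical helper written for this illustration, not part of xmlsplit, and it deliberately ignores comments, CDATA sections and `>` inside attribute values.

```typescript
// Hypothetical helper for illustration; not part of xmlsplit.
// Returns the name of the innermost element that is still open when a
// non-matching closing tag or the end of the input is reached, or null if
// no element is left open.
function firstUnclosedElement(xml: string): string | null {
  const open: string[] = [];
  // Groups: 1 = "/" for a closing tag, 2 = element name, 3 = "/" if self-closing
  const tag = /<(\/?)([A-Za-z_][\w.-]*)[^>]*?(\/?)>/g;
  let m: RegExpExecArray | null;
  while ((m = tag.exec(xml)) !== null) {
    const closing = m[1];
    const name = m[2];
    const selfClosing = m[3];
    if (selfClosing) continue; // e.g. <B attr="AA"/> opens nothing
    if (!closing) {
      open.push(name);
      continue;
    }
    const innermost = open.pop();
    if (innermost === undefined) return null; // stray closing tag: out of scope here
    if (innermost !== name) return innermost;
  }
  return open.length > 0 ? open[open.length - 1] : null;
}

// First document from the realized output above
const firstChunk =
  '<?xml version="1.0" encoding="utf-8"?>\n<Outer>\n<Inner attr="xxx">\n<A>1</A></Outer>';
console.log(firstUnclosedElement(firstChunk)); // Inner

// Shape of the expected output
const expectedShape = '<Outer>\n<Inner>\n<A>1</A>\n</Inner>\n</Outer>';
console.log(firstUnclosedElement(expectedShape)); // null
```

Running the same check over each document in the realized output should report an unclosed Inner every time.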
Old question, and this is most likely no longer maintained, but we actually ran into this error the other day, so I went Googling and found this...
I did a "dirty" fix, as we don't really care about the nested elements; we just need the file split up.
The problem is that the first dataChunk (index = 0) will retain the "parent", e.g.:
<Inner attr="xxx">
<A>1</A>
While the following dataChunk parts will be "clean":
<A>1</A>
When this is pieced together again it'll include <Inner attr="xxx"> from the first dataChunk but never close it.
As we don't care about the <Inner attr="xxx"> element, I added a fix on the line after this one:
https://github.com/remuslazar/node-xmlsplit/blob/7a7e081c226ebe0577b35743fd40b22064f621ee/lib/xmlsplit.js#L83
The added check is the tagChk block below:
dataChunks.forEach(function (data, index) {
  const tagChk = new RegExp(`^<${this._tagName}[\\S|>]{1}`);
  if (tagChk.test(data)) {
    // eslint-disable-next-line no-param-reassign
    data = data.slice(data.match(tagChk).at(0)?.length);
  }
  dataChunk += data;
This strips the Inner element from the first dataChunk, which makes the resulting XML valid.
@QAnders looks good to me. Could you please open a pull request with the changes above? Thanks!
Sure, I can do that, @remuslazar, but then I need write access to the repo... :)
I've reworked it a bit so that it does include the element as well now...
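For readers who also want to keep the parent element, one possible approach is to append the closing tags that a chunk is missing. The sketch below is an illustration only and is not the code from the reworked fix or from xmlsplit; `closeOpenElements` is a hypothetical helper, and it ignores comments, CDATA sections and `>` inside attribute values.

```typescript
// Hypothetical helper for illustration; not part of xmlsplit.
// Appends a closing tag for every element that `fragment` opens but never
// closes, innermost element first.
function closeOpenElements(fragment: string): string {
  const open: string[] = [];
  // Groups: 1 = "/" for a closing tag, 2 = element name, 3 = "/" if self-closing
  const tag = /<(\/?)([A-Za-z_][\w.-]*)[^>]*?(\/?)>/g;
  let m: RegExpExecArray | null;
  while ((m = tag.exec(fragment)) !== null) {
    const closing = m[1];
    const name = m[2];
    const selfClosing = m[3];
    if (selfClosing) continue; // self-closing tags open nothing
    if (!closing) {
      open.push(name);
    } else if (open[open.length - 1] === name) {
      open.pop(); // matching close for the innermost open element
    }
  }
  const closers = open
    .reverse()
    .map((name) => `</${name}>`)
    .join('');
  return fragment + closers;
}

// The first dataChunk from the comment above keeps its parent and gets it closed
const repaired = closeOpenElements('<Inner attr="xxx">\n<A>1</A>');
console.log(repaired); // <Inner attr="xxx">\n<A>1</A></Inner> (with a real line break)

// A "clean" chunk is returned unchanged
console.log(closeOpenElements('<A>1</A>')); // <A>1</A>
```

The trade-off against stripping the parent is that the output keeps the nesting of the source document, at the cost of tracking open elements per chunk.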
@QAnders you can fork this repo and create the PR, which I can then merge later on (I have write access).
PR opened, @remuslazar: https://github.com/remuslazar/node-xmlsplit/pull/10