html-metadata
⚠️ Self-closing tags get corrupted 🚨
The library doesn't support HTML5 tags (e.g. a self-closing span).
When parsing the following:
<span itemprop="price" content="139.90" />
foo
bar
It keeps adding "foo ... bar" to the price property until it finds a closing </span> tag.
The issue is in chtml, which replaces /> with >.
Steps to reproduce:
var scrape = require('html-metadata');
scrape.loadFromString('<div itemscope><span itemprop="price" content="139.90" /> <span itemprop="priceCurrency" content="PLN" /></div>').then(e => console.log(JSON.stringify(e)));
// {"schemaOrg":{"items":[{"properties":{"priceCurrency":["PLN"],"price":[" "]}}]}}
Possible resolution:
- First of all, htmlparser2 should recognize self-closing tags:
var dom = microdataDom(htmlparser.parseDOM(html, {
decodeEntities: true,
+ recognizeSelfClosing: true
}), config);
- Secondly, cheerio.load(html).html() should not replace /> with >:
var cheerio = require('cheerio');
cheerio.load('<div itemscope><span itemprop="price" content="139.90" /> <span itemprop="priceCurrency" content="PLN" /></div>').html()
// '<html><head></head><body><div itemscope><span itemprop="price" content="139.90"> <span itemprop="priceCurrency" content="PLN"></span></span></div></body></html>'
https://github.com/Janpot/microdata-node/issues/8
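Until both resolutions above land, one possible stopgap is to expand self-closing tags into explicit open/close pairs before the markup reaches cheerio or htmlparser2. The sketch below is only an illustration: `expandSelfClosingTags` is a hypothetical helper, not part of html-metadata or either parser. It is regex-based and assumes quoted attribute values that contain no `>` or `/>`, so it will not cope with comments, scripts, or unquoted attributes.

```javascript
// Hypothetical helper, not part of html-metadata, cheerio, or htmlparser2.
// Void elements never take a closing tag, so they are left untouched.
const VOID_ELEMENTS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
  'link', 'meta', 'param', 'source', 'track', 'wbr'
]);

// Rewrites `<tag ... />` as `<tag ...></tag>` for non-void elements.
// Assumption: attribute values are quoted and contain no `>` or `/>`.
function expandSelfClosingTags(html) {
  const selfClosing = /<([a-zA-Z][\w:-]*)((?:\s[^<>]*?)?)\s*\/>/g;
  return html.replace(selfClosing, (match, name, attrs) => {
    if (VOID_ELEMENTS.has(name.toLowerCase())) {
      return match; // e.g. <br /> stays as-is
    }
    return '<' + name + attrs.trimEnd() + '></' + name + '>';
  });
}

const input = '<div itemscope><span itemprop="price" content="139.90" /> ' +
  '<span itemprop="priceCurrency" content="PLN" /></div>';
console.log(expandSelfClosingTags(input));
// <div itemscope><span itemprop="price" content="139.90"></span> <span itemprop="priceCurrency" content="PLN"></span></div>
```

The expanded string could then be passed to scrape.loadFromString in place of the raw markup, so the two spans should parse as siblings instead of nesting. That is an untested assumption about the downstream parsers, not a confirmed fix.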
Looks like https://github.com/cheeriojs/cheerio/issues/598 might have a solution (setting {xmlMode: true}?)
It's not enough (see # 1). And I'm not sure if "xml mode" supports html5.