
⚠️ Self-closing tags get corrupted 🚨

Open n-sviridenko opened this issue 6 years ago • 4 comments

The library doesn't handle self-closing tags (e.g. a self-closing span).

When parsing the following:

<span itemprop="price" content="139.90" />

foo

bar

It keeps adding the content that follows ("foo ... bar") to the price property until it finds a closing </span> tag.

The issue is in chtml, which replaces /> with >.

n-sviridenko, Aug 09 '19 19:08

Steps to reproduce:

var scrape = require('html-metadata');
var html = '<div itemscope><span itemprop="price" content="139.90" /> <span itemprop="priceCurrency" content="PLN" /></div>';
scrape.loadFromString(html).then(e => console.log(JSON.stringify(e)));

// Output:
// {"schemaOrg":{"items":[{"properties":{"priceCurrency":["PLN"],"price":[" "]}}]}}

Possible resolution:

  1. First of all, htmlparser2 should recognize self-closing tags:
  var dom = microdataDom(htmlparser.parseDOM(html, {
    decodeEntities: true,
+   recognizeSelfClosing: true
  }), config);
  2. Secondly, cheerio.load(html).html() should not replace /> with >:
var cheerio = require('cheerio');
cheerio.load('<div itemscope><span itemprop="price" content="139.90" /> <span itemprop="priceCurrency" content="PLN" /></div>').html()

// '<html><head></head><body><div itemscope><span itemprop="price" content="139.90"> <span itemprop="priceCurrency" content="PLN"></span></span></div></body></html>'
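
Until both points above are addressed, one possible stopgap is to normalize the markup before handing it to loadFromString. The sketch below is only an illustration under stated assumptions: expandSelfClosing and VOID_ELEMENTS are hypothetical names, not part of html-metadata, cheerio, or htmlparser2, and the regex assumes attribute values contain no < or > characters.

```javascript
// Hypothetical helper (not part of any library mentioned in this issue).
// Rewrites <span ... /> as <span ...></span> so that a parser which ignores
// the trailing slash still sees an explicitly closed element.
var VOID_ELEMENTS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'param', 'source', 'track', 'wbr'
]);

function expandSelfClosing(html) {
  // Matches <name attrs /> where attrs contains no angle brackets.
  var selfClosing = /<([a-zA-Z][\w:-]*)((?:\s[^<>]*?)?)\s*\/>/g;
  return html.replace(selfClosing, function (match, name, attrs) {
    if (VOID_ELEMENTS.has(name.toLowerCase())) {
      return match; // void elements never take a closing tag
    }
    return '<' + name + attrs + '></' + name + '>';
  });
}

var sample = '<div itemscope><span itemprop="price" content="139.90" /> ' +
  '<span itemprop="priceCurrency" content="PLN" /></div>';
console.log(expandSelfClosing(sample));
// <div itemscope><span itemprop="price" content="139.90"></span> <span itemprop="priceCurrency" content="PLN"></span></div>
```

Void elements such as <br /> are left untouched. A regex is not a real parser, so this would likely misfire on comments, scripts, or attribute values containing "/>"; it only sidesteps the symptom and does not replace a fix in the parser configuration.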

n-sviridenko, Aug 09 '19 19:08

https://github.com/Janpot/microdata-node/issues/8

n-sviridenko, Aug 09 '19 19:08

Looks like https://github.com/cheeriojs/cheerio/issues/598 might have a solution (setting {xmlMode: true}?)

mvolz, Aug 10 '19 08:08

That's not enough (see #1). And I'm not sure whether "xml mode" supports html5.

n-sviridenko, Aug 10 '19 11:08