Add support for @itemprop="mainEntity" top-level items
An entity with @itemprop="mainEntity" is also a top-level entity (the primary entity described in the page), per: https://schema.org/mainEntity
If this is not something you want to support in the library, let me know and I'll add a (private) fork to our package repo instead.
Hi, thanks for your PR! I just tried with and without your modifications on the example HTML code located at https://schema.org/mainEntity (example 2).
Currently:
https://schema.org/WebPage
- https://schema.org/breadcrumb: Books > Literature & Fiction > Classics
- https://schema.org/mainEntity: (https://schema.org/Book)
With your PR:
https://schema.org/WebPage
- https://schema.org/breadcrumb: Books > Literature & Fiction > Classics
- https://schema.org/mainEntity: (https://schema.org/Book)
https://schema.org/Book
- https://schema.org/image: http://www.example.com/catcher-in-the-rye-book-cover.jpg
- ...
My remarks:
- Book is now on the same level as WebPage; is that what we want?
- Book is now present both at root level, and under WebPage as mainEntity; should it be filtered out from there?
Thanks for your remarks! It's quite possible I've overlooked something or the page I'm testing this against isn't standards compliant.
Before, the mainEntity wasn't returned at all, since it wasn't the child of some other top-level element.
I will take some time tomorrow to test this against both my example and the example HTML code on schema.org and get back to you on your remarks.
Indeed; the difference is that in the example I'm using, the mainEntity isn't a child of some other top-level item.
As far as I can see from the spec, that should be legal.
Here is the example from https://schema.org/mainEntity edited to reflect the situation that prompted this PR:
<body>
<div itemprop="mainEntity" itemscope itemtype="https://schema.org/Book">
<img itemprop="image" src="catcher-in-the-rye-book-cover.jpg"
alt="cover art: red horse, city in background"/>
<span itemprop="name">The Catcher in the Rye</span> -
<link itemprop="bookFormat" href="https://schema.org/Paperback">Mass Market Paperback
by <a itemprop="author" href="/author/jd_salinger.html">J.D. Salinger</a>
<div itemprop="aggregateRating" itemscope itemtype="https://schema.org/AggregateRating">
<span itemprop="ratingValue">4</span> stars -
<span itemprop="reviewCount">3077</span> reviews
</div>
<div itemprop="offers" itemscope itemtype="https://schema.org/Offer">
Price: $<span itemprop="price">6.99</span>
<meta itemprop="priceCurrency" content="USD" />
<link itemprop="availability" href="https://schema.org/InStock">In Stock
</div>
Product details
<span itemprop="numberOfPages">224</span> pages
Publisher: <span itemprop="publisher">Little, Brown, and Company</span> -
<meta itemprop="datePublished" content="1991-05-01">May 1, 1991
Language: <span itemprop="inLanguage">English</span>
ISBN-10: <span itemprop="isbn">0316769487</span>
Reviews:
<div itemprop="review" itemscope itemtype="https://schema.org/Review">
<span itemprop="reviewRating">5</span> stars -
<b>"<span itemprop="name">A masterpiece of literature</span>"</b>
by <span itemprop="author">John Doe</span>,
Written on <meta itemprop="datePublished" content="2006-05-04">May 4, 2006
<span itemprop="reviewBody">I really enjoyed this book. It captures the essential
challenge people face as they try make sense of their lives and grow to adulthood.</span>
</div>
<div itemprop="review" itemscope itemtype="https://schema.org/Review">
<span itemprop="reviewRating">4</span> stars -
<b>"<span itemprop="name">A good read.</span>" </b>
by <span itemprop="author">Bob Smith</span>,
Written on <meta itemprop="datePublished" content="2006-06-15">June 15, 2006
<span itemprop="reviewBody">Catcher in the Rye is a fun book. It's a good book to read.</span>
</div>
</div>
</body>
Currently, this library doesn't detect any Things in the given snippet, while I think it should.
With that out of the way, I agree that the output for the example you've posted is also not what we want. I can see two basic approaches:
- Only return the mainEntity as a top-level Thing if it doesn't have a parent
- Filter out Book from WebPage so we don't report it twice
I am happy to update this PR to do either - what do you think is best here?
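
To make the first approach concrete, here is a minimal sketch in Python, for illustration only. It is not this library's code: `TopLevelItemFinder` and `find_top_level_types` are hypothetical names, and the sketch assumes the usual microdata rule that an element with `itemscope` and no `itemprop` is top-level. It ignores `itemref` and only records each item's `itemtype`.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TopLevelItemFinder(HTMLParser):
    """Collects the itemtype of every item treated as top-level (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.stack = []      # (tag, has_itemscope) for each currently open element
        self.top_level = []  # itemtype values of the top-level items found

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_item = "itemscope" in attrs
        if is_item:
            props = (attrs.get("itemprop") or "").split()
            inside_item = any(scope for _, scope in self.stack)
            # Usual rule: itemscope without itemprop is top-level.
            # Proposed addition: a mainEntity with no enclosing item is top-level too.
            if not props or ("mainEntity" in props and not inside_item):
                self.top_level.append(attrs.get("itemtype"))
        if tag not in VOID_TAGS:
            self.stack.append((tag, is_item))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def find_top_level_types(html):
    """Return the itemtype of each top-level item in document order."""
    finder = TopLevelItemFinder()
    finder.feed(html)
    finder.close()
    return finder.top_level
```

Under these assumptions, the edited snippet above would yield only the Book, while the original schema.org example (Book nested under WebPage) would still yield only the WebPage, so the Book is not reported twice.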