python-goose Add support for semantic markup

It would be fantastic to have the option to extract article data using Schema.org with a fallback to OpenGraph.

Example - http://www.wired.com/2014/05/star-wars-storyboards-video/

Wired makes effective use of schema.org as seen below:

<a itemprop="url headline name" href="http://www.wired.com/2014/05/star-wars-storyboards-video/" rel="bookmark" title="Permanent Link to Check Out Early Storyboards From the Original Star Wars Trilogy">Check Out Early Storyboards From the Original <em>Star Wars</em> Trilogy</a>
</h1>
<link itemprop="image" href="http://www.wired.com/wp-content/uploads/2014/05/star-wars-storyboards-feat.jpg" />
...
    <li class="entryDate"><time itemprop="datePublished" datetime="2014-05-06T06:30:56+00:00">05.06.14</time>&nbsp;&nbsp;&#124;&nbsp;&nbsp;</li>
...
<span itemprop="articleBody"><p><iframe width="660" height="371" src="//www.youtube.com/embed/8RlpNvUumy0" frameborder="0" allowfullscreen></iframe></p>
<p>Sure, everyone gets excited about <a href="http://www.wired.com/2014/05/jj-abrams-star-wars-video/" target="_blank">May the Fourth</a>
...
</p></span>

A minimal implementation could include:

Schema.org (http://schema.org/Article)
- headline
- author
- image
- datePublished
- articleBody

OpenGraph (http://ogp.me/)
- og:title
- og:image
- og:description
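
The list above could be sketched roughly as follows. This is only an illustration under stated assumptions, not python-goose code: the names `SemanticMarkupParser` and `extract_article` are hypothetical, and it uses the standard library's `html.parser` so that it is self-contained. It also deliberately deviates from strict microdata rules in one place: on an `<a>` element only the `url` property takes the `href`, while other properties (such as `headline`) take the link text, which matches how the Wired markup above is written.

```python
from html.parser import HTMLParser

# Void elements never have content, so their value must come from an attribute.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SRC_TAGS = {"img", "audio", "video", "source", "embed", "iframe"}


class SemanticMarkupParser(HTMLParser):
    """Collects schema.org itemprop values and OpenGraph <meta> values."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.schema = {}      # itemprop name -> first value seen
        self.opengraph = {}   # "og:*" property -> content
        self._captures = []   # elements whose text content is being collected

    @staticmethod
    def _attribute_value(tag, attrs, prop):
        """Return the attribute-based value, or None to fall back to text."""
        if tag == "meta":
            return attrs.get("content") or ""
        if tag == "time" and attrs.get("datetime"):
            return attrs["datetime"]
        if tag == "link" or (tag == "a" and prop == "url"):
            return attrs.get("href") or ""
        if tag in SRC_TAGS:
            return attrs.get("src") or ""
        return None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        og_name = attrs.get("property") or ""
        if tag == "meta" and og_name.startswith("og:") and attrs.get("content"):
            self.opengraph.setdefault(og_name, attrs["content"])
        # Track nesting of same-named tags inside elements being captured.
        for cap in self._captures:
            if cap["tag"] == tag:
                cap["depth"] += 1
        text_props = []
        for prop in (attrs.get("itemprop") or "").split():
            value = self._attribute_value(tag, attrs, prop)
            if value is not None:
                self.schema.setdefault(prop, value)
            elif tag not in VOID_TAGS:
                text_props.append(prop)
        if text_props:
            self._captures.append(
                {"tag": tag, "depth": 1, "props": text_props, "chunks": []})

    def handle_endtag(self, tag):
        still_open = []
        for cap in self._captures:
            if cap["tag"] == tag:
                cap["depth"] -= 1
                if cap["depth"] == 0:
                    # Collapse whitespace; paragraph breaks are not preserved.
                    text = " ".join("".join(cap["chunks"]).split())
                    for prop in cap["props"]:
                        self.schema.setdefault(prop, text)
                    continue
            still_open.append(cap)
        self._captures = still_open

    def handle_data(self, data):
        for cap in self._captures:
            cap["chunks"].append(data)


def extract_article(html):
    """Prefer schema.org values; fall back to OpenGraph where one exists."""
    parser = SemanticMarkupParser()
    parser.feed(html)
    parser.close()
    schema, og = parser.schema, parser.opengraph
    return {
        "title": schema.get("headline") or schema.get("name") or og.get("og:title"),
        "url": schema.get("url"),
        "author": schema.get("author"),
        "image": schema.get("image") or og.get("og:image"),
        "date_published": schema.get("datePublished"),
        "body": schema.get("articleBody"),
        "description": og.get("og:description"),
    }
```

Calling `extract_article(html)` on a page returns a plain dict; any field the page does not mark up comes back as `None`, so a caller could still fall back to the existing heuristic extraction for those fields.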

May 06 '14 17:05 jeffnappi

This was just a thought. I intend to implement this whether it becomes part of python-goose or not, but thought it would be good to open up a conversation about it.

Is this something that you would like to see added to python-goose?

May 06 '14 17:05 jeffnappi

Hello Jeff,

I have thought for a long time that the future of goose (version 2) would be based on extracting HTML5 semantic tags.

It seems obvious to me that most news sites now use <article> and other specific tags in their markup. Since these tags are increasingly used for SEO, relying on them could make text extraction both faster and more reliable.
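
As a rough illustration of the idea (a sketch only, not how goose currently works), text could be taken from inside `<article>` when that tag is present. The names `ArticleTextParser` and `article_text` are hypothetical, the set of skipped tags is an assumption, and the standard library's `html.parser` is used to keep the example self-contained.

```python
from html.parser import HTMLParser

# Assumed to be non-article content even when nested inside <article>.
SKIP_TAGS = {"script", "style", "nav", "aside", "footer"}
# Closing one of these ends a block of text, so a separator is inserted.
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


class ArticleTextParser(HTMLParser):
    """Collects the text found inside <article> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.article_depth = 0   # > 0 while inside an <article>
        self.skip_depth = 0      # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif self.article_depth and tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if not self.article_depth:
            return
        if tag == "article":
            self.article_depth -= 1
        elif tag in SKIP_TAGS:
            if self.skip_depth:
                self.skip_depth -= 1
        elif tag in BLOCK_TAGS and not self.skip_depth:
            self.chunks.append(" ")

    def handle_data(self, data):
        if self.article_depth and not self.skip_depth:
            self.chunks.append(data)


def article_text(html):
    """Return whitespace-normalized <article> text, or None if there is none."""
    parser = ArticleTextParser()
    parser.feed(html)
    parser.close()
    text = " ".join("".join(parser.chunks).split())
    return text or None
```

Returning `None` when no `<article>` text is found leaves room for a caller to fall back to content-scoring heuristics on pages without semantic markup.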

xav

May 07 '14 06:05 grangier

What is the roadmap for v2? Is this something you'll realistically have time for?

Jul 24 '14 10:07 mwjackson