Add support for semantic markup
It would be fantastic to have the option to extract article data using Schema.org with a fallback to OpenGraph.
Example - http://www.wired.com/2014/05/star-wars-storyboards-video/
Wired makes effective use of schema.org as seen below:
<a itemprop="url headline name" href="http://www.wired.com/2014/05/star-wars-storyboards-video/" rel="bookmark" title="Permanent Link to Check Out Early Storyboards From the Original Star Wars Trilogy">Check Out Early Storyboards From the Original <em>Star Wars</em> Trilogy</a>
</h1>
<link itemprop="image" href="http://www.wired.com/wp-content/uploads/2014/05/star-wars-storyboards-feat.jpg" />
...
<li class="entryDate"><time itemprop="datePublished" datetime="2014-05-06T06:30:56+00:00">05.06.14</time> | </li>
...
<span itemprop="articleBody"><p><iframe width="660" height="371" src="//www.youtube.com/embed/8RlpNvUumy0" frameborder="0" allowfullscreen></iframe></p>
<p>Sure, everyone gets excited about <a href="http://www.wired.com/2014/05/jj-abrams-star-wars-video/" target="_blank">May the Fourth</a>
...
</p></span>
A minimal implementation could include:
- Schema.org (http://schema.org/Article)
  - headline
  - author
  - image
  - datePublished
  - articleBody
- OpenGraph (http://ogp.me/)
  - og:title
  - og:image
  - og:description
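
To make the idea concrete, here is a rough sketch of the Schema.org-first, OpenGraph-fallback lookup using only Python's standard `html.parser`. It is not python-goose code: the names `SemanticParser` and `extract_article` and the keys of the returned dict are hypothetical. It also simplifies the microdata rules by reading the value from an attribute only for `meta`, `link`, `img`, and `time`, and from the element's text for everything else (so the `<a itemprop="url headline name">` above yields its link text as the headline).

```python
from html.parser import HTMLParser

# Tags whose microdata value lives in an attribute rather than in their text.
# Simplification: the microdata spec also reads href from <a>; this sketch does not.
VALUE_ATTRS = {"meta": "content", "link": "href", "img": "src", "time": "datetime"}


class SemanticParser(HTMLParser):
    """Collects itemprop values and og:* meta tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.schema = {}  # itemprop name -> first value seen
        self.og = {}      # og:* property -> content
        self._open = []   # elements whose text is currently being captured

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        if tag == "meta" and prop.startswith("og:"):
            self.og.setdefault(prop, attrs.get("content") or "")
        # Track same-name nesting so we close each capture on the right end tag.
        for cap in self._open:
            if cap["tag"] == tag:
                cap["depth"] += 1
        props = (attrs.get("itemprop") or "").split()
        if not props:
            return
        if tag in VALUE_ATTRS:
            value = attrs.get(VALUE_ATTRS[tag]) or ""
            for name in props:
                self.schema.setdefault(name, value)
        else:
            self._open.append({"tag": tag, "props": props, "text": [], "depth": 1})

    def handle_data(self, data):
        for cap in self._open:
            cap["text"].append(data)

    def handle_endtag(self, tag):
        for cap in list(self._open):
            if cap["tag"] != tag:
                continue
            cap["depth"] -= 1
            if cap["depth"] == 0:
                self._open.remove(cap)
                text = " ".join("".join(cap["text"]).split())  # collapse whitespace
                for name in cap["props"]:
                    self.schema.setdefault(name, text)


def extract_article(html):
    """Prefer Schema.org itemprops; fall back to OpenGraph where one exists."""
    parser = SemanticParser()
    parser.feed(html)
    parser.close()
    schema, og = parser.schema, parser.og
    return {
        "title": schema.get("headline") or og.get("og:title"),
        "author": schema.get("author"),
        "image": schema.get("image") or og.get("og:image"),
        "date_published": schema.get("datePublished"),
        "body": schema.get("articleBody"),
        "description": og.get("og:description"),
    }
```

Calling `extract_article(page_html)` returns a dict in which any field found in neither vocabulary is `None`. A real implementation would probably build on the lxml tree goose already parses and would need to handle nested `itemscope` items, which this sketch ignores.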
This is just a thought. I intend to implement this whether or not it becomes part of python-goose, but I thought it would be good to open a conversation about it.
Is this something that you would like to see added to python-goose?
Hello Jeff,
I have been thinking for a long time that the future of goose (version 2) should be based on extracting HTML5 semantic tags.
It seems obvious to me that most news sites now use <article> and other specific tags in their markup. Relying on them could speed up text extraction and make it more reliable, since this markup is increasingly used for SEO.
xav
What is the roadmap for v2? Is this something you'll realistically have time for?