python-goose
DocumentCleaner remove_nodes_re
I keep finding sites that fail to parse due to the remove_nodes_re. Do you think there's a better way this can be handled?
For example, the following article fails to parse because the article's parent div has the class 'scrolling-wrapper':
http://www.shutterstock.com/blog/ceo-jon-oringers-message-to-the-next-generation-embrace-failure
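To illustrate the kind of failure described above, here is a minimal sketch of how an unanchored regex applied to class names can remove a content container. The pattern is hypothetical and shortened: it assumes the real remove_nodes_re contains a fragment such as `scroll`, which may not be the case. The function names are also made up for illustration and are not part of python-goose.

```python
import re

# Hypothetical, shortened pattern. The real remove_nodes_re in python-goose
# is much longer and may not contain these exact alternatives.
HYPOTHETICAL_LOOSE_RE = re.compile(r"^side$|comment|foot|scroll", re.IGNORECASE)

# Same idea, but intended to be matched against whole class tokens only.
HYPOTHETICAL_TOKEN_RE = re.compile(r"side|comment|foot|scroll", re.IGNORECASE)


def would_be_removed_loose(class_attr):
    """Substring search: any class containing a fragment is flagged."""
    return HYPOTHETICAL_LOOSE_RE.search(class_attr) is not None


def would_be_removed_by_token(class_attr):
    """Flag a node only if one whole class token equals a fragment."""
    return any(
        HYPOTHETICAL_TOKEN_RE.fullmatch(token) is not None
        for token in class_attr.split()
    )


if __name__ == "__main__":
    for cls in ("scrolling-wrapper", "comment", "post footer"):
        print(cls, would_be_removed_loose(cls), would_be_removed_by_token(cls))
```

Under these assumptions, the substring search flags 'scrolling-wrapper' and everything inside it would be dropped, while whole-token matching leaves it alone. Whole-token matching is only one possible way to loosen the cleaner, and it would miss compound names like 'footer' that the substring version catches.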
I agree. The document cleaner is too strict. Commit a36b5a8ae1291fdf6e7e7e3e469ec3768faa7cfa should help, but it doesn't seem to be enough for this article.
The extraction still misses the beginning of the article.