DocumentCleaner remove_nodes_re

Open • jeffnappi opened this issue 11 years ago • 1 comment

I keep finding sites that fail to parse due to the remove_nodes_re. Do you think there's a better way this can be handled?

For example, the following article fails to parse because the article's parent div has the class 'scrolling-wrapper':

http://www.shutterstock.com/blog/ceo-jon-oringers-message-to-the-next-generation-embrace-failure
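
To illustrate the kind of failure described above, here is a minimal sketch of how an unanchored keyword pattern can catch a class such as 'scrolling-wrapper'. The pattern is a shortened, hypothetical stand-in and not the library's actual remove_nodes_re, and the helper names are made up for this example. It also shows one possible mitigation: matching whole class tokens instead of substrings.

```python
import re

# Hypothetical, shortened stand-in for a "remove these nodes" pattern.
# This is NOT the real remove_nodes_re; the keywords are assumptions.
UNANCHORED = re.compile(r"comment|scroll|sponsor", re.IGNORECASE)

# Same keywords, but each one must match an entire class token.
TOKEN_ANCHORED = re.compile(r"^(?:comment|scroll|sponsor)$", re.IGNORECASE)


def removed_by_substring(class_attr):
    """True if any keyword appears anywhere in the class attribute."""
    return UNANCHORED.search(class_attr) is not None


def removed_by_token(class_attr):
    """True only if a whitespace-separated class token equals a keyword."""
    return any(TOKEN_ANCHORED.match(token) for token in class_attr.split())


if __name__ == "__main__":
    for cls in ("scrolling-wrapper", "scroll", "article-body"):
        print(cls, removed_by_substring(cls), removed_by_token(cls))
```

With substring matching, 'scrolling-wrapper' is flagged because it contains "scroll", so the whole wrapper div (and the article inside it) would be dropped. Token matching leaves it alone, at the cost of missing variants like 'scroll-ad' that the looser pattern would catch.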

Jul 23 '14 23:07 jeffnappi

I agree. The document cleaner is too strict. Commit a36b5a8ae1291fdf6e7e7e3e469ec3768faa7cfa should help, but it doesn't seem to be enough for this article.

The extraction still doesn't capture the beginning of the article.

Dec 30 '14 01:12 grangier