python-goose

DocumentCleaner remove_nodes_re

Open jeffnappi opened this issue 11 years ago • 1 comment

I keep finding sites that fail to parse because of remove_nodes_re. Do you think there's a better way to handle this?

For example, the following article fails because the article's parent div has the class 'scrolling-wrapper':

http://www.shutterstock.com/blog/ceo-jon-oringers-message-to-the-next-generation-embrace-failure
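To illustrate the kind of over-matching I mean, here is a minimal sketch. The token list is hypothetical and much shorter than the real remove_nodes_re (I am assuming it contains an unanchored token like "scroll"), and the helper names removed_by_substring and removed_by_token are made up for this example; they are not part of python-goose. It compares a substring regex search against matching whole class names only, which might be one less aggressive option:

```python
import re

# Hypothetical, simplified stand-in for a class/id blacklist.
# The real remove_nodes_re in python-goose is longer and may differ.
REMOVE_NODES_RE = re.compile(r"^side$|scroll|sponsor|popup|comment", re.I)

# The same tokens, but each one must equal a whole class name.
REMOVE_TOKENS = {"side", "scroll", "sponsor", "popup", "comment"}


def removed_by_substring(class_attr):
    """Return True if the regex matches anywhere in the attribute."""
    return REMOVE_NODES_RE.search(class_attr) is not None


def removed_by_token(class_attr):
    """Return True only if one whole class name is a blacklisted token."""
    return any(name.lower() in REMOVE_TOKENS for name in class_attr.split())


for cls in ("scrolling-wrapper", "scroll", "article-body"):
    print(cls, removed_by_substring(cls), removed_by_token(cls))
```

With the substring search, 'scrolling-wrapper' is flagged for removal along with everything inside it; with whole-name matching it is kept, while a div whose class is exactly 'scroll' is still removed.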

jeffnappi, Jul 23 '14 23:07

I agree. The document cleaner is too strict. Commit a36b5a8ae1291fdf6e7e7e3e469ec3768faa7cfa should help, but it doesn't seem to be enough for this article.

It still doesn't get the beginning of the article.

grangier, Dec 30 '14 01:12