python-goose
NY Times doesn't work
from goose import Goose
extractor = Goose()
article = extractor.extract(url='http://www.nytimes.com/2015/05/19/health/study-finds-dense-breast-tissue-isnt-always-a-high-cancer-risk.html?src=me&ref=general')
text = article.cleaned_text
NYT does a ton of redirecting, which is incredibly annoying. The strategy is to set the user agent to look like a browser and then continue from there (learned from a colleague at Factr). If NYT doesn't like the user agent, it will sometimes put you in an infinite redirect loop. This partially has to do with their paywall.
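For anyone who wants to try the user-agent approach, here is a minimal sketch. It relies on the `browser_user_agent` config key that python-goose accepts when you pass a dict to `Goose(...)`. The user agent string and the helper names `make_goose_config` and `extract_text` are my own assumptions for illustration, and I can't promise this gets past the redirect loop for every NYT URL.

```python
# Assumed browser-like User-Agent string; swap in whatever a current browser sends.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def make_goose_config(user_agent=BROWSER_USER_AGENT):
    """Build a Goose config dict that overrides the default user agent."""
    # 'browser_user_agent' is the config key python-goose uses for the
    # User-Agent header it sends when fetching a URL.
    return {"browser_user_agent": user_agent}


def extract_text(url, config=None):
    """Hypothetical helper: fetch `url` with a browser-like UA and return the article text."""
    # Imported lazily so the config helper above can be used without goose installed.
    from goose import Goose

    extractor = Goose(config or make_goose_config())
    article = extractor.extract(url=url)
    return article.cleaned_text
```

Usage would be `text = extract_text(url)` with the NYT URL from the original snippet. If the redirect loop persists, the site may also be expecting cookies, in which case fetching the HTML yourself and passing it to `extract(raw_html=...)` is another thing to try.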