python-goose

Not getting any extracted text

Open. peterswang opened this issue 10 years ago • 1 comment

Tried the following, but only got the title, and no text:

>>> from goose import Goose
>>> url = 'http://householdproducts.nlm.nih.gov/cgi-bin/household/list?tbl=TblBrands&alpha=0'
>>> g = Goose()
>>> article = g.extract(url=url)
>>> article.title
u'Household Products Database - Health and Safety Information on Household Products'
>>> article.meta_description
''
>>> article.cleaned_text[:2000]
u''

Downloaded this page and tried extracting raw HTML as follows, and got the same result:

>>> raw_html = html_file.read()
>>> a = g.extract(raw_html=raw_html)
>>> a.title
u'Household Products Database - Health and Safety Information on Household Products'
>>> a.meta_description
''
>>> a.cleaned_text
u''
>>> raw_html
'\n

\nHousehold Products Database - Health and Safety Information on Household Products\n\n\n\n\n<SCRIPT LANGUAGE="Javascript">\n<!--\nvar Ver4=parseInt(navigator.appVersion) >=4\nvar Nav4=((navigator.appName=="Netscape") && Ver4)\nvar IE4=((navigator.userAgent.indexOf("MSIE")!=-1) && Ver4)\nfunction linkout(query) {\n var q;\n var w;\n var mW;\n re=/[ \r\n]+/g;\nif (query) q=query;\nelse {\n if (Nav4)\n q=document.getSelection();\n else\n q=document.selection.createRange().text;\n if(!q)void(q=prompt('Enter text to search TOXNET. You can also highlight a term on this web page before clicking on this button.',''));\n}\n if(q) {\n document.toxsearch.queryxxx.value=q;\n mW=window.open('', 'p....................<LI>3-IN-ONE Multi-Purpose Oil with Telescoping Marksman Spout</A>\n<LI>3-IN-ONE Multi-Purpose Oil-02/01/2012</A>\n<LI>3-IN-ONE Multipurpose Oil-01/01/2005-Old Product</A>\n<LI>3-IN-ONE Professional Cleaner Degreaser</A>\n<LI>3-IN-ONE Professional Dry Lube</A>\n<LI>3-IN-ONE Professional Engine Starter</A>\n<LI>3-IN-ONE Professional Garage Door Lubricant</A>\n<LI>3-IN-ONE Professional Grade Pneumatic Tool Oil</A>\n<LI>3-IN-ONE Professional Penetrant Spray</A>\n<LI>3-IN-ONE Professional Silicone Spray Lubricant</A>\n<LI>3-IN-ONE Professional White Lithium</A>\n<LI>303 Cleaner and Spot Remover</A>\n<LI>303 Convertible Top Cleaner</A>\n<LI>303 Instant Windshield Washer Tablets</A>\n<LI>303 Shower Shield</A>\n<LI>303 Wiper Treatment</A>\n<LI>3BT Humidifier Bacteriostatic Treatment</A>\n...................

What am I doing wrong?

Thanks.

peterswang, Feb 25 '15 23:02

The page is a list of links; I don't think goose is the right tool for this...

yprez, Feb 26 '15 08:02
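
Following up on the reply above: goose is built to pull article-style body text, so a page that is essentially a list of links will likely come back with an empty `cleaned_text`. If the goal is just the product names, one possible alternative is to read the `<LI>` entries directly. The sketch below uses only Python 3's standard-library `html.parser`. The names `ListItemCollector` and `extract_list_items` are made up for this illustration, and the `HREF="#"` values in the sample are placeholders, since the real link targets are not shown in the issue.

```python
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Collect the visible text of each <li> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.items = []     # finished list-item texts
        self._parts = None  # text chunks of the <li> currently open, else None

    def _flush(self):
        # Store the current item, collapsing runs of whitespace.
        if self._parts is not None:
            text = " ".join("".join(self._parts).split())
            if text:
                self.items.append(text)
        self._parts = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lowercase, so <LI> arrives as "li".
        if tag == "li":
            self._flush()   # older HTML often omits the closing </li>
            self._parts = []

    def handle_endtag(self, tag):
        if tag in ("li", "ul", "ol"):
            self._flush()

    def handle_data(self, data):
        # Text outside any <li> (scripts, headings, ...) is ignored.
        if self._parts is not None:
            self._parts.append(data)

    def close(self):
        super().close()
        self._flush()


def extract_list_items(html):
    """Return the text of every list item found in an HTML string."""
    parser = ListItemCollector()
    parser.feed(html)
    parser.close()
    return parser.items


# Small sample shaped like the page in the issue; HREF values are placeholders.
sample = (
    "<UL>\n"
    '<LI><A HREF="#">303 Shower Shield</A>\n'
    '<LI><A HREF="#">303 Wiper Treatment</A>\n'
    "</UL>"
)
print(extract_list_items(sample))
```

With the downloaded page from the issue, the same function could be called as `extract_list_items(raw_html)`, assuming `raw_html` has already been decoded to a `str`.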