Not getting any extracted text
I tried the following, but got only the title and no text:
    >>> from goose import Goose
    >>> url = 'http://householdproducts.nlm.nih.gov/cgi-bin/household/list?tbl=TblBrands&alpha=0'
    >>> g = Goose()
    >>> article = g.extract(url=url)
    >>> article.title
    u'Household Products Database - Health and Safety Information on Household Products'
    >>> article.meta_description
    ''
    >>> article.cleaned_text[:2000]
    u''
I then downloaded the page and tried extracting from the raw HTML, and got the same result:
    >>> raw_html = html_file.read()
    >>> a = g.extract(raw_html=raw_html)
    >>> a.title
    u'Household Products Database - Health and Safety Information on Household Products'
    >>> a.meta_description
    ''
    >>> a.cleaned_text
    u''
    >>> raw_html
    '\n
\nHousehold Products Database - Health and Safety Information on Household Products \n\n\n\n\n<SCRIPT LANGUAGE="Javascript">\n<!--\nvar Ver4=parseInt(navigator.appVersion) >=4\nvar Nav4=((navigator.appName=="Netscape") && Ver4)\nvar IE4=((navigator.userAgent.indexOf("MSIE")!=-1) && Ver4)\nfunction linkout(query) {\n var q;\n var w;\n var mW;\n re=/[ \r\n]+/g;\nif (query) q=query;\nelse {\n if (Nav4)\n q=document.getSelection();\n else\n q=document.selection.createRange().text;\n if(!q)void(q=prompt('Enter text to search TOXNET. You can also highlight a term on this web page before clicking on this button.',''));\n}\n if(q) {\n document.toxsearch.queryxxx.value=q;\n mW=window.open('', 'p....................<LI>3-IN-ONE Multi-Purpose Oil with Telescoping Marksman Spout</A>\n<LI>3-IN-ONE Multi-Purpose Oil-02/01/2012</A>\n<LI>3-IN-ONE Multipurpose Oil-01/01/2005-Old Product</A>\n<LI>3-IN-ONE Professional Cleaner Degreaser</A>\n<LI>3-IN-ONE Professional Dry Lube</A>\n<LI>3-IN-ONE Professional Engine Starter</A>\n<LI>3-IN-ONE Professional Garage Door Lubricant</A>\n<LI>3-IN-ONE Professional Grade Pneumatic Tool Oil</A>\n<LI>3-IN-ONE Professional Penetrant Spray</A>\n<LI>3-IN-ONE Professional Silicone Spray Lubricant</A>\n<LI>3-IN-ONE Professional White Lithium</A>\n<LI>303 Cleaner and Spot Remover</A>\n<LI>303 Convertible Top Cleaner</A>\n<LI>303 Instant Windshield Washer Tablets</A>\n<LI>303 Shower Shield</A>\n<LI>303 Wiper Treatment</A>\n<LI>3BT Humidifier Bacteriostatic Treatment</A>\n...................
What am I doing wrong?
Thanks.
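The page is a list of links, not an article, so I don't think Goose is the right tool for this. Goose looks for the main body text of an article-style page, and a page made up only of list items gives it nothing to pick as the article, which would explain the empty cleaned_text.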
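If all you need is the product names, a plain HTML parser may be enough. Below is a minimal sketch using Python 3's standard-library html.parser instead of Goose. The names ListItemCollector, extract_list_items and sample are made up for illustration, and the sample string only imitates the `<LI>...</A>` lines from the dump above. It assumes each product sits in its own `<LI>` and that the `<LI>` tags are never closed, as in the posted HTML.

```python
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Collect the visible text of every <li> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_item = False
        self._buffer = []

    def _flush(self):
        # Join buffered text chunks and collapse runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.items.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <LI> arrives as "li".
        if tag == "li":
            # Assumption: <LI> is never closed, so a new one ends the previous item.
            if self._in_item:
                self._flush()
            self._in_item = True

    def handle_endtag(self, tag):
        # A closing list tag also ends the current item; stray </A> is ignored.
        if tag in ("li", "ul", "ol") and self._in_item:
            self._flush()
            self._in_item = False

    def handle_data(self, data):
        if self._in_item:
            self._buffer.append(data)

    def close(self):
        super().close()
        # Flush an item left open at the end of the document.
        if self._in_item:
            self._flush()
            self._in_item = False


def extract_list_items(raw_html):
    """Return the text of each list item found in raw_html."""
    parser = ListItemCollector()
    parser.feed(raw_html)
    parser.close()
    return parser.items


# Made-up sample shaped like the list lines in the posted raw_html.
sample = (
    "<UL>\n"
    "<LI>303 Shower Shield</A>\n"
    "<LI>303 Wiper Treatment</A>\n"
    "</UL>\n"
)
print(extract_list_items(sample))
```

To try it on the real page, pass the downloaded HTML string (decoded to text) to extract_list_items in place of sample. If the page also uses `<LI>` for navigation, those entries will be collected too and would need filtering.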