George Richardson
Trace
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/Users/George/miniconda3/envs/d4d-internal-displacement/lib/python3.6/http/client.py in _get_chunk_left(self)
    545             try:
--> 546                 chunk_left = self._read_next_chunk_size()
    547             except ValueError:

/Users/George/miniconda3/envs/d4d-internal-displacement/lib/python3.6/http/client.py in _read_next_chunk_size(self)
    512         try:
--> 513             return...
```
As more articles are gathered, analysed and verified by a human, it would be nice for the ML models to update themselves. Open to discussion on tools and best practices...
There are lots of unused imports, and things like the notebooks could be better organised.
Currently we just return the article if it is scraped successfully, but only the message "retrieval failed" if not. Would be good to add the HTTP status code.
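A minimal sketch of how the failure message could carry the status code. The helper name `failure_message` is hypothetical and not part of the existing scraper; it only relies on the standard library's `http.HTTPStatus`.

```python
from http import HTTPStatus


def failure_message(status_code=None):
    """Build a failure message that includes the HTTP status code when known.

    `failure_message` is a hypothetical helper, not an existing scraper function.
    """
    if status_code is None:
        # No response was received at all (e.g. a timeout), so keep the old message.
        return "retrieval failed"
    try:
        reason = HTTPStatus(status_code).phrase
    except ValueError:
        # Non-standard codes are not listed in HTTPStatus.
        reason = "unknown status"
    return "retrieval failed ({} {})".format(status_code, reason)
```

For example, `failure_message(404)` gives `"retrieval failed (404 Not Found)"`, while calling it with no argument keeps the current message.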
Doesn't seem to happen very often, but have experienced a couple of timeouts while scraping (every few thousand articles). Will post the trace for the next one.
Sometimes no publication date is available and a blank string is returned. However, the db model expects a datetime. Possible fix in `scraper.Scraper.html_article`:
```
if not isinstance(a.publish_date, datetime.datetime):
    article_pub_date...
```
Trace
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/Users/George/miniconda3/envs/d4d-internal-displacement/lib/python3.6/http/client.py in _read_status(self)
    282         try:
--> 283             status = int(status)
    284             if status < 100 or status > 999:

ValueError: invalid...
```
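One way the missing publication date could be handled, as a sketch only. The helper name `normalise_publish_date` is hypothetical, and falling back to `None` is an assumption: it only works if the date column in the db model is nullable.

```python
import datetime


def normalise_publish_date(value):
    """Return `value` if it is a datetime, otherwise None.

    Hypothetical helper. Using None as the fallback assumes the db column
    accepts NULL; swap in another default if it does not.
    """
    if isinstance(value, datetime.datetime):
        return value
    # Blank strings and any other non-datetime values are treated as "no date".
    return None
```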
Can we extract items such as the title and date published from a PDF?
Take the approach from the `classification` notebook and integrate it into the interpreter for classifying and filtering articles.
During scraping, can we tag whether something is text/video/image/pdf? Extra dessert if you can discern between news/blog etc.
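A rough sketch of tagging by the response's `Content-Type` header, assuming the scraper has access to that header. The function name `tag_content_type` and the tag strings are hypothetical; telling news from blogs would need something beyond the header.

```python
def tag_content_type(content_type_header):
    """Map a Content-Type header value to one of text/video/image/pdf/unknown.

    Hypothetical helper; the tag names are illustrative.
    """
    if not content_type_header:
        return "unknown"
    # Drop parameters such as "; charset=utf-8" and normalise case.
    mime = content_type_header.split(";")[0].strip().lower()
    if mime == "application/pdf":
        return "pdf"
    if mime.startswith("video/"):
        return "video"
    if mime.startswith("image/"):
        return "image"
    if mime.startswith("text/") or mime == "application/xhtml+xml":
        return "text"
    return "unknown"
```

Servers sometimes send a wrong or missing header, so this is a first pass rather than a reliable classifier.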