George Richardson

Results 21 issues of George Richardson

Trace
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/Users/George/miniconda3/envs/d4d-internal-displacement/lib/python3.6/http/client.py in _get_chunk_left(self)
    545         try:
--> 546             chunk_left = self._read_next_chunk_size()
    547         except ValueError:

/Users/George/miniconda3/envs/d4d-internal-displacement/lib/python3.6/http/client.py in _read_next_chunk_size(self)
    512         try:
--> 513             return...
```

scraper

As more articles are gathered, analysed and verified by a human, it would be nice for the ML models to update themselves. Open to discussion on tools and best practices...

discussion
modeling

There are lots of unused imports, and things like the notebooks could be better organised.

beginner-friendly

Currently we just return the article if it is scraped successfully, but only the message "retrieval failed" if not. Would be good to add the HTTP status code.

scraper
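
A minimal sketch of how the failure path could carry the status code, assuming the fetch goes through something like `urllib`. The helper name `fetch_article`, the returned dict shape and the `opener` parameter are all hypothetical, not the project's actual API.

```python
import urllib.error
import urllib.request


def fetch_article(url, timeout=10, opener=urllib.request.urlopen):
    """Return a dict describing the outcome of one retrieval attempt.

    Hypothetical helper: the name, the dict shape and the `opener`
    parameter are illustrative assumptions, not the project's API.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return {
                "status": "ok",
                "http_status": response.status,
                "body": response.read(),
            }
    except urllib.error.HTTPError as err:
        # The server answered, but with an error code (404, 500, ...).
        return {"status": "retrieval failed", "http_status": err.code, "body": None}
    except urllib.error.URLError:
        # No HTTP response at all (DNS failure, refused connection, ...).
        return {"status": "retrieval failed", "http_status": None, "body": None}
```

Keeping `http_status` as a separate field would let the caller tell a 404 from a 500 without parsing the message string.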

Doesn't seem to happen very often, but have experienced a couple of timeouts while scraping (every few thousand articles). Will post the trace for the next one.

scraper
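
Until the trace is posted, the exact exception is unknown. One possible mitigation, sketched under the assumption that the failure surfaces as `socket.timeout` or `TimeoutError`, is to retry the fetch a few times with a growing pause. `with_retries` and its parameters are hypothetical names for illustration.

```python
import socket
import time


def with_retries(fetch, url, attempts=3, backoff=1.0, sleep=time.sleep):
    """Call fetch(url), retrying when it times out.

    Hypothetical helper. Assumes timeouts surface as socket.timeout or
    TimeoutError; the real exception may differ once the trace is known.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except (socket.timeout, TimeoutError):
            if attempt == attempts:
                # Out of attempts: let the caller see the timeout.
                raise
            # Wait a little longer after each failed attempt.
            sleep(backoff * attempt)
```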

Sometimes no publication date is available and a blank string is returned. However, the DB model expects a datetime. Possible fix in `scraper.Scraper.html_article`:
```
if not isinstance(a.publish_date, datetime.datetime):
    article_pub_date...
```

beginner-friendly
scraper
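
The proposed fix above is truncated, so the following is only one way the check could be completed, not the original code. It assumes the date column accepts a null value, and `normalise_pub_date` is a hypothetical helper name.

```python
import datetime


def normalise_pub_date(publish_date):
    """Return publish_date if it is a datetime, otherwise None.

    Hypothetical helper. Assumes the DB column is nullable, so None is
    a valid stand-in for "no publication date available".
    """
    if isinstance(publish_date, datetime.datetime):
        return publish_date
    # Blank strings (and anything else unexpected) become None.
    return None
```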

Trace
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/Users/George/miniconda3/envs/d4d-internal-displacement/lib/python3.6/http/client.py in _read_status(self)
    282         try:
--> 283             status = int(status)
    284             if status < 100 or status > 999:

ValueError: invalid...
```

scraper

Can we extract items such as the title and publication date from a PDF?

enhancement
scraper

Take the approach from the `classification` notebook and integrate it into the interpreter for classifying and filtering articles.

interpreter

During scraping, can we tag whether something is text/video/image/PDF? Extra dessert if you can discern between news/blog etc.

data-collection
scraper
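
For the text/video/image/PDF part, one cheap signal is the `Content-Type` header of the response. A minimal sketch, assuming the scraper has access to that header; `tag_content_type` and the `"unknown"` fallback are hypothetical. Telling news from blogs would need something beyond this.

```python
def tag_content_type(content_type_header):
    """Map a Content-Type header value to text/video/image/pdf/unknown.

    Hypothetical helper: the tag names and the "unknown" fallback are
    illustrative choices, not an existing project convention.
    """
    if not content_type_header:
        return "unknown"
    # Drop parameters such as "; charset=utf-8" and normalise case.
    mime = content_type_header.split(";", 1)[0].strip().lower()
    if mime == "application/pdf":
        return "pdf"
    major = mime.split("/", 1)[0]
    if major in ("text", "video", "image"):
        return major
    return "unknown"
```

Servers sometimes send a wrong or missing header, so this is better treated as a first guess than as ground truth.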