George Richardson issues

Results: 21 issues of George Richardson

Deal with scraping error

Trace
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/Users/George/miniconda3/envs/d4d-internal-displacement/lib/python3.6/http/client.py in _get_chunk_left(self)
    545             try:
--> 546                 chunk_left = self._read_next_chunk_size()
    547             except ValueError:

/Users/George/miniconda3/envs/d4d-internal-displacement/lib/python3.6/http/client.py in _read_next_chunk_size(self)
    512         try:
--> 513             return...
```

scraper
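The trace excerpt is truncated, so the final exception is not visible here. In CPython's `http.client`, a malformed chunk size during a chunked read normally surfaces as `http.client.IncompleteRead`. A minimal sketch of tolerating that case, assuming the scraper can get at a response object with a `read()` method (the helper name `read_body` is hypothetical, not something from the repo):

```python
import http.client


def read_body(response):
    """Read a response body, tolerating a broken chunked stream.

    Returns (data, complete). `complete` is False when the server cut the
    stream short or sent a malformed chunk size.
    """
    try:
        return response.read(), True
    except http.client.IncompleteRead as exc:
        # exc.partial holds whatever bytes arrived before the stream broke.
        return exc.partial, False
```

The caller can then decide whether a partial body is still worth parsing or whether the URL should be queued for another attempt.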

Implement online learning

As more articles are gathered, analysed and verified by a human, it would be nice for the ML models to update themselves. Open to discussion on tools and best practices...

discussion

modeling
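As a starting point for the discussion, here is a toy sketch of what "self-updating" could look like: a bag-of-words perceptron that adjusts its weights one verified article at a time instead of retraining from scratch. It is a pure-Python stand-in, not the project's model; the class name and whitespace tokenisation are assumptions for illustration. Libraries such as scikit-learn expose the same idea through `partial_fit` on some estimators.

```python
class OnlinePerceptron:
    """Binary text classifier that learns from one labelled example at a time."""

    def __init__(self):
        self.weights = {}  # token -> weight
        self.bias = 0.0

    def _score(self, tokens):
        return self.bias + sum(self.weights.get(t, 0.0) for t in tokens)

    def predict(self, text):
        return 1 if self._score(text.lower().split()) > 0 else 0

    def update(self, text, label):
        """Fold in one human-verified example (label is 0 or 1)."""
        tokens = text.lower().split()
        predicted = 1 if self._score(tokens) > 0 else 0
        error = label - predicted  # 0 when the model was already right
        if error:
            for t in tokens:
                self.weights[t] = self.weights.get(t, 0.0) + error
            self.bias += error
        return predicted


model = OnlinePerceptron()
model.update("families displaced by flooding", 1)
model.update("football match results", 0)
```

Each call to `update` costs time proportional to the length of one article, so the model can be refreshed whenever a human verifies a new label.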

Clean up repo and code

There are lots of unused imports, and things like the notebooks could be better organised.

beginner-friendly

Add status code for article retrieval

Currently we just return the article if it is scraped successfully, but only the message "retrieval failed" if not. Would be good to add the HTTP status code.

scraper
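One possible shape for this, sketched with the standard library only. `FetchResult` and `fetch` are hypothetical names, and the real scraper may use a different HTTP client; the point is just that the status code travels alongside the "retrieval failed" message.

```python
import urllib.error
import urllib.request
from typing import NamedTuple, Optional


class FetchResult(NamedTuple):
    status: Optional[int]   # HTTP status code, or None if no response arrived
    body: Optional[bytes]   # raw article bytes on success
    error: Optional[str]    # failure message, or None on success


def fetch(url, opener=urllib.request.urlopen, timeout=10):
    """Fetch a URL and report the HTTP status code either way."""
    try:
        with opener(url, timeout=timeout) as resp:
            return FetchResult(resp.status, resp.read(), None)
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (404, 500, ...).
        return FetchResult(exc.code, None, "retrieval failed")
    except urllib.error.URLError as exc:
        # No HTTP response at all (DNS failure, refused connection, ...).
        return FetchResult(None, None, f"retrieval failed: {exc.reason}")
```

`HTTPError` is caught before `URLError` because it is a subclass of it; swapping the order would hide the status code.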

Deal with timeouts when scraping

Doesn't seem to happen very often, but have experienced a couple of timeouts while scraping (every few thousand articles). Will post the trace for the next one.

scraper
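Until a trace is available, a retry wrapper with exponential backoff is one way to ride out occasional timeouts. This is a sketch under assumptions: `fetch_fn` stands for whatever callable the scraper uses to download a URL, and the timeout is assumed to surface as `socket.timeout` or `TimeoutError`.

```python
import socket
import time


def fetch_with_retry(fetch_fn, url, attempts=3, backoff=1.0, sleep=time.sleep):
    """Call fetch_fn(url), retrying on timeouts with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch_fn(url)
        except (socket.timeout, TimeoutError):
            if attempt == attempts:
                raise  # out of retries: let the caller see the timeout
            # Wait backoff, 2*backoff, 4*backoff, ... between attempts.
            sleep(backoff * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the function easy to test without real delays.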

Deal with datetime issue

Sometimes no publication date is available and a blank string is returned. However, the db model expects a datetime. Possible fix in `scraper.Scraper.html_article`:
```
if not isinstance(a.publish_date, datetime.datetime):
    article_pub_date...
```

beginner-friendly

scraper
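The proposed fix above is cut off, so the following is not a reconstruction of it, only one way the check could be packaged. It assumes the db column accepts NULL, so a missing date can be stored as `None`; the function name is hypothetical.

```python
import datetime


def normalise_publish_date(value):
    """Return value if it is a datetime, otherwise None.

    Assumes a blank string (or anything else that is not a datetime)
    means no publication date was found.
    """
    if isinstance(value, datetime.datetime):
        return value
    return None
```

If the column is not nullable, a sentinel date or a separate "date unknown" flag would be needed instead.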

Rare case of site not returning true 404

Trace
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/Users/George/miniconda3/envs/d4d-internal-displacement/lib/python3.6/http/client.py in _read_status(self)
    282         try:
--> 283             status = int(status)
    284             if status < 100 or status > 999:

ValueError: invalid...
```

scraper
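The excerpt stops before the end of the trace. In CPython's `http.client`, a status line that cannot be parsed as an integer is normally re-raised as `http.client.BadStatusLine`, so catching that is one way to treat such sites as failed retrievals. A sketch, with `get_status` and its `connection_factory` parameter as hypothetical names:

```python
import http.client


def get_status(host, path="/", connection_factory=http.client.HTTPConnection,
               timeout=10):
    """Return the HTTP status code, or None if the status line is malformed."""
    conn = connection_factory(host, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    except http.client.BadStatusLine:
        # The server replied with something that is not a valid status line.
        return None
    finally:
        conn.close()
```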

Extract document details from PDF

Can we extract items such as the title and publication date from a PDF?

enhancement

scraper
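Part of the answer depends on which PDF library is used to read the document metadata, which is left open here. Whatever the library, dates in a PDF's Info dictionary usually arrive as strings like `D:20170312154501+01'00'`, so a small parser is needed. A sketch (the function name is hypothetical, and the timezone offset is deliberately ignored):

```python
import datetime
import re

# D:YYYYMMDDHHmmSS, where everything after the year is optional.
_PDF_DATE = re.compile(r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?")


def parse_pdf_date(raw):
    """Parse a PDF date string into a naive datetime, or return None."""
    match = _PDF_DATE.match(raw.strip())
    if not match:
        return None
    # Missing fields fall back to the first month/day and to midnight.
    defaults = (0, 1, 1, 0, 0, 0)
    parts = [int(g) if g is not None else d
             for g, d in zip(match.groups(), defaults)]
    try:
        return datetime.datetime(*parts)
    except ValueError:
        # Out-of-range values such as month 13.
        return None
```

Many PDFs have empty or wrong metadata, so the result is best treated as a hint rather than a verified publication date.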

Integrate LSI classification approach into interpreter

Take the approach from the `classification` notebook and integrate it into the interpreter for classifying and filtering articles.

interpreter

Scraper - Tag content type

During scraping, can we tag whether something is text/video/image/pdf? Extra dessert if you can discern between news/blog etc.

data-collection

scraper
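For the text/video/image/pdf part, the `Content-Type` response header is probably the cheapest signal. A minimal sketch, assuming the scraper has access to that header; the function name and the `"other"` fallback are illustrative choices, and telling news from blogs is not attempted here.

```python
def tag_content_type(content_type_header):
    """Map a Content-Type header value to text/video/image/pdf/other."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    mime = (content_type_header or "").split(";", 1)[0].strip().lower()
    if mime == "application/pdf":
        return "pdf"
    major = mime.split("/", 1)[0]
    if major in ("text", "image", "video"):
        return major
    return "other"
```

Servers sometimes send a wrong or missing `Content-Type`, so falling back to the URL's file extension may be worth adding later.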