Michael Heilman issues

Results: 12 issues of Michael Heilman

tokenization issues for non-ascii texts

The NLTK tokenizer used in the code doesn't handle curly (typographic) quotation marks well: they end up attached to words rather than becoming separate tokens. We should probably either...
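One possible pre-processing step, sketched here under assumptions (the options the issue goes on to propose are cut off above, so this is not necessarily one of them), is to map typographic quotes to their ASCII equivalents before the text reaches the tokenizer. The helper name `normalize_quotes` is hypothetical.

```python
# Hypothetical helper: replace curly quotes with ASCII quotes so that a
# tokenizer tuned for ASCII punctuation can treat them as quotation marks.
CURLY_TO_ASCII = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (also used as apostrophe)
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def normalize_quotes(text):
    """Return text with typographic quotes mapped to ASCII quotes."""
    return text.translate(CURLY_TO_ASCII)
```

The normalized string can then be passed to whatever tokenizer the code already uses. The trade-off is that the original characters are lost unless the mapping is recorded separately.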

collapse_rst_labels.py could be cleaned up a bit

Some of the regular expressions are more complicated than necessary (e.g., they include extraneous instances of `.*`), and in some cases `str.startswith` could perhaps be used instead of `re.search`.
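To illustrate the kind of simplification meant here, the following sketch compares a pattern with a redundant trailing `.*` against a plain prefix check. The `label:` prefix and both function names are made up for the example and are not taken from `collapse_rst_labels.py`.

```python
import re


def is_label_line_regex(line):
    # The trailing `.*` adds nothing: the anchor and the literal prefix
    # already decide whether the line matches.
    return re.search(r"^label:.*", line) is not None


def is_label_line_prefix(line):
    # Same result for a fixed literal prefix, with no regex involved.
    return line.startswith("label:")
```

For a fixed literal prefix the two functions agree on every input, and the second one is easier to read.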

minor

need more specific use of logging

Currently, the code just uses `logging.info`, `logging.warning`, etc. to record log messages. It would be better to instantiate one logger for the module, or loggers for each class, etc....

enhancement

remove the nltk POS tagger from convert_rst_discourse_tb.py

Currently, `convert_rst_discourse_tb.py` uses NLTK's POS tagger to create flat trees for sentences that are in the RST treebank but not the Penn Treebank. This dependency should eventually be removed and...

enhancement

parsing evaluation metrics

We need some methods/scripts to evaluate parsing performance. We probably want to do two things: a) replicate previous work that uses parseval so that we can easily report previous results...
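As a rough sketch of what a PARSEVAL-style metric computes, the function below scores predicted labeled spans against gold spans with precision, recall, and F1. The `(label, start, end)` span representation and the function name are assumptions made for the example; the exact units scored in earlier work may differ.

```python
from collections import Counter


def parseval_scores(gold_spans, pred_spans):
    """Return (precision, recall, f1) over labeled spans.

    Each span is a (label, start, end) tuple. Counters are used so that
    duplicate spans are matched at most as often as they occur in both sets.
    """
    gold = Counter(gold_spans)
    pred = Counter(pred_spans)
    matched = sum((gold & pred).values())
    precision = matched / sum(pred.values()) if pred else 0.0
    recall = matched / sum(gold.values()) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For example, with three gold spans and four predicted spans of which two match, precision is 2/4 and recall is 2/3.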

LSTM model equations

The code says it implements the version of the LSTM from Graves et al. (2013), which I assume is this http://www.cs.toronto.edu/~graves/icassp_2013.pdf or http://www.cs.toronto.edu/~graves/asru_2013.pdf. However, it looks like the LSTM equations...
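For comparison, the LSTM formulation usually cited from Graves et al. (2013) is the peephole variant shown below, where $\sigma$ is the logistic sigmoid and the cell-to-gate weights $W_{ci}$, $W_{cf}$, $W_{co}$ are diagonal. This is a reference sketch and should be checked against the linked PDFs before it is compared with the code.

```latex
\begin{align}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o) \\
h_t &= o_t \odot \tanh(c_t)
\end{align}
```

Note that the input and forget gates look at the previous cell state $c_{t-1}$, while the output gate looks at the updated cell state $c_t$.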

0 weight for unassociated words

In `class_lm_cluster.compute_weight()`, if two words never occur next to each other (i.e., `paircount == 0`), the function returns 0.0 for the weight. Is this the appropriate behavior, given that it...

Use range headers when retrying S3 transfers

It'd be nice to use range headers to avoid re-downloading already-downloaded bytes when an S3 connection error happens. See #273.
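The idea can be sketched independently of any particular S3 client. In the example below, `fetch` is a hypothetical callable that takes a dict of HTTP headers and yields byte chunks, and it may raise `ConnectionError` partway through. On retry, the standard HTTP `Range: bytes=N-` header asks the server for only the bytes not yet received.

```python
def download_with_resume(fetch, max_retries=3):
    """Download via fetch(headers), resuming with a Range header on failure.

    `fetch` is a hypothetical callable: it accepts a dict of request headers
    and returns an iterable of byte chunks.
    """
    received = bytearray()
    for attempt in range(max_retries + 1):
        # Ask only for the bytes we do not have yet.
        headers = {"Range": f"bytes={len(received)}-"} if received else {}
        try:
            for chunk in fetch(headers):
                received.extend(chunk)
            return bytes(received)
        except ConnectionError:
            if attempt == max_retries:
                raise
```

A real implementation would also need to confirm that the server honored the range request (a 206 response) before appending, since a server that ignores the header sends the whole object again.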

I/O

Performance

progress bar for the file upload CLI

It'd be nice to have a progress bar show up for `civis files upload` and `civis files download`. Possibilities:

* https://pypi.python.org/pypi/tqdm
* https://pypi.python.org/pypi/progress
* https://pypi.python.org/pypi/progressbar2
* https://stackoverflow.com/questions/3173320/text-progress-bar-in-the-console
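For a sense of what the dependency-free option involves, here is a minimal text progress bar using only the standard library. The function names `render_bar` and `report` are made up for this sketch and are not part of the `civis` CLI.

```python
import sys


def render_bar(done, total, width=30):
    """Return a one-line text bar such as '[#####-----] 50%'."""
    fraction = done / total if total else 1.0
    fraction = min(max(fraction, 0.0), 1.0)
    filled = int(width * fraction)
    return "[" + "#" * filled + "-" * (width - filled) + f"] {fraction:.0%}"


def report(done, total, stream=sys.stderr):
    # The carriage return redraws the bar in place on the same console line.
    stream.write("\r" + render_bar(done, total))
    stream.flush()
```

An upload or download loop would call `report(bytes_so_far, total_bytes)` after each chunk. A library such as tqdm handles terminal width, rate estimates, and non-interactive output, which this sketch does not.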

enhancement

add multioutput support to MLPRegressor

It'd be nice to have the ability to do multioutput regression.