Danica J. Sutherland
This may leak a little memory, and it definitely holds the GIL for no real reason.
I think in my pre-pummeler attempt at this I did `sign(x) * log(x + 1*sign(x))` or something. `log(x - min(x))` isn't shaped very nicely if `min(x)` is, say, -915,729,293.
I was a little off before: what I want is `sign(x) * log( |x| + 1 )`, which maintains both sign information and magnitude information. Doing `log(x - min(x) +...
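For reference, the symmetric-log transform above can be written with `np.log1p`, which is also more accurate for small `|x|`; this is just a sketch of the formula, not the pummeler code itself:

```python
import numpy as np

def signed_log1p(x):
    # sign(x) * log(|x| + 1): odd, monotone, and keeps both the sign
    # and the magnitude information, unlike log(x - min(x)) which is
    # badly shaped when min(x) is a huge negative value.
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(np.abs(x))
```

It maps 0 to 0 and behaves like `log(|x|)` for large `|x|` in either direction.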
IIRC `RACNUM` is the flag for how many racial groups the person has indicated, with `RAC1P` the first race, `RAC2P` the second, etc.
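Under that interpretation, `RACNUM` should just equal the number of filled-in `RAC*P` fields per row; a tiny synthetic illustration (made-up values, not actual PUMS data):

```python
import pandas as pd

# Synthetic example of the RACNUM / RAC*P relationship described above:
# RACNUM = how many RAC*P columns are non-blank for each person.
df = pd.DataFrame({
    "RAC1P": [1, 2, 1],       # first indicated race
    "RAC2P": [pd.NA, 3, pd.NA],  # second indicated race, blank if none
})
df["RACNUM"] = df[["RAC1P", "RAC2P"]].notna().sum(axis=1)
```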
Seems like in this case:

- feather is *way* faster to load but also bigger on disk
- parquet is slightly smaller than hdf5 and way faster to load

So...
With the new two-pass scheme with the merge at the end, the state merger is fast, but puma merger is quite slow. Not sure whether this is due to casting...
Could also multiprocess the merging.
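A minimal sketch of what that could look like, assuming the per-puma merge is an associative pairwise operation (here stubbed out as addition; the real merge op and the `multiprocessing.Pool` approach are my assumptions, not what pummeler currently does):

```python
from multiprocessing import Pool

def merge_pair(pair):
    # Placeholder for the real per-puma stats merge; must be associative.
    a, b = pair
    return a + b

def parallel_merge(items, processes=2):
    # Tree-style reduction: merge disjoint pairs in parallel each round
    # until one merged result remains.
    with Pool(processes) as pool:
        while len(items) > 1:
            pairs = list(zip(items[::2], items[1::2]))
            leftover = [items[-1]] if len(items) % 2 else []
            items = pool.map(merge_pair, pairs) + leftover
    return items[0]
```

This does O(log n) rounds instead of a serial fold, so it only helps if each merge is expensive enough to pay for the process overhead.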
One thing that might help is to call MKL straight instead of through numpy: gemm would save a dot-then-scale, sincos is much faster than np.sin plus np.cos (https://github.com/numpy/numpy/issues/2626#issuecomment-235785553), probably some...
- CITWP, YOEP, JWMNP: mean-coding blanks might not be the right thing, since blank means the person was born in the US / doesn't work
- MLP\* (when served in...
Hmm, you're right. It looks like it's just in the notebook, the code itself doesn't have any rescaling options. I'll fix and rerun the notebook tonight.