Danica J. Sutherland
This may leak a little memory, and it definitely holds the GIL for no real reason.
I think in my pre-pummeler attempt at this I did `sign(x) * log(x + 1*sign(x))` or something. `log(x - min(x))` isn't shaped very nicely if `min(x)` is, say, -915,729,293.
I was a little off before: what I want is `sign(x) * log( |x| + 1 )`, which maintains both sign information and magnitude information. Doing `log(x - min(x) +...
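For reference, the symmetric-log transform above can be written with `np.log1p`, which is also more accurate for small `|x|`; this is just a sketch of the formula, not the pummeler code itself:

```python
import numpy as np

def signed_log1p(x):
    # sign(x) * log(|x| + 1): odd, monotone, and keeps both the sign
    # and the magnitude information, unlike log(x - min(x)) which is
    # badly shaped when min(x) is a huge negative value.
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(np.abs(x))
```

It maps 0 to 0 and behaves like `log(|x|)` for large `|x|` in either direction.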
IIRC `RACNUM` is the flag for how many racial groups the person has indicated, with `RAC1P` the first race, `RAC2P` the second, etc.
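Under that interpretation, `RACNUM` should just equal the number of filled-in `RAC*P` fields per row; a tiny synthetic illustration (made-up values, not actual PUMS data):

```python
import pandas as pd

# Synthetic example of the RACNUM / RAC*P relationship described above:
# RACNUM = how many RAC*P columns are non-blank for each person.
df = pd.DataFrame({
    "RAC1P": [1, 2, 1],       # first indicated race
    "RAC2P": [pd.NA, 3, pd.NA],  # second indicated race, blank if none
})
df["RACNUM"] = df[["RAC1P", "RAC2P"]].notna().sum(axis=1)
```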
Seems like in this case:

- feather is *way* faster to load but also bigger on disk
- parquet is slightly smaller than hdf5 and way faster to load

So...
With the new two-pass scheme with the merge at the end, the state merger is fast, but puma merger is quite slow. Not sure whether this is due to casting...
Could also multiprocess the merging.
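A minimal sketch of what that could look like, assuming the per-puma merge is an associative pairwise operation (here stubbed out as addition; the real merge op and the `multiprocessing.Pool` approach are my assumptions, not what pummeler currently does):

```python
from multiprocessing import Pool

def merge_pair(pair):
    # Placeholder for the real per-puma stats merge; must be associative.
    a, b = pair
    return a + b

def parallel_merge(items, processes=2):
    # Tree-style reduction: merge disjoint pairs in parallel each round
    # until one merged result remains.
    with Pool(processes) as pool:
        while len(items) > 1:
            pairs = list(zip(items[::2], items[1::2]))
            leftover = [items[-1]] if len(items) % 2 else []
            items = pool.map(merge_pair, pairs) + leftover
    return items[0]
```

This does O(log n) rounds instead of a serial fold, so it only helps if each merge is expensive enough to pay for the process overhead.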
One thing that might help is to call MKL straight instead of through numpy: gemm would save a dot-then-scale, sincos is much faster than np.sin plus np.cos (https://github.com/numpy/numpy/issues/2626#issuecomment-235785553), probably some...
- CITWP, YOEP, JWMNP: mean-coding blanks might not be the right thing, since blank means the person was born in the US / doesn't work
- MLP\* (when served in...
Hmm, you're right. It looks like it's just in the notebook, the code itself doesn't have any rescaling options. I'll fix and rerun the notebook tonight.