pummeler
Utilities to analyze ACS PUMS files, especially for distribution regression / ecological inference
Seems like maybe pandas/pytables append is a lot slower than writing into a new file. (Or else the rewriting-when-strings-are-longer code is hitting a lot.) The sort step should probably pre-count...
This has the strange effect that, e.g., the mean standardized `PINCP` across the US is `-0.16`. Probably not a huge deal, but still.
Seems like this might be a decent use-case for dask.
- `MIGPUMA` has joint meaning with `MIGSP`; same for `POWPUMA`/`POWSP`.
- Why does `RELP` come up so much in the ridge models? What does it mean in practice?
Using 100 KDE features and all the categorical variables, I end up with a dataset that's `840x6578` so I'm inclined to do ridge regression. I tried to implement it in...
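For a wide dataset like this one (far fewer rows than columns), ridge regression can be solved through its dual form, which needs only an n x n linear system instead of a p x p one. The following is a minimal NumPy sketch of that idea, not the code used in this repository; `ridge_fit`, `ridge_predict`, `alpha`, and the synthetic demo data are all hypothetical names and values chosen for illustration.

```python
import numpy as np


def ridge_fit(X, y, alpha=1.0):
    """Fit ridge regression with an unpenalized intercept.

    Uses the dual form, solving an n x n system rather than a p x p one,
    which is cheaper when there are fewer rows than columns.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    # Center so the intercept is not penalized.
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    yc = y - y_mean

    # Dual solution: coef = Xc^T (Xc Xc^T + alpha I)^{-1} yc
    n = Xc.shape[0]
    gram = Xc @ Xc.T
    dual = np.linalg.solve(gram + alpha * np.eye(n), yc)
    coef = Xc.T @ dual

    intercept = y_mean - x_mean @ coef
    return coef, intercept


def ridge_predict(X, coef, intercept):
    """Predict with a fitted ridge model."""
    return np.asarray(X, dtype=float) @ coef + intercept


# Synthetic demo data (illustrative only): 40 rows, 200 columns.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 200))
true_coef = np.zeros(200)
true_coef[:5] = 1.0
y_demo = X_demo @ true_coef + 0.1 * rng.normal(size=40)

coef, intercept = ridge_fit(X_demo, y_demo, alpha=1.0)
preds = ridge_predict(X_demo, coef, intercept)
```

The dual and primal forms give the same coefficients, so the choice between them is purely about which system is smaller. In practice `alpha` would be chosen by cross-validation.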
Here are the variables I think we should log transform, all representing income/wages/etc.:

    VERSIONS = { ...
        'log_transform_feats': '''INTP OIP PAP RETP SEMP SSIP SSP WAGP PERNP PINCP'''.split(),

Only issue is...
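One general caveat with log-transforming dollar amounts is that a plain logarithm is undefined at zero and for negative values. If any of the listed fields can take such values, a signed `log1p` is one common workaround. This is only a sketch under that assumption, not the transform this package applies; `signed_log1p` and `signed_expm1` are hypothetical helper names.

```python
import numpy as np

# Feature names copied from the list above.
LOG_TRANSFORM_FEATS = "INTP OIP PAP RETP SEMP SSIP SSP WAGP PERNP PINCP".split()


def signed_log1p(values):
    """Return sign(x) * log(1 + |x|), which is defined for all real x."""
    values = np.asarray(values, dtype=float)
    return np.sign(values) * np.log1p(np.abs(values))


def signed_expm1(values):
    """Invert signed_log1p: sign(z) * (exp(|z|) - 1)."""
    values = np.asarray(values, dtype=float)
    return np.sign(values) * np.expm1(np.abs(values))


# Illustrative amounts only, including a loss and a zero.
amounts = np.array([-5000.0, 0.0, 1200.0, 85000.0])
transformed = signed_log1p(amounts)
recovered = signed_expm1(transformed)
```

The transform is odd-symmetric and maps zero to zero, so losses and gains of the same size end up equally far from the origin.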
The old Cython featurizer only took two minutes on low1 once dummies had been created; this new one takes two hours. Dunno how long dummies took, but not two hours....
- [ ] Put instructions in about getting the CQ Press data file
- [x] Make it work with HuffPo results
I've been able to get the `sort` program running, returning the feature values for each region in parquet files (the package doesn't work when selecting h5 as the output file...