weightedcalcs Weighted distribution across multiple variables

Could be a useful addition to your library. As an example, I'm interested in getting stats on race and gender in a group over time. Something like:

data_by_year = data.groupby(['year'])
race_gender_demographics = calc.distribution(data_by_year, ['race', 'gender']).round(3)

Apr 13 '17 15:04 soooh

Hi @soooh! You should actually be able to do this with the current code. It'll depend on what, exactly, you're looking to calculate. But let's say you're looking for the weighted distribution of race, by gender and over time. In that case, this should work:

grouped = data.groupby([ "year", "gender" ])
dist = calc.distribution(grouped, "race").round(3)

Does that work? Are you aiming for something slightly different?

Apr 13 '17 16:04 jsvine

Ah, so what that does is give me the racial demographics of women and men separately. E.g., of all the women, 20% are white, 10% are black, and so on. What I want is something like, 10% of the group is white men, 8% white women, etc. Does that make sense?

Apr 13 '17 16:04 soooh

> Ah, so what that does is give me the racial demographics of women and men separately. E.g., of all the women, 20% are white, 10% are black, and so on.

Yep.

> What I want is something like, 10% of the group is white men, 8% white women, etc. Does that make sense?

Ah, sounds like I misunderstood the goal. In that case, the easiest way might be like so:

data["race_x_gender"] = data[[ "race", "gender" ]].apply(" x ".join, axis=1)
dist = calc.distribution(data.groupby("year"), "race_x_gender").round(3)

Does that achieve your goal? (It assumes that race and gender are strings.)
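
For what it's worth, here is a minimal plain-Python sketch of what the combined-column approach should compute: each race/gender pair's share of the total weight within a year. It is not weightedcalcs code. The sample rows are made up, and `weighted_joint_distribution` is a hypothetical helper named only for this illustration.

```python
from collections import defaultdict

# Made-up sample rows: (year, race, gender, weight). Purely illustrative.
rows = [
    (2016, "white", "man", 2.0),
    (2016, "white", "woman", 1.0),
    (2016, "black", "woman", 1.0),
    (2017, "white", "man", 1.0),
    (2017, "black", "man", 3.0),
]


def weighted_joint_distribution(rows):
    """Return each (year, "race x gender") cell's share of that year's total weight."""
    totals = defaultdict(float)  # year -> total weight in that year
    cells = defaultdict(float)   # (year, "race x gender") -> summed weight
    for year, race, gender, weight in rows:
        totals[year] += weight
        cells[(year, f"{race} x {gender}")] += weight
    # Divide by the year's total, so the shares within a year sum to 1.
    return {key: round(w / totals[key[0]], 3) for key, w in cells.items()}


dist = weighted_joint_distribution(rows)
for (year, label), share in sorted(dist.items()):
    print(year, label, share)
```

Grouping by both year and gender, as in my first suggestion, would instead divide each cell by the total weight for that year and gender, which is why it gave the breakdown of women and men separately.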

I'll also think about ways I could incorporate a generic feature like this into the library itself. Thanks for the suggestion!

Apr 13 '17 16:04 jsvine

Ah yes, that is actually what I am doing! 😄
I thought it could be a useful feature, though, which is why I suggested it.

Apr 13 '17 16:04 soooh