Weighted distribution across multiple variables
Could be a useful addition to your library. As an example, I'm interested in getting stats on race and gender in a group over time. Something like:
data_by_year = data.groupby(['year'])
race_gender_demographics = calc.distribution(data_by_year, ['race', 'gender']).round(3)
Hi @soooh! You should actually be able to do this with the current code. It'll depend on what, exactly, you're looking to calculate. But let's say you're looking for the weighted distribution of race, by gender and over time. In that case, this should work:
grouped = data.groupby([ "year", "gender" ])
dist = calc.distribution(grouped, "race").round(3)
Does that work? Are you aiming for something slightly different?
Ah, so what that does is give me the racial demographics of women and men separately. E.g., of all the women, 20% are white, 10% are black, and so on. What I want is something like, 10% of the group is white men, 8% white women, etc. Does that make sense?
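To make the distinction concrete, here is a minimal sketch in plain pandas (it does not use the library's API). The data is made up, and the `weight` column name is an assumption for illustration. `conditional` is the share of each race within each year and gender, while `joint` is the share of each gender and race combination within the whole year.

```python
import pandas as pd

# Made-up rows; "weight" is a hypothetical column of sampling weights.
data = pd.DataFrame({
    "year":   [2020, 2020, 2020, 2020],
    "gender": ["woman", "woman", "man", "man"],
    "race":   ["white", "black", "white", "black"],
    "weight": [2.0, 2.0, 1.0, 3.0],
})

# Total weight in each (year, gender, race) cell.
cell = data.groupby(["year", "gender", "race"])["weight"].sum()

# Conditional: share of each race *within* each (year, gender) group.
conditional = cell / cell.groupby(level=["year", "gender"]).transform("sum")

# Joint: share of each (gender, race) combination within the whole year.
joint = cell / cell.groupby(level="year").transform("sum")

print(conditional.round(3))
print(joint.round(3))
```

In `conditional`, the shares sum to 1 within each gender; in `joint`, they sum to 1 across the whole year, which is the figure being asked for here.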
> Ah, so what that does is give me the racial demographics of women and men separately. E.g., of all the women, 20% are white, 10% are black, and so on.
Yep.
> What I want is something like, 10% of the group is white men, 8% white women, etc. Does that make sense?
Ah, sounds like I misunderstood the goal. In that case, the easiest way might be like so:
data["race_x_gender"] = data[[ "race", "gender" ]].apply(" x ".join, axis=1)
dist = calc.distribution(data.groupby("year"), "race_x_gender").round(3)
Does that achieve your goal? (It assumes that race and gender are strings.)
I'll also think about ways I could incorporate a generic feature like this into the library itself. Thanks for the suggestion!
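As a rough idea of what a generic version could look like, here is a sketch written in plain pandas. The function `joint_distribution`, its parameters, and the `weight` column name are all hypothetical; none of them are part of the library. It assumes the weights are numeric and non-negative.

```python
import pandas as pd

def joint_distribution(df, group_cols, value_cols, weight_col):
    """Weighted share of each combination of value_cols within each group.

    Hypothetical helper for illustration; not part of the library.
    """
    group_cols = list(group_cols)
    value_cols = list(value_cols)
    # Sum the weights for every (group, value-combination) cell.
    cell = df.groupby(group_cols + value_cols)[weight_col].sum()
    # Total weight of the group each cell belongs to.
    totals = cell.groupby(level=group_cols).transform("sum")
    # One row per group, one column per value combination;
    # combinations absent from a group get a share of 0.
    return (cell / totals).unstack(value_cols, fill_value=0)

# Made-up example data.
data = pd.DataFrame({
    "year":   [2020, 2020, 2020, 2020, 2021, 2021],
    "gender": ["woman", "woman", "man", "man", "woman", "man"],
    "race":   ["white", "black", "white", "black", "white", "white"],
    "weight": [2.0, 2.0, 1.0, 3.0, 1.0, 3.0],
})

dist = joint_distribution(data, ["year"], ["race", "gender"], "weight").round(3)
print(dist)
```

Unlike the string-joining approach, this keeps `race` and `gender` as separate column levels and does not require them to be strings.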
Ah yes, that is actually what I am doing! 😄
I thought it could be a useful feature, though, which is why I suggested it.