DOC ACSIncome vs. UCI Adult
Describe the issue linked to the documentation
This issue is intended to start a discussion about the ACSIncome dataset, which is one of five datasets created and published by Ding et al. (2021). The scope of PR #1005 is just to introduce the facts about the dataset (size, features, descriptions). However, some important questions remain, and the topics discussed here will likely become part of a future PR that augments the current documentation. Thank you to all reviewers on PR #1005 for contributing ideas that are summarized here.
ACSIncome is proposed as a replacement for UCI Adult. The paper, however, focuses on analyzing various models trained on ACSIncome and doesn't really analyze the differences between ACSIncome and UCI Adult. ACSIncome certainly adds more data (1,664,500 vs. 48,842), uses more recent data (2018 vs. 1994), and allows users to define any income threshold rather than fixing it at $50k, but some questions remain unanswered:
- Are certain groups better represented in the updated dataset?
- Are any fairness disparities larger or smaller?
- What metrics should we use to evaluate if one dataset is a better fairness benchmark?
- Insert your question here
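As a starting point for the threshold and disparity questions above, here is a minimal sketch of how one might binarize income at a chosen threshold and compare the share of positive labels per group. It uses made-up toy records rather than the real dataset; the field names `PINCP` and `SEX` follow the ACS PUMS column names, and the helpers `positive_rate_by_group` and `rate_gap` are hypothetical names introduced only for illustration. The gap in positive-label rates is just one candidate metric, not a recommendation.

```python
from collections import defaultdict

# Hypothetical toy records (values are invented for illustration).
# PINCP = personal income, SEX = 1 (male) / 2 (female), as in ACS PUMS.
records = [
    {"SEX": 1, "PINCP": 72000},
    {"SEX": 1, "PINCP": 41000},
    {"SEX": 1, "PINCP": 55000},
    {"SEX": 2, "PINCP": 38000},
    {"SEX": 2, "PINCP": 61000},
    {"SEX": 2, "PINCP": 29000},
]


def positive_rate_by_group(rows, threshold, group_key="SEX", income_key="PINCP"):
    """Share of rows in each group whose income exceeds `threshold`."""
    counts = defaultdict(int)
    positives = defaultdict(int)
    for row in rows:
        counts[row[group_key]] += 1
        positives[row[group_key]] += row[income_key] > threshold
    return {group: positives[group] / counts[group] for group in counts}


def rate_gap(rates):
    """Largest minus smallest positive-label rate across groups."""
    return max(rates.values()) - min(rates.values())


# The same data can look more or less skewed depending on the threshold.
rates_50k = positive_rate_by_group(records, 50_000)
rates_40k = positive_rate_by_group(records, 40_000)
print(rates_50k, rate_gap(rates_50k))
print(rates_40k, rate_gap(rates_40k))
```

Running the same comparison on both ACSIncome and UCI Adult, at the same threshold, would be one way to check whether label disparities grew or shrank between the two datasets.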
Another important question that was mentioned in PR #1005: Under what circumstances would Fairlearn want to recommend this dataset as a go-to benchmark for evaluating unfairness mitigation techniques? Some points that were discussed include:
Pros: The 2018 US Census dataset captures examples of unfairness that provide opportunities to test unfairness mitigation techniques. For instance, across all occupations, men worked 14% more hours than women on average (40.79 hours for men vs. 35.67 hours for women; see ACSIncome_hours_worked_distribution_by_sex.pdf). However, men were paid 52% more on average. Comparing medians gives similar results.
To hold occupation constant, we can examine the most popular occupation code, 2310, which corresponds to EDU-Elementary And Middle School Teachers in the data dictionary. For this subset, the distributions of age and hours worked are very similar for males and females, yet males were paid 18% more on average.
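The comparisons above can be sketched as follows. The first part only reproduces the arithmetic behind the hours figure quoted above; the second part shows, on invented toy rows, how one might restrict the comparison to a single occupation code. The field names `WKHP`, `OCCP`, `PINCP`, and `SEX` follow the ACS PUMS column names, while `relative_gap`, `mean_by_sex`, and the toy values are assumptions made for illustration, not the analysis from the notebook.

```python
from statistics import mean


def relative_gap(a, b):
    """How much larger `a` is than `b`, as a fraction of `b`."""
    return a / b - 1


# Arithmetic check on the means quoted above: 40.79 vs. 35.67 hours.
hours_gap = relative_gap(40.79, 35.67)
print(f"{hours_gap:.1%}")  # 14.4%


def mean_by_sex(rows, field, occupation=None):
    """Mean of `field` per SEX value, optionally for one OCCP code only."""
    groups = {}
    for row in rows:
        if occupation is not None and row["OCCP"] != occupation:
            continue
        groups.setdefault(row["SEX"], []).append(row[field])
    return {sex: mean(values) for sex, values in groups.items()}


# Hypothetical toy rows (invented values); 4720 is an arbitrary second code.
rows = [
    {"SEX": 1, "OCCP": 2310, "WKHP": 40, "PINCP": 60000},
    {"SEX": 2, "OCCP": 2310, "WKHP": 40, "PINCP": 50000},
    {"SEX": 1, "OCCP": 4720, "WKHP": 30, "PINCP": 20000},
    {"SEX": 2, "OCCP": 4720, "WKHP": 20, "PINCP": 15000},
]

# Hold occupation constant: same hours in this toy subset, different pay.
teacher_hours = mean_by_sex(rows, "WKHP", occupation=2310)
teacher_income = mean_by_sex(rows, "PINCP", occupation=2310)
teacher_pay_gap = relative_gap(teacher_income[1], teacher_income[2])
print(teacher_hours, f"{teacher_pay_gap:.0%}")
```

Swapping `mean` for `statistics.median` gives the median-based version of the same comparison.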
Disparities across racial groups could also be examined.
Cons: The task definition may not be that realistic, so it may be difficult to draw meaningful conclusions. Say we train (and mitigate) a model that predicts a person's income based on features like age, hours worked per week, sex, education level, etc. When would this model be used? Certainly not in a hiring process. Many processes (SNAP, insurance, obtaining a loan) require you to state and verify your income; they're not looking to guess it with a model that may or may not propagate unfairness. Maybe this information is useful for advertising? Perhaps a company would advertise its high-end products to you if the model predicts that you have a high income?
Looking forward to discussions and thanks again to everyone who contributed ideas 🙂
For reference, this notebook started some of the data analysis for ACSIncome. I encourage others to analyze the dataset further and incorporate any findings into the documentation :)