taxdata Fix PUF SOI estimates

This PR addresses issue #399.

The updatesoi.py file is used to automatically update our SOI estimates, but the range of indices used to add up total wages for those with AGI greater than $1 million was off and excluded some of the data, as @donboyd5 figured out. This PR fixes that and adds a check to updatesoi.py to prevent an issue like this from going undetected in the future.
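
For readers unfamiliar with this kind of bug, here is a minimal sketch of an index range that drops data, and of a guard that would catch it. It is an illustration only, not the code in updatesoi.py: the bin values, `TOP_START`, and `sum_bins` are all hypothetical.

```python
# Hypothetical wage totals by AGI bin; the values are made up for illustration.
wages_by_agi_bin = [10.0, 20.0, 30.0, 40.0, 50.0]
TOP_START = 2  # assumed index of the first bin in the top AGI group


def sum_bins(values, start, stop=None):
    """Sum values[start:stop]; stop=None runs through the last bin."""
    return sum(values[start:stop])


# Off by one: Python slices exclude the stop index, so the last bin is dropped.
buggy_total = sum_bins(wages_by_agi_bin, TOP_START, len(wages_by_agi_bin) - 1)
fixed_total = sum_bins(wages_by_agi_bin, TOP_START)

# Guard: the group totals must add back up to the total over all bins.
rest_total = sum_bins(wages_by_agi_bin, 0, TOP_START)
all_total = sum(wages_by_agi_bin)
assert abs(rest_total + fixed_total - all_total) < 1e-9
```

The same guard fails with `buggy_total` in place of `fixed_total`, which is the point of adding a check like this.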

The bug affected our SOI estimates for 2015-2017. They have all been fixed in this PR.

Jan 10 '22 17:01 andersonfrailey

This PR is just about done, but with the changes there's a big increase in tax liability for 2030 and 2031 that I can't explain. I've attached a table comparing the tax liabilities that were found with taxcalc 3.2.1. [image: Screen Shot 2022-01-10 at 2 58 31 PM] https://user-images.githubusercontent.com/20684675/148838484-adc94792-cae2-45ff-b5f8-ca7580441885.png

Jan 10 '22 20:01 andersonfrailey

Would it be hard to construct a table for two years, each with total wages by an income classifier (e.g., AGI range) pre-PR and with PR, one for (a) the year before the first surprising year (i.e., 2029), and one for the first surprising year (2030)?

For example, tables a and b would have stubs such as:

| income range | wages pre-PR | wages PR | change |
| --- | --- | --- | --- |
| <= $0 | | | |
| ... | | | |
| $1m-10m | | | |
| $10m+ | | | |
| sum | | | |

We'd then see (1) which income ranges are causing the problem, and (2) how those income ranges changed between 2029 and 2030. Of course it could be something else, but this seems like the most likely suspect.
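
A rough sketch of how such a table could be built, assuming each record is an `(agi, wages, weight)` tuple. The bin edges, labels, function names, and example records are all made up for illustration and are not taken from taxdata.

```python
from bisect import bisect_left

# Assumed AGI bin edges (dollars) and labels; the real stubs may differ.
EDGES = [0, 1_000_000, 10_000_000]
LABELS = ["<= $0", "$0-1m", "$1m-10m", "$10m+"]


def wages_by_bin(records):
    """Weighted wages per AGI bin for (agi, wages, weight) records."""
    totals = [0.0] * len(LABELS)
    for agi, wages, weight in records:
        # bisect_left puts agi <= edge in the bin that ends at that edge.
        totals[bisect_left(EDGES, agi)] += wages * weight
    return totals


def compare(pre_records, post_records):
    """Rows of (label, pre-PR wages, PR wages, change), plus a sum row."""
    pre, post = wages_by_bin(pre_records), wages_by_bin(post_records)
    rows = [(lab, a, b, b - a) for lab, a, b in zip(LABELS, pre, post)]
    rows.append(("sum", sum(pre), sum(post), sum(post) - sum(pre)))
    return rows


# Made-up records: same returns, different weights pre-PR and with the PR.
pre = [(-5, 1_000, 1.0), (500_000, 50_000, 100.0),
       (2_000_000, 900_000, 2.0), (20_000_000, 6_000_000, 1.0)]
post = [(-5, 1_000, 1.0), (500_000, 50_000, 100.0),
        (2_000_000, 900_000, 2.0), (20_000_000, 6_000_000, 1.5)]

table = compare(pre, post)
for row in table:
    print(row)
```

Running this once for 2029 and once for 2030 would give tables a and b.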

If we verify that the issue is caused by changes in wages, and we figure out which income ranges are at work (probably the top 2 or 3), it would then make sense to look at growfactors moving from 2029 to 2030, and at wage targets for 2029 and 2030. My guess (uninformed) is that there is something surprising about the targets in 2029, but it could be growfactors. Of course it could be something else entirely, such as the number of returns with wages (again, the same sort of breakdowns with counts would be helpful), or even something out of left field, but I'd suggest looking at wages first.

Don

Jan 10 '22 21:01 donboyd5

Deleted my last comment. I realized there was an issue with how I was looking at the data when I was making those charts.

Jan 12 '22 00:01 andersonfrailey

@andersonfrailey Is this PR ready for review? Last comment made that unclear. Also, could be helpful to produce tables Don suggested to aid in review.

Mar 01 '22 20:03 jdebacker

@jdebacker, I haven't been able to fix the issue with tax revenues jumping up in the last few years yet, but I'd definitely be open to others seeing if they get the same result when they run the changes in this PR. Spring break starts tomorrow so I should be able to work on those tables Don suggested this week!

Mar 04 '22 02:03 andersonfrailey

Great - thanks for the update!

Mar 04 '22 02:03 jdebacker

@andersonfrailey, I ran make puf-files with this PR and hit several ITERATION_LIMIT and INFEASIBLE terminations during stage 2. Is this to be expected? Terminal output in this gist.

I'm going to push forward to replicate your revenue table and then create Don's suggested tables in the next few days, but thought I should check in on this. Thanks!

Mar 30 '22 23:03 MattHJensen

A few thoughts:

2016:

it hit the iteration limit of 100
obj function is NaN -- not sure what to make of this, but note that the dual objective is huge; perhaps that is what it always reports when the limit is hit; worth knowing
the targets were satisfied (the percentile values for ratio of calculated to desired targets are all approximately 1)
the ratios of new weights to old weights ranged from 0.69 to 1.86
but the computer code says the tolerance around the weights was 0.30 in 2016, meaning we'd want the ratio to range from 0.70 to 1.30; thus, it was only able to hit the targets by making some weights larger than it was told to make them
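
The weight check described in the last two points can be sketched as below. This is only an illustration under assumed inputs: `ratio_report` is a hypothetical helper, and the weights are made up so that the ratios span 0.69 to 1.86 as in the 2016 output.

```python
def ratio_report(old_weights, new_weights, tol):
    """Summarize new/old weight ratios against the band [1 - tol, 1 + tol]."""
    lo, hi = 1.0 - tol, 1.0 + tol
    ratios = [new / old for old, new in zip(old_weights, new_weights)]
    return {
        "min": min(ratios),
        "max": max(ratios),
        "n_below": sum(1 for r in ratios if r < lo),
        "n_above": sum(1 for r in ratios if r > hi),
    }


# Made-up weights; tol=0.30 is the 2016 tolerance mentioned above.
old = [100.0, 100.0, 100.0, 100.0]
new = [69.0, 100.0, 120.0, 186.0]
report = ratio_report(old, new, tol=0.30)
print(report)
```

Any nonzero `n_below` or `n_above` means the solver left the band it was given.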

I doubt that raising the iteration limit would solve this
you might find this solution acceptable and might consider increasing the tolerance around the weights (e.g., +/- 0.50 or something like that, or make it asymmetric (with a coding change))
or you might examine the targets - they may be very hard to hit (and perhaps unreasonable - or unreasonable in relation to the data, meaning the data might be unreasonable in some fashion)
it might be possible to investigate by looking at the target values you get using the 2016 growfactors and the starting weights (I think these would be the 2015 weights, but they might be 2011, I don't remember which the code uses - base year or prior year) and seeing if some are just way different from the 2016 targets - checking to see which variables are way off; the ones that are way off might have a bad 2016 growfactor or a bad 2016 target, or it may just be that the world changed in a way that is hard for a grown 2011 file to hit
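
A minimal sketch of that investigation, assuming one cumulative growfactor per variable and records stored as dicts of base-year amounts. The function name, data layout, and numbers are hypothetical; the actual stage 2 code may organize things differently.

```python
def target_gaps(records, weights, growfactors, targets):
    """Ratio of the grown, starting-weight total to the target, per variable."""
    gaps = {}
    for var, target in targets.items():
        grown = sum(w * rec[var] * growfactors[var]
                    for rec, w in zip(records, weights))
        gaps[var] = grown / target
    return gaps


# Made-up example with two records and two targeted variables.
records = [{"wages": 100.0, "interest": 10.0},
           {"wages": 200.0, "interest": 0.0}]
weights = [1.0, 2.0]                            # starting weights (assumed)
growfactors = {"wages": 1.5, "interest": 1.0}   # cumulative factors (assumed)
targets = {"wages": 750.0, "interest": 20.0}

gaps = target_gaps(records, weights, growfactors, targets)
print(gaps)
```

Variables whose ratio is far from 1 before any reweighting are the ones to inspect for a bad growfactor or a bad target.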

2029

iteration limit of 100 was hit
objective is NaN -- not sure what to make of this, but note that the dual objective is huge
targets were hit
weight ratios were almost in range -- 0.54 to 1.47 -- didn't quite make +/- 0.45
but notice the strange distribution -- almost all of them were driven down by ~0.45 or up by 0.45, so it is clear that the constraints can only be hit by jerking the weights around
again I would investigate the constraints, looking for one (or more) of three possibilities: (1) bad (implausible) targets, (2) bad data resulting from bad growfactors, or (3) plausible targets and plausible growfactors, but the world (targets) changed in ways that our very old data has a hard time mimicking
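
One way to put a number on that strange distribution is the share of ratios sitting at or near the bounds. The sketch below is an assumption-laden illustration: `share_at_bounds`, the `eps` window, and the ratios are all made up.

```python
def share_at_bounds(ratios, tol, eps=0.02):
    """Fraction of weight ratios within eps of 1 - tol or 1 + tol."""
    lo, hi = 1.0 - tol, 1.0 + tol
    near = sum(1 for r in ratios
               if abs(r - lo) <= eps or abs(r - hi) <= eps)
    return near / len(ratios)


# Made-up ratios; tol=0.45 is the 2029 tolerance mentioned above.
ratios = [0.55, 0.56, 1.44, 1.45, 1.0]
share = share_at_bounds(ratios, tol=0.45)
print(share)
```

A share close to 1 suggests the constraints can only be met by pushing nearly every weight to its limit.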

2030 (shown) and 2031

Things are really out of hand at this point and the constraints are not feasible.

I haven't looked at the targets but my intuition would be to look at the way the correction to the wage targets was implemented to see if the full set of new targets is internally inconsistent. This might be the underlying issue, and perhaps also causing undesirable results in other years, too, even though no flags were raised.

Mar 31 '22 23:03 donboyd5

@MattHJensen Inconsistent targets could arise if, for example:

You had a wage target for each agi range, PLUS a total wage target (which would not be a great idea, because it would be redundant), AND the sum of the wage targets by range was not consistent with the total wage target, or
You had a wage target for each agi range, plus total targets for other income variables, plus a target for total income (e.g., agi), AND the sum of all the income targets was not consistent with the total income target

There are other ways to be inconsistent, too, but these are the obvious ones.
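
A quick check for the first kind of inconsistency could look like the sketch below. The helper name, tolerance, and target values are hypothetical, and this only tests whether by-range targets add up to a total target; it says nothing about how the targets are actually stored in taxdata.

```python
def check_total(parts, total, rel_tol=1e-6):
    """Return (is_inconsistent, gap) for by-range targets vs. a total target."""
    gap = sum(parts) - total
    return abs(gap) > rel_tol * abs(total), gap


# Made-up wage targets by AGI range, checked against two candidate totals.
wage_targets_by_range = [100.0, 250.0, 400.0]

bad, bad_gap = check_total(wage_targets_by_range, total=800.0)
ok, ok_gap = check_total(wage_targets_by_range, total=750.0)
print(bad, bad_gap, ok, ok_gap)
```

If both sets of targets are imposed as constraints on the same weights and the gap is larger than the constraints allow, no set of weights can satisfy them all, which would be consistent with an infeasible termination.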

Mar 31 '22 23:03 donboyd5

@MattHJensen Can you test this branch again with the latest changes and see if you still get that error with iteration limits?

Apr 24 '23 17:04 jdebacker

An update on this branch. I've tested this with the latest versions of Julia and Tulip and still get an iteration limit hit in 2029 and "infeasible" after that.

Will look into targets in more detail next, but was hoping a new solver would do the trick...

cc @andersonfrailey @donboyd5 @MattHJensen

May 30 '23 13:05 jdebacker