Using reverted changesets for model training
Per text with @batpad,
Changeset comment has revert
There are a total of 13,125 changesets on osmcha with revert in the changeset comment. Interestingly, 2,505 (19%) of these are one-feature modification changesets, which is the kind of changeset we use in the latest version of Gabbar.
Assuming mappers revert a problematic or wrong feature in these one-feature modification changesets, this could be an additional dataset for the current iteration of Gabbar's feature-level classifier. I manually :eyes: a couple of these changesets and they are definitely what we want to catch with Gabbar.
- https://osmcha.mapbox.com/49465923
- https://osmcha.mapbox.com/49442894
Changesets from revert user accounts
Mappers and DWG sometimes maintain a separate account for reverts. Changesets from these accounts will be interesting to look at as well. Ex:
- https://www.openstreetmap.org/user/SomeoneElse_Revert/history
cc: @anandthakker @geohacker
Found 2604 reverting changesets that have a GeoJSON version available in real changesets. Assuming every feature in a changeset with revert in its comment corrects a harmful change, I look up the changeset ID of the previous version of each feature in these changesets. Ex:
- Changeset 49692400 has the changeset comment: "Reverting 49662828 by Demo15_15"
- There is one feature deleted in this changeset, Node: 4924043021
- The changeset that previously touched this feature is 49662828
- This is the potentially problematic edit we are interested in
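The lookup above can be sketched roughly as follows. This is only an illustration, not Gabbar's actual code: the `history` mapping, the `previous_changesets` helper, and the version numbers in the toy data are all hypothetical, and real feature histories would have to come from the OSM API or a history dump.

```python
# Hypothetical in-memory feature histories: feature ID -> list of
# (version, changeset_id) tuples. The version numbers are made up for
# illustration; only the IDs come from the example in this thread.
def previous_changesets(reverting_features, history):
    """Return changeset IDs that touched each feature just before the revert.

    reverting_features: list of (feature_id, version) pairs as they appear
    in the reverting changeset.
    """
    suspects = set()
    for feature_id, version in reverting_features:
        versions = dict(history.get(feature_id, []))
        prev = versions.get(version - 1)  # changeset of the previous version
        if prev is not None:
            suspects.add(prev)
    return suspects


history = {
    "node/4924043021": [(1, 49662828), (2, 49692400)],
}

# Changeset 49692400 deleted node 4924043021 (version 2 in this toy history).
suspects = previous_changesets([("node/4924043021", 2)], history)
print(sorted(suspects))
```

Collecting the results in a set keeps the list of potentially problematic changesets unique when several reverted features point back to the same changeset.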
Following this workflow, I find a list of 14,062 unique changesets. Ideally, this is a list of changesets that had a problematic feature which was later reverted. The next step was to see what percentage of these are recent (say, from 2017) and have a real changesets version, so that we can use them as part of the training/validation dataset in Gabbar.
- There are 10,691 (76%) potentially problematic changesets that have real changeset.
@manoharuss @krishnanammala, need your help here. Can you randomly :eyes: about 100 changesets from this list to see what percentage of the 100 are problematic? This will help us understand what to expect and whether this can be used as a training dataset in Gabbar.
- https://gist.github.com/bkowshik/bddc87ca4dd74c37d9ae097985a28edc
Next actions
- [ ] Review 100 changesets randomly (some at the top, middle, end, etc)
- [ ] What percentage of the 100 changesets are actually problematic?
- [ ] Now that we know the percentage, can we use it in Gabbar - @bkowshik
cc: @planemad
@bkowshik any changeset reverted by an experienced editor (>100 edits) we can safely say was definitely a bad one. Let's use our time more wisely and review only those that were reverted by an inexperienced user (<20 edits); this is where we might find some false negatives.
Other highly valuable questions to answer here:
- What is the average response time for the community to fix a bad change?
- Is there any correlation between the response time and user attributes like experience and mapping activity?
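One rough way to approach the response-time question, assuming we have the creation timestamp of each problematic changeset and of the changeset that reverted it. The timestamps below are invented and `response_hours` is a hypothetical helper, not something that exists in Gabbar or osmcha.

```python
from datetime import datetime
from statistics import mean


def response_hours(pairs):
    """Hours between each bad edit and its revert.

    pairs: list of (bad_edit_time, revert_time) ISO 8601 strings.
    """
    hours = []
    for bad, fix in pairs:
        delta = datetime.fromisoformat(fix) - datetime.fromisoformat(bad)
        hours.append(delta.total_seconds() / 3600)
    return hours


# Made-up timestamps, for illustration only.
pairs = [
    ("2017-06-01T10:00:00", "2017-06-01T16:00:00"),
    ("2017-06-02T08:00:00", "2017-06-03T08:00:00"),
]
hours = response_hours(pairs)
print(mean(hours))
```

A median may be worth reporting next to the mean, since a few very old reverts could dominate the average.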
cc @maning @batpad
Thank you @planemad, that was super helpful!
Reverting changesets
- Out of the total of 12257 changesets, only 2604 (21%) have real changesets.
- There are a total of 468 mappers.
- 2493 (95%) reverting changesets were by users with 100 or more changesets
- 21 (1%) reverting changesets were from users with less than 20 changesets
The CSV with 21 reverting changesets by new users is at the link below:
- https://gist.github.com/bkowshik/3bb6edd4bb12c5f4be712d64338d4614
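A minimal sketch of the experience split used for these counts, with the thresholds from this thread (100 or more changesets, fewer than 20 changesets). The function names, the "intermediate" label, and the sample rows are hypothetical; the real numbers come from the full dataset, not from this toy list.

```python
from collections import Counter


def experience_bucket(user_changesets):
    """Classify a user by their total number of changesets."""
    if user_changesets >= 100:
        return "experienced"
    if user_changesets < 20:
        return "new"
    return "intermediate"


def bucket_counts(reverting):
    """Count reverting changesets per experience bucket.

    reverting: list of (changeset_id, user_changesets) pairs.
    """
    return Counter(experience_bucket(n) for _, n in reverting)


# Made-up rows: (changeset_id, user_changesets).
sample = [(1, 5), (2, 150), (3, 100), (4, 19), (5, 20), (6, 64)]
counts = bucket_counts(sample)
print(counts["experienced"], counts["intermediate"], counts["new"])
```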
Yes, there is a correlation between the experience of the user and number of reverting changesets. Reverting changesets are way more likely from experienced users than new users.

What is more interesting is that user_mapping_days has a stronger correlation with the number of reverting changesets (0.6) than user_changesets does (0.3). So, the mapping days of the user is a stronger indicator.
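For reference, a correlation like this can be computed along the following lines. The thread does not say which coefficient was used, so Pearson is an assumption here, and the per-user numbers are invented purely to show the calculation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


# Made-up per-user values, for illustration only.
user_mapping_days = [1, 10, 50, 200, 400]
reverting_changesets = [0, 0, 1, 3, 9]
r = pearson(user_mapping_days, reverting_changesets)
print(round(r, 2))
```

Running the same function once with user_mapping_days and once with user_changesets gives the two coefficients to compare.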

Reverted changesets
I couldn't resist finding whose changesets were getting reverted - the other side of the story.
- Out of the total of 14062 changesets, only 10693 (76%) have real changesets
- There are a total of 1301 mappers with one or more reverted changesets
- 8679 (61%) reverted changesets were of users with 100 or more changesets
- 1161 (8%) reverted changesets were of users with less than 20 changesets
- User DACGroup had 60% of his/her changesets reverted, 4106 reverted changesets
The number of a user's changesets getting reverted comes down as the user makes more changesets, that is, as the user gains more mapping experience.

As expected, the user's mapping days are negatively correlated, at -0.3. Thus, the more mapping days a user has, the less likely their changesets are to be reverted.

Per https://github.com/mapbox/gabbar/issues/66#issuecomment-310029426
There are 21 reverting changesets by users with fewer than 20 changesets. @manoharuss @krishnanammala can you please 👀 these and post notes about what percentage of these 21 are actually problematic?
- https://gist.github.com/bkowshik/3bb6edd4bb12c5f4be712d64338d4614
cc: @planemad