
Using reverted changesets for model training

Open bkowshik opened this issue 8 years ago • 5 comments

Per text with @batpad,

Changeset comment has revert

There are a total of 13,125 changesets on osmcha with "revert" in the changeset comment. Interestingly, 2,505 (20%) of them are one-feature modification changesets, which is what we use in the latest version of Gabbar.

Assuming mappers revert a problematic or wrong feature in these one-feature modification changesets, this could be an additional dataset for the current iteration of Gabbar's feature-level classifier. I manually :eyes: a couple of these changesets and they are definitely what we want to catch with Gabbar.

  • https://osmcha.mapbox.com/49465923
  • https://osmcha.mapbox.com/49442894
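A minimal sketch of the filter described above, for anyone who wants to reproduce the counts. The field names (`comment`, `create`, `modify`, `delete`) and the sample records are assumptions for illustration; the actual osmcha export may name them differently.

```python
def is_single_feature_revert(changeset):
    """True when the comment mentions 'revert' and exactly one feature was modified.

    `changeset` is a dict with hypothetical keys: 'comment', 'create',
    'modify' and 'delete' (the last three are feature counts).
    """
    comment = (changeset.get("comment") or "").lower()
    return (
        "revert" in comment
        and changeset.get("modify", 0) == 1
        and changeset.get("create", 0) == 0
        and changeset.get("delete", 0) == 0
    )


# Made-up records, only to show the shape of the input.
sample = [
    {"id": 1, "comment": "Revert of wrong name tag", "create": 0, "modify": 1, "delete": 0},
    {"id": 2, "comment": "Added a cafe", "create": 0, "modify": 1, "delete": 0},
    {"id": 3, "comment": "revert vandalism", "create": 0, "modify": 5, "delete": 2},
    {"id": 4, "comment": None, "create": 0, "modify": 1, "delete": 0},
]

candidates = [c["id"] for c in sample if is_single_feature_revert(c)]
print(candidates)
```

A plain substring match will also pick up comments such as "not a revert", so the result is a candidate list to review rather than a labelled dataset.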

Changesets from revert user accounts

Mappers and DWG sometimes maintain a separate account for reverts. Changesets from these accounts will be interesting to look at as well. Ex:

  • https://www.openstreetmap.org/user/SomeoneElse_Revert/history

cc: @anandthakker @geohacker

bkowshik avatar Jun 15 '17 14:06 bkowshik

I found 2604 of these changesets that also have a GeoJSON version in real changesets. Assuming all features in changesets with "revert" in the changeset comment are correcting a harmful change, I get the changeset IDs of the previous version of all features in these changesets. Ex:

  • Changeset 49692400, has a changeset comment: Reverting 49662828 by Demo15_15
  • There is one feature deleted in this changeset, Node: 4924043021
  • The changeset that previously touched this feature is 49662828
  • This is the potentially problematic edit we are interested in
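The lookup in the example above could be sketched roughly like this. It assumes each feature's history is available as a list of versions with `version` and `changeset` keys; that structure, the function name, and the version numbers in the sample are hypothetical, not the real changesets schema.

```python
def previous_changeset(history, reverting_changeset):
    """Return the changeset that touched a feature just before the reverting one.

    `history` is a hypothetical list of dicts, one per feature version,
    each with 'version' and 'changeset' keys.
    """
    versions = sorted(history, key=lambda v: v["version"])
    for prev, curr in zip(versions, versions[1:]):
        if curr["changeset"] == reverting_changeset:
            return prev["changeset"]
    # The feature was created in the reverting changeset, or never touched by it.
    return None


# Illustrative history for node 4924043021; the version numbers are assumed.
history = [
    {"version": 1, "changeset": 49662828},
    {"version": 2, "changeset": 49692400},
]

print(previous_changeset(history, 49692400))
```

Running this over every feature in every reverting changeset and collecting the results in a set would give the list of unique, potentially problematic changesets.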

Following this workflow, I find a list of 14,062 unique changesets. Ideally, this is a list of changesets that had a problematic feature which was later reverted. The next step was to see what percentage of these are recent (say, from 2017) and have a real changesets version, so that we can use them as part of the training/validation dataset in Gabbar.

  • There are 10,691 (76%) potentially problematic changesets that have a real changesets version.

@manoharuss @krishnanammala, need your help here. Can you randomly :eyes: about 100 changesets from this list to see what percentage of the 100 are problematic? This will help us understand what to expect and whether this can be used as a training dataset in Gabbar.

  • https://gist.github.com/bkowshik/bddc87ca4dd74c37d9ae097985a28edc
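One way to pick the 100 changesets without bias toward the top of the list is a seeded random sample. This is only a sketch: the function name and the seed are arbitrary, and the IDs below are placeholders rather than the contents of the gist.

```python
import random


def sample_for_review(changeset_ids, k=100, seed=0):
    """Pick up to k distinct changeset IDs, reproducibly for a given seed."""
    rng = random.Random(seed)
    unique_ids = sorted(set(changeset_ids))
    return rng.sample(unique_ids, min(k, len(unique_ids)))


# Placeholder IDs standing in for the list of potentially problematic changesets.
all_ids = list(range(49000000, 49010691))

review_batch = sample_for_review(all_ids)
print(len(review_batch))
```

Fixing the seed means two reviewers can regenerate the same batch and compare notes.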

Next actions

  • [ ] Review 100 changesets randomly (some at the top, middle, end, etc)
  • [ ] What percentage of the 100 changesets are actually problematic?
  • [ ] Now that we know the percentage, can we use it in Gabbar - @bkowshik

cc: @planemad

bkowshik avatar Jun 21 '17 06:06 bkowshik

@bkowshik any changeset reverted by an experienced editor (>100 edits) we can safely say was definitely a bad one. Let's use our time more wisely and review only those that were reverted by an inexperienced user (<20 edits); this is where we might find some false negatives.

Other highly valuable questions to answer here:

  • What is the average response time for the community to fix a bad change?
  • Is there any correlation between the response time and user attributes like experience, mapping activity, etc.?

cc @maning @batpad

planemad avatar Jun 21 '17 06:06 planemad

Thank you @planemad, that was super helpful!

Reverting changesets

  • Out of the total of 12257 changesets, only 2604 (21%) have real changesets.
  • There are a total of 468 mappers.
  • 2493 (95%) reverting changesets were by users with 100 or more changesets
  • 21 (1%) reverting changesets were from users with less than 20 changesets

The CSV with 21 reverting changesets by new users is at the link below:

  • https://gist.github.com/bkowshik/3bb6edd4bb12c5f4be712d64338d4614
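For reference, the experience breakdown above can be sketched as a simple bucketing step. The thresholds (100 or more, fewer than 20) come from this thread; the bucket names, the `user_changesets` row key, and the sample rows are assumptions for illustration.

```python
from collections import Counter


def experience_bucket(user_changesets):
    """Bucket a user by total changeset count, using the thresholds in this thread."""
    if user_changesets >= 100:
        return "experienced"
    if user_changesets < 20:
        return "new"
    return "intermediate"


def bucket_shares(rows):
    """Map each bucket to (count, rounded percentage of all rows)."""
    counts = Counter(experience_bucket(r["user_changesets"]) for r in rows)
    total = sum(counts.values())
    return {bucket: (n, round(100 * n / total)) for bucket, n in counts.items()}


# Made-up rows: one per reverting changeset, with the author's changeset count.
rows = [{"user_changesets": n} for n in (150, 100, 5, 19, 20, 50)]

shares = bucket_shares(rows)
print(shares)
```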

Yes, there is a correlation between the experience of the user and number of reverting changesets. Reverting changesets are way more likely from experienced users than new users.


What is more interesting is that user_mapping_days has a stronger correlation at 0.6 to number of reverting changesets in comparison to user_changesets with a correlation of 0.3. So, the mapping days of the user is a stronger indicator.
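A minimal sketch of how such a correlation could be computed, assuming it is a plain Pearson coefficient over per-user values (the thread does not say which coefficient was used). The numbers below are made up to show the call, not taken from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either sequence is constant.
    return cov / math.sqrt(var_x * var_y)


# Made-up per-user values: mapping days vs. number of reverting changesets.
user_mapping_days = [10, 200, 450, 900, 1500]
reverting_changesets = [0, 1, 3, 2, 8]

r = pearson(user_mapping_days, reverting_changesets)
print(round(r, 2))
```

The same function applied to `user_changesets` in place of `user_mapping_days` would give the second coefficient for comparison.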


bkowshik avatar Jun 21 '17 09:06 bkowshik

Reverted changesets

I couldn't resist finding out whose changesets were getting reverted - the other side of the story.

  • Out of the total of 14062 changesets, only 10693 (76%) have real changesets
  • There are a total of 1301 mappers with one or more reverted changesets
  • 8679 (61%) reverted changesets were of users with 100 or more changesets
  • 1161 (8%) reverted changesets were of users with less than 20 changesets
  • User DACGroup had 60% of his/her changesets reverted, 4106 reverted changesets

The number of a user's changesets getting reverted comes down as the user makes more changesets, that is, as the user gains more mapping experience.


As expected, user mapping days is negatively correlated, at -0.3. Thus, the more mapping days a user has, the less likely their changesets are to be reverted.


bkowshik avatar Jun 21 '17 10:06 bkowshik

Per https://github.com/mapbox/gabbar/issues/66#issuecomment-310029426

There are 21 reverting changesets by users with fewer than 20 changesets. @manoharuss @krishnanammala can you please 👀 these and post notes about what percentage of these 21 are actually problematic?

  • https://gist.github.com/bkowshik/3bb6edd4bb12c5f4be712d64338d4614

cc: @planemad

bkowshik avatar Jun 22 '17 16:06 bkowshik