Prototyping Gabbar for highway features

Open bkowshik opened this issue 8 years ago • 9 comments

One of the popular problems in machine learning is dogs vs cats: given a picture, predict whether it shows a dog or a cat. Coming from this initial experience with machine learning, I kept thinking that classifying changesets as good or problematic is something similar. But today I did an exercise where I wanted to identify one attribute of a changeset that makes it good or problematic. I started with:

  • https://osmcha.mapbox.com/49563062/
  • highway=residential is modified to highway=unclassified
screen shot 2017-06-16 at 9 15 25 am

The following questions came to mind

  • What could be the source of knowledge to modify?
  • Isn't residential better than unclassified; I mean something is better than nothing right?
  • At version 15, this is quite a mature feature. So, is that alright?
  • What is the length of the highway; smaller should be residential and longer unclassified?
  • Why is source=google maps? Really?

From https://wiki.openstreetmap.org/wiki/Key:highway

  • highway=unclassified

The least important through roads in a country's system – i.e. minor roads of a lower classification than tertiary, but which serve a purpose other than access to properties. Often link villages and hamlets.

  • highway=residential

Roads which serve as an access to housing, without function of connecting settlements.

From https://osmlab.github.io/osm-deep-history/#/way/103217436

  • The feature has mostly been highway=unclassified since creation in 2011.
screen shot 2017-06-16 at 9 19 59 am

Looking deeper into other changesets where a highway=residential gets modified into highway=unclassified, I found a user, Порфирий, who has lots of changesets with the same behavior. Interestingly, the user who added highway=residential is Порфирий too.

  • https://www.openstreetmap.org/user/Порфирий/history
screen shot 2017-06-16 at 9 30 27 am

Eureka!

When a highway modification has so many questions to answer and attributes to look at, what will the scale be when we look at all 26 primary tags together? What about features that don't have any primary tags? Too many questions! Too many attributes! Right?

  • This does not look like a traditional cats vs dogs problem. It is a little something else.
  • How about we try something different? How about we build one machine learning model for each object type?
  • How would it look if there were a model trained on highways to classify whether a new/modified highway is a :thumbsup: or a :thumbsdown:?
  • Another trained on buildings, another on water bodies, etc., with each one knowing what a good feature and a problematic feature of its type look like?
  • Is this it?
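
To make the one-model-per-object-type idea concrete, here is a minimal sketch of how a feature could be routed to its model. Everything in it is an assumption for illustration: the `PRIMARY_KEYS` subset, the `pick_model` helper and the toy rule standing in for a trained classifier are hypothetical, not part of Gabbar.

```python
# Hypothetical subset of OSM primary keys; the real list is much longer.
PRIMARY_KEYS = ("highway", "building", "natural")


def pick_model(tags, models):
    """Return (key, model) for the first primary key that the feature
    carries and that has a model registered, else (None, None)."""
    for key in PRIMARY_KEYS:
        if key in tags and key in models:
            return key, models[key]
    return None, None


# Toy stand-in for a trained classifier: flags footways tagged as areas.
def toy_highway_model(tags):
    is_footway_area = tags.get("highway") == "footway" and tags.get("area") == "yes"
    return "harmful" if is_footway_area else "good"


models = {"highway": toy_highway_model}

key, model = pick_model({"highway": "footway", "area": "yes"}, models)
prediction = model({"highway": "footway", "area": "yes"})

# A feature without any registered primary key gets no model at all.
no_key, no_model = pick_model({"name": "Some place"}, models)
```

Features without a primary tag fall through to `(None, None)`, which is the open question raised above: they would need a fallback model or a manual review path.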

cc: @anandthakker @geohacker @batpad

bkowshik avatar Jun 16 '17 05:06 bkowshik

In the dataset I had locally, I found 36 changesets where highway=residential got modified to highway=unclassified. I 👀 a couple of these changesets.

  • https://gist.github.com/bkowshik/90e703ffd087c787636ad87eaa04c231
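
A filter like the one used to find these changesets could look roughly like the sketch below. The sample records, their field names (`changeset_id`, `old_tags`, `new_tags`) and the IDs are made up for illustration; the real dataset layout is not shown in this thread.

```python
def is_reclassification(old_tags, new_tags,
                        from_value="residential", to_value="unclassified"):
    """True when the highway tag changed from from_value to to_value."""
    return (old_tags.get("highway") == from_value
            and new_tags.get("highway") == to_value)


# Made-up samples, one dict per modified feature.
samples = [
    {"changeset_id": 1,
     "old_tags": {"highway": "residential"},
     "new_tags": {"highway": "unclassified"}},
    {"changeset_id": 2,
     "old_tags": {"highway": "residential"},
     "new_tags": {"highway": "residential", "name": "Main Street"}},
    {"changeset_id": 3,
     "old_tags": {"highway": "unclassified"},
     "new_tags": {"highway": "residential"}},
]

matches = [s["changeset_id"] for s in samples
           if is_reclassification(s["old_tags"], s["new_tags"])]
```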

Notes

  • https://osmcha.mapbox.com/47392777/
screen shot 2017-06-16 at 12 45 29 pm
  • https://osmcha.mapbox.com/48346176/
  • Unsure if this is unclassified or residential
screen shot 2017-06-16 at 12 52 20 pm
  • https://osmcha.mapbox.com/48388526/
  • This should be a residential highway right?
  • Especially with the changeset comment "Add city roads"?
screen shot 2017-06-16 at 12 56 32 pm

bkowshik avatar Jun 16 '17 07:06 bkowshik

Attributes by action

There are 3 action types for a highway feature

  1. A new highway is created
  2. An existing highway is modified (property and/or geometry modification)
  3. An existing highway is deleted

Some attributes depend on the action type. For example, the difference in highway length applies only to modifications; a newly created highway has no previous version to compare against. Next, which attributes are relevant when a highway is deleted? I am 🤔: won't a length_difference column be redundant for a newly created highway?

I am not sure how to solve this problem and would love to hear ideas. But for a start, I am planning to add just the attributes in the latest version of the model along with the action (create, modify or delete). Let's see how this goes. If these attributes are not sufficient, we could add other diff attributes like difference in highway length, distance between the centroids, etc.
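
One possible reading of this plan, as a minimal sketch: every row gets the action as indicator columns plus attributes of the current version, and diff attributes are filled in only for modifications. The `build_row` helper, the `length_m` field and the use of `None` for "not applicable" are assumptions for illustration, not the actual Gabbar schema.

```python
from typing import Optional


def build_row(action: str, old: Optional[dict], new: Optional[dict]) -> dict:
    """Build one training row for a highway feature.

    old/new are hypothetical dicts with 'tags' (dict) and 'length_m' (float);
    old is None for a creation and new is None for a deletion.
    """
    if action not in ("create", "modify", "delete"):
        raise ValueError(f"unknown action: {action}")

    # A deleted highway has no new version, so describe the last known one.
    current = old if action == "delete" else new

    row = {
        "action_create": int(action == "create"),
        "action_modify": int(action == "modify"),
        "action_delete": int(action == "delete"),
        "highway": current["tags"].get("highway"),
        "length_m": current["length_m"],
        # Diff attributes only exist for modifications; None marks "not applicable".
        "length_difference": None,
    }
    if action == "modify":
        row["length_difference"] = new["length_m"] - old["length_m"]
    return row


created = build_row("create", None,
                    {"tags": {"highway": "residential"}, "length_m": 120.0})
modified = build_row("modify",
                     {"tags": {"highway": "residential"}, "length_m": 120.0},
                     {"tags": {"highway": "unclassified"}, "length_m": 150.0})
```

Whether `None` should later be imputed (for example with 0) or handled by a model that supports missing values is still an open design choice.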

bkowshik avatar Jun 17 '17 14:06 bkowshik

Very early results, 2 out of the 6 predicted in the sample are interesting.

  • https://osmcha.mapbox.com/48452572/
  • highway=residential goes inside a park
screen shot 2017-06-18 at 12 19 01 am
  • https://osmcha.mapbox.com/48299333/
  • Unusual rectangular shape of the highway
screen shot 2017-06-18 at 12 19 21 am

bkowshik avatar Jun 17 '17 18:06 bkowshik

Highway classifier v1

Dataset

  • Labelled samples: 2,732
  • Changesets labelled good: 2,655
  • Changesets labelled harmful: 77

Model

What did the model learn?

  • Table lists 10 attributes that the model thinks are the most important.
screen shot 2017-06-23 at 6 27 13 pm

How are the model metrics?

With previous runs, I trained the model on the training dataset and measured metrics on the validation dataset. But because of the narrow scope of the problem, we have relatively few samples. So I went the route of cross-validation.

  • Precision: 10% (Fraction of changesets predicted harmful that are actually harmful)
  • Recall: 20% (Fraction of harmful changesets predicted harmful)
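
For reference, the two metrics can be computed from labels and predictions as in this minimal sketch. The labels below are made up, chosen only so that the result reproduces 10% precision and 20% recall; they are not the real dataset.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (1 = harmful, 0 = good)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# 5 harmful and 15 good changesets (made up).
y_true = [1] * 5 + [0] * 15
# The model flags 10 changesets: 1 truly harmful one and 9 good ones.
y_pred = [1, 0, 0, 0, 0] + [1] * 9 + [0] * 6

precision, recall = precision_recall(y_true, y_pred)
# precision is 1/10 = 0.1, recall is 1/5 = 0.2
```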

Results

From the unlabelled testing dataset, 6 out of 344 changesets were predicted to be problematic. The results are interesting indeed.

  • Model is learning that a highway=footway and area=yes don't exist together! :tada:
screen shot 2017-06-23 at 5 46 04 pm
  • A demolished highway. Did not know something like that existed.
screen shot 2017-06-23 at 5 39 42 pm

bkowshik avatar Jun 23 '17 13:06 bkowshik

I experimented with scaling features using sklearn.preprocessing.StandardScaler

Without feature scaling

  • Precision on all samples: 0.037 (0.068)
  • Recall on all samples: 0.07 (0.131)

After feature scaling

  • Precision on all samples: 0.034 (0.048)
  • Recall on all samples: 0.052 (0.064)

Feature scaling does seem to have a small impact. Even though the mean scores come down, the standard deviations are down as well.
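
For context, standardization rescales each feature column to zero mean and unit variance. The sketch below does this in plain Python for a single column; it mirrors what StandardScaler computes per feature (using the population standard deviation), but it is an illustration, not the code used in the experiment, and the lengths are made up.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Return (x - mean) / std for every value in the column."""
    mean = fmean(column)
    std = pstdev(column)  # population standard deviation
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]


# Made-up highway lengths in metres.
lengths = [50.0, 120.0, 300.0, 2500.0]
scaled = standardize(lengths)
```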

bkowshik avatar Jun 28 '17 08:06 bkowshik

460 out of the total 2732 (17%) samples had a modification in name, which includes name additions, modifications and deletions. 22 of the 77 (28.57%) harmful changesets were name modifications. I added an attribute called feature_name_modified to see if that helps. The model put the feature_name_modified at the 5th position in the importance list.

screen shot 2017-06-28 at 3 57 37 pm
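
An attribute like feature_name_modified could be derived as in the sketch below. The function signature and the tag dictionaries are assumptions for illustration; only the attribute name comes from the comment above.

```python
def feature_name_modified(old_tags, new_tags):
    """Return 1 if the name tag was added, changed or removed, else 0.

    Either argument may be None (feature created or deleted).
    """
    old_name = (old_tags or {}).get("name")
    new_name = (new_tags or {}).get("name")
    return int(old_name != new_name)


added = feature_name_modified({"highway": "residential"},
                              {"highway": "residential", "name": "Main Street"})
unchanged = feature_name_modified({"highway": "residential", "name": "Main Street"},
                                  {"highway": "unclassified", "name": "Main Street"})
```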

The model metrics did not show a significant variation.

  • Precision on all samples: 0.058 (0.113)
  • Recall on all samples: 0.054 (0.092)

bkowshik avatar Jun 28 '17 10:06 bkowshik

Error analysis

False negatives (14)

  • Harmful due to geometry: 3
  • Harmful due to feature name: 9
  • Fixme was removed: 1
  • highway=footway: 1
screen shot 2017-06-30 at 9 40 21 am

Feature is not good because of personal information in the name tag

True positives (43)

  • Highway classification modified: 18
  • Harmful due to feature name: 1
  • Some other feature made into a highway: 1
  • Highway made into some other feature, e.g. river: 13
  • Harmful due to geometry: 1
  • Some property of the highway is modified, e.g. oneway: 4
screen shot 2017-06-30 at 10 05 52 am

Harmful change when a highway feature becomes something else

bkowshik avatar Jun 30 '17 04:06 bkowshik

The following gist has a random sample of 25 predictions from the first version of the highway classifier. The csv has both the changeset_id and feature_id.

@krishnanammala can you 👀 these changesets on osmcha and give me some feedback?

  • https://gist.github.com/bkowshik/16f1dc675d9a01e92cef6cee2569a2b9

cc: @planemad @batpad

bkowshik avatar Jul 04 '17 07:07 bkowshik

As per the comment https://github.com/mapbox/gabbar/issues/69#issuecomment-312801138 above, I have gone through the changesets flagged by Gabbar (highway classifier). Here are my observations:

  • Total number of changesets reviewed: 26
  • Number of changesets found harmful: 2

Both harmful changesets are deletions of turn:lanes & lanes tags, and both are from the same user.

I have outlined the detections more clearly below, separating them into good detections and detections with less priority, so that @bkowshik gets more context for improvement.

Good detections

  • Deletion of area tags to highways
  • Junction=roundabout tag deleted
  • Classification of highways (higher -> lower) i.e., residential to unclassified
  • Addition of turn:lanes

Detections with less priority

  • Geometry of highways changing
  • Highways with rest_areas & traffic signals which are less priority
  • Addition of layer tags to minor highways i.e., service roads
  • Addition and modification of low classification highways i.e., tracks, paths, service roads

Hope the above observations will help you @bkowshik 👍

krishnanammala avatar Jul 05 '17 10:07 krishnanammala