Prototyping Gabbar for highway features
One of the popular problems in machine learning is dogs vs cats: given a picture, predict whether it shows a dog or a cat. Coming from this initial experience with machine learning, I kept thinking that classifying changesets as good or problematic is a similar problem. But today I did an exercise where I wanted to identify one attribute of a changeset that makes it good or problematic. I started with:
- https://osmcha.mapbox.com/49563062/
- `highway=residential` is modified to `highway=unclassified`
The following questions came to mind
- What could be the source of knowledge to modify?
- Isn't `residential` better than `unclassified`; I mean something is better than nothing right?
- At version `15`, this is quite a mature feature. So, is that alright?
- What is the length of the highway; smaller should be residential and longer unclassified?
- Why is `source=google maps`? Really?
From https://wiki.openstreetmap.org/wiki/Key:highway
- highway=unclassified
The least most important through roads in a country's system – i.e. minor roads of a lower classification than tertiary, but which serve a purpose other than access to properties. Often link villages and hamlets.
- highway=residential
Roads which serve as an access to housing, without function of connecting settlements.
From https://osmlab.github.io/osm-deep-history/#/way/103217436
- The feature has mostly been `highway=unclassified` since creation in 2011.
Looking deeper into other changesets where a `highway=residential` gets modified into `highway=unclassified`, I found a user, Порфирий, who has lots of changesets with the same behavior. Interestingly, the user who added `highway=residential` is Порфирий too.
- https://www.openstreetmap.org/user/Порфирий/history
Eureka!
When a highway modification has so many questions to answer and attributes to look at, what will the scale be when we look at all 26 primary tags together? What about features that don't have any primary tags? Too many questions! Too many attributes! Right?
- This does not look like a traditional cats vs dogs problem. It is a little something else.
- How about we try something different? How about we build one machine learning model for each object type?
- How would it look if there were a model trained on highways to classify whether a new/modified highway is a :thumbsup: or a :thumbsdown:?
- Another trained on buildings, another on water bodies, etc., where each knows what a good feature and a problematic feature of its type look like?
- Is this it?
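One way to picture the "one model per object type" idea is a small dispatcher that picks a model based on the feature's primary tag. This is only a sketch: `PRIMARY_TAGS` is a hypothetical subset of the 26 primary tags, `classify` and `primary_tag` are made-up helper names, and the "highway model" is a toy rule standing in for a trained classifier.

```python
from typing import Callable, Dict, Optional

# Hypothetical subset of primary tags; the real list has 26 entries.
PRIMARY_TAGS = ("highway", "building", "natural", "waterway")

# A model takes the feature's tags and returns True when it looks problematic.
Classifier = Callable[[dict], bool]


def primary_tag(tags: dict) -> Optional[str]:
    """Return the first primary tag key present on the feature, if any."""
    for key in PRIMARY_TAGS:
        if key in tags:
            return key
    return None


def classify(tags: dict, models: Dict[str, Classifier]) -> Optional[bool]:
    """Route a feature to the model for its object type.

    Returns None when the feature has no primary tag or when no model
    exists for its type, so those cases stay an open question.
    """
    key = primary_tag(tags)
    if key is None or key not in models:
        return None
    return models[key](tags)


# Toy stand-in for a trained highway model (assumption, not a real model).
models = {
    "highway": lambda tags: tags.get("highway") == "footway" and tags.get("area") == "yes",
}

print(classify({"highway": "footway", "area": "yes"}, models))  # True
print(classify({"building": "yes"}, models))                     # None
print(classify({"name": "Some place"}, models))                  # None
```

Features without any primary tag fall through to `None` here, which is one of the open questions rather than an answer to it.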
cc: @anandthakker @geohacker @batpad
In the dataset I had locally, I found 36 changesets where `highway=residential` got modified to `highway=unclassified`. I 👀 a couple of these changesets.
- https://gist.github.com/bkowshik/90e703ffd087c787636ad87eaa04c231
Notes
- https://osmcha.mapbox.com/47392777/
- https://osmcha.mapbox.com/48346176/
- Unsure if this is `unclassified` or `residential`
- https://osmcha.mapbox.com/48388526/
- This should be a `residential` highway right?
- Specially with the changeset comment "Add city roads"?
Attributes by action
There are 3 action types for a highway feature
- A new highway is `created`
- An existing highway is `modified`. Property and/or geometry modification
- An existing highway is `deleted`
Some attributes depend on the action type. For example, the difference in highway length only applies to a modification; a newly created highway has no previous version to compare against. Next, which attributes are relevant when a highway is deleted? I am 🤔: won't a `length_difference` column be redundant for a newly created highway?
I am not sure how to solve this problem and would love to hear ideas. For a start, I am planning to add just the attributes of the latest version of the feature, along with the action: create, modify or delete. Let's see how this goes. If these attributes are not sufficient, we could add diff attributes like the difference in highway length, the distance between the centroids, etc.
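A rough sketch of what such a row of attributes could look like is below. Everything here is an assumption for illustration: the `highway_attributes` name, the `"tags"` and `"length_km"` keys, the choice to fall back to the old version for a deleted highway, and the choice to leave `length_difference` empty (instead of 0) when there are not two versions to compare.

```python
from typing import Optional

ACTIONS = ("create", "modify", "delete")


def highway_attributes(action: str, new: Optional[dict], old: Optional[dict]) -> dict:
    """Build one row of attributes for a highway feature.

    `new` and `old` are hypothetical dicts with "tags" and "length_km" keys.
    `old` is None for a create, and `new` is None for a delete.
    """
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")

    # Assumption: a deleted feature is described by its last existing version.
    latest = new if new is not None else old

    # One 0/1 column per action type.
    row = {f"action_{name}": int(name == action) for name in ACTIONS}
    row["highway_type"] = latest["tags"].get("highway")
    row["length_km"] = latest["length_km"]

    # Diff attributes only make sense when both versions exist.
    if action == "modify":
        row["length_difference"] = new["length_km"] - old["length_km"]
    else:
        row["length_difference"] = None  # missing, rather than a misleading 0
    return row


old = {"tags": {"highway": "residential"}, "length_km": 1.5}
new = {"tags": {"highway": "unclassified"}, "length_km": 2.0}
print(highway_attributes("modify", new, old))
print(highway_attributes("create", new, None))
```

Whether a missing `length_difference` should stay empty, become 0, or be dropped per action type is exactly the open question above; the sketch only makes the trade-off visible.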
Very early results, 2 out of the 6 predicted in the sample are interesting.
- https://osmcha.mapbox.com/48452572/
- `highway=residential` goes inside a park
- https://osmcha.mapbox.com/48299333/
- Unusual rectangular shape of the highway
Highway classifier v1
Dataset
- Labelled samples: `2,732`
- Changesets labelled good: `2,655`
- Changesets labelled harmful: `77`
Model
What did the model learn?
- Table lists 10 attributes that the model thinks are the most important.
How are the model metrics?
In previous runs, I trained the model on the training dataset and measured metrics on the validation dataset. But because of the narrow scope of the problem, the number of samples is on the lower side. So I went the cross-validation route.
- Precision: `10%` (Fraction of changesets predicted harmful that are labelled harmful)
- Recall: `20%` (Fraction of harmful changesets predicted harmful)
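For reference, here is a minimal sketch of how precision and recall for the harmful class can be computed from true and predicted labels. The labels are made up for illustration and are not taken from the dataset; `precision_recall` is a hypothetical helper name.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the harmful class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # harmful, predicted harmful
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # good, predicted harmful
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # harmful, predicted good
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels: 1 = harmful, 0 = good.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 1, 1, 0]
print(precision_recall(y_true, y_pred))  # (0.2, 0.2)
```

With only 77 harmful samples out of 2,732, a handful of predictions moves both numbers a lot, which is part of why per-fold scores vary so much.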
Results
From the unlabelled testing dataset, 6 out of 344 were predicted to be problematic. The results are interesting indeed.
- Model is learning that a `highway=footway` and `area=yes` don't exist together! :tada:
- A `demolished` highway. Did not know something like that existed.
I experimented with scaling features using `sklearn.preprocessing.StandardScaler`.
Without feature scaling
- Precision on all samples: 0.037 (0.068)
- Recall on all samples: 0.07 (0.131)
After feature scaling
- Precision on all samples: 0.034 (0.048)
- Recall on all samples: 0.052 (0.064)
Feature scaling does seem to have a small impact. Even though the mean scores come down, the standard deviations are down as well.
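To make the scaling step concrete, here is a hand-rolled sketch of the z-score transform, `z = (x - mean) / std` per column using the population standard deviation, which is the transform `StandardScaler` applies. The sample rows and the `standardize` helper are assumptions for illustration, not the attributes used by the classifier.

```python
from statistics import fmean, pstdev


def standardize(rows):
    """Scale each column to zero mean and unit variance (z-scores)."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # A constant column has std 0; divide by 1.0 instead so it stays at 0.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical attributes: [highway length in km, feature version]
rows = [[0.2, 1], [1.5, 15], [4.0, 3], [0.3, 2]]
scaled = standardize(rows)
for row in scaled:
    print([round(value, 3) for value in row])
```

In a cross-validation setup, the means and standard deviations would normally be computed on the training folds only and then applied to the held-out fold.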
460 out of the total 2,732 (17%) samples had a modification in name, which includes name additions, modifications and deletions. 22 of the 77 (28.57%) harmful changesets were name modifications. I added an attribute called `feature_name_modified` to see if that helps. The model put `feature_name_modified` at the 5th position in the importance list.
The model metrics did not show a significant variation.
- Precision on all samples: 0.058 (0.113)
- Recall on all samples: 0.054 (0.092)
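The `feature_name_modified` attribute could be derived along these lines. This is a sketch under assumptions: the tags of the old and new versions are plain dicts, a missing version is passed as `None`, and a name on a newly created feature counts as a name addition.

```python
from typing import Optional


def feature_name_modified(old_tags: Optional[dict], new_tags: Optional[dict]) -> int:
    """Return 1 if the name tag was added, changed or removed, else 0."""
    old_name = (old_tags or {}).get("name")
    new_name = (new_tags or {}).get("name")
    return int(old_name != new_name)


print(feature_name_modified({"highway": "residential"},
                            {"highway": "residential", "name": "Main Street"}))  # 1
print(feature_name_modified({"name": "Main Street"}, {"name": "Main Street"}))    # 0
```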
Error analysis
False negatives (14)
- Harmful due to geometry: 3
- Harmful due to feature name: 9
- Fixme was removed: 1
- `highway=footway`: 1
Feature is not good because of personal information in the name tag
True positives (43)
- Highway classification modified: 18
- Harmful due to feature name: 1
- Some other feature made a highway: 1
- Highway made some other feature, e.g. `river`: 13
- Harmful due to geometry: 1
- Some property of highway is modified, e.g. `oneway`: 4
Harmful change when a highway feature becomes something else
The following gist has a random sample of 25 predictions from the first version of the highway classifier. The csv has both the changeset_id and feature_id.
@krishnanammala can you 👀 these changesets on osmcha and give me some feedback?
- https://gist.github.com/bkowshik/16f1dc675d9a01e92cef6cee2569a2b9
cc: @planemad @batpad
As per the comment https://github.com/mapbox/gabbar/issues/69#issuecomment-312801138 above, I have gone through the changesets flagged by Gabbar (Highway classifier). Here are my observations:
- Total number of changesets reviewed: 26
- No. of changesets found harmful: 2
Both harmful changesets are deletions of `turn:lanes` and `lanes` tags, and both are from the same user.
I have grouped the detections into good detections and detections with less priority, so that @bkowshik gets more context on what to improve.
| Good detections | Detections with less priority |
|---|---|
| Geometry of highways changing | Highways with rest_areas & traffic signals which are less priority |
| Addition of layer tags to minor highways i.e., service roads | Addition and modification of low classification highways i.e., tracks, paths, service roads |
Hope the above observations will help you @bkowshik 👍