Extractor: clean up and normalize station names

Open daniel-j-h opened this issue 8 years ago • 4 comments

I just had a look at the Berlin GTFS feeds for this year.

Dataset page: https://daten.berlin.de/datensaetze/vbb-fahrplandaten-januar-2017-bis-dezember-2017

I looked at stops.txt from the CC-BY 3.0 licensed archive: http://www.vbb.de/de/datei/GTFS_VBB_Jan_Dez2017.zip

I can see three concrete issues:

1/ U-Bahn stations are named like U Alexanderplatz (Berlin), while other kinds of stops, e.g. those for bus lines, follow a different naming scheme. We probably don't want to show or store the (Berlin) suffix (what about the U prefix?), and we want to associate a type with each stop. (Side note: in this dataset, other delimiters seem to be / and extra information in square brackets, [x].)

2/ Several stops have almost identical names, some differing only in the number of spaces in the stop name. We should probably trim stop names and collapse runs of whitespace within them.

3/ The data contains multiple stops for each stop name. We can probably deduplicate based on location (e.g. a haversine distance < 500 m probably means the same stop). How should we handle cases where the name is the same but the location is different?
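
To make the three points concrete, here is a minimal Python sketch. It is not the prototype mentioned in this thread, only an illustration under stated assumptions: the function names (normalize_name, haversine_m, dedupe) are hypothetical, the set of type prefixes (U, S, S+U) is a guess, stripping the prefix at all is one possible answer to the open question in 1/, and the 500 m threshold is the rough figure from 3/.

```python
import math
import re

# Assumption: a trailing "(Berlin)"-style qualifier and "[x]" extras are noise.
_PAREN_SUFFIX = re.compile(r"\s*\([^)]*\)\s*$")
_BRACKETS = re.compile(r"\s*\[[^\]]*\]")
_WHITESPACE = re.compile(r"\s+")
# Assumption: a leading "U", "S" or "S+U" token marks the rail type.
_TYPE_PREFIX = re.compile(r"^(S\+U|U|S)\s+")


def normalize_name(raw):
    """Return (clean_name, stop_type); stop_type is None if no prefix is found."""
    name = _WHITESPACE.sub(" ", raw).strip()    # 2/ trim and collapse spaces
    name = _BRACKETS.sub("", name)              # 1/ drop "[x]" extras
    name = _PAREN_SUFFIX.sub("", name).strip()  # 1/ drop "(Berlin)" suffix
    match = _TYPE_PREFIX.match(name)
    stop_type = match.group(1) if match else None
    if match:
        name = name[match.end():]
    return name, stop_type


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres on a sphere of radius 6371 km."""
    radius_m = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(a))


def dedupe(stops, max_dist_m=500.0):
    """3/ Keep one entry per (name, type) within max_dist_m of each other.

    stops is an iterable of (stop_name, lat, lon) tuples. Entries with the
    same name but a distant location are kept as separate stops. This is a
    quadratic scan, fine for a prototype but not for large feeds.
    """
    kept = []
    for raw, lat, lon in stops:
        name, stop_type = normalize_name(raw)
        is_duplicate = any(
            name == k_name and stop_type == k_type
            and haversine_m(lat, lon, k_lat, k_lon) < max_dist_m
            for k_name, k_type, k_lat, k_lon in kept
        )
        if not is_duplicate:
            kept.append((name, stop_type, lat, lon))
    return kept
```

With made-up coordinates, dedupe([("U Alexanderplatz (Berlin)", 52.5219, 13.4132), ("U  Alexanderplatz (Berlin)", 52.5215, 13.4140)]) keeps a single ("Alexanderplatz", "U", ...) entry, while a third entry with the same name but a location tens of kilometres away would be kept as a separate stop.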

The issues above are not specific to the Berlin GTFS feeds; other feeds out there probably have them too.

Related:

  • Deduplication: https://github.com/mapbox/directions-transit/issues/45 (station lookup by coordinate)
  • Names: https://github.com/mapbox/directions-transit/issues/19 (name to station / location to station)

daniel-j-h, Mar 19 '17 17:03

@daniel-j-h did you check the stop->station relation for 3/? Multiple stops (platforms) may represent the same station and will be named the same. Your look-up should probably only include stations, not stops.
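
As a rough sketch of that stop-to-station relation: in GTFS, stops.txt has an optional location_type column (empty or 0 for a stop or platform, 1 for a station) and an optional parent_station column pointing from a platform to its station. The helper below (stations_only is a hypothetical name) reduces a feed to one row per station, under the assumption that the feed actually fills in those columns; platforms without a parent are kept as their own entry.

```python
import csv
import io


def stations_only(stops_txt):
    """Return one stops.txt row (as a dict) per station.

    stops_txt is the text content of a GTFS stops.txt file.
    """
    rows = list(csv.DictReader(io.StringIO(stops_txt)))
    by_id = {row["stop_id"]: row for row in rows}
    seen = set()
    stations = []
    for row in rows:
        # An empty or missing location_type means "stop or platform" (0).
        location_type = (row.get("location_type") or "0").strip()
        if location_type == "1":
            station_id = row["stop_id"]
        elif location_type == "0":
            parent = (row.get("parent_station") or "").strip()
            # Fall back to the stop itself if it has no usable parent.
            station_id = parent if parent in by_id else row["stop_id"]
        else:
            continue  # skip entrances and other non-boarding locations
        if station_id not in seen:
            seen.add(station_id)
            stations.append(by_id[station_id])
    return stations
```

When reading the file from disk, opening it with encoding="utf-8-sig" avoids a byte-order mark ending up in the first column name.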

MoKob, Mar 20 '17 15:03

Good point. No, I did not. For the prototype I simply worked with the stops.txt file to get something going quickly: a Python ten-liner, nothing fancy. We should definitely do it properly here.

daniel-j-h, Mar 20 '17 15:03

Something that just came to mind: we should check whether we can take the locations from the GTFS feeds and query OSM at those locations to extract station information from OSM.

daniel-j-h, Mar 20 '17 18:03

Some prior art for cleaning station names in OSM and Wikidata, respectively:

  • https://github.com/mapbox/mapbox-streets-source/blob/7675b6a8369a8a84e6354b89050be1a826fb6729/pgsql/lib.sql#L280-L284
  • mapbox/mapbox-streets-source#748

/cc @ajashton

1ec5, May 05 '17 00:05