To actually route in a public transportation network, we require the translation from location / query to stations.

I see a few different scenarios that we might want to support.

Name To Station

Similarity measures

Entering the name of a station, or parts of it, we should be able to find the best match in the data.

For any string s and station S, we need a function that determines the likelihood of s referring to S. We could use different synonyms (e.g. S+U Stationname and Stationname) to refer to the same station.

Similarity could further use techniques like stemming, phonetics matching or similar to distinguish between possible candidates.

A possible hint to further strengthen the look-up could be the distance to the GPS position.

Selecting stations

User enters name 'n'. If name 'n' can be matched with a high enough certainty, we directly select the station. If the likelihood of many stations is high enough, we should output a sorted list of possible candidates.

Possible Features

By supporting multiple likely results (staring at maybe like 2/3 characters), we could implement autocomplete.

Location to Station

For the look-up location to station we would need integration with walking durations. What I imagine is a preselection phase to select a Pareto-set of possible connections. For every line we can select the nearest (maybe two nearest/maybe per direction?) stations. Within a maximal walking radius, we can pre-filter a set of candidates.

These candidates then need to be checked for their walking distances. This could be done with a walking table query to OSRM/mapbox-directions.

This gives a set of initial data. Next to this initial data, we also require a check for the direct connection (if it is close enough) in walking.

After a search for the best connections, we can then augment our response by querying the walk-api for an actual set of walking instructions to the stations/from the stations.

/ cc @daniel-j-h

Mar 03 '17 12:03 MoKob

I'd like to give the Name->Station task a shot when I can make some time for it.

I think all that's needed for this service is the stop.proto messages. The lookup will then work based on name and location and return matching ids (and maybe matching metadata such as scores, corrections for the match when we fuzzy match etc.).

I want to have at least scalable and fast Fuzzy Matching of some sort.

If we want to be fancy with NLP and do stemming, etc. I'd consider writing this service in Python and using existing libraries. Also we should think about if our matcher should work worldwide or (for now) in Europe/US only.

Mar 03 '17 14:03 daniel-j-h

Sooo.. taking the train to Bavaria I had some time to think about this.

Here's a first prototype for a fuzzy autocompleter:

https://gist.github.com/daniel-j-h/c626acfd32b51e7e1fe9e0aedd557514

It's a stupid simple gram-based inverted index and sorting by Jaccard similarity between the gram sets.

It's super inefficient doing full sorts where nth_element-equivalents would do, does not do any NTP pre-processing other than lower-casing and other atrocities (I blame the train ride).

But: it already works amazingly well! Give it a try e.g. on Berlin!

Mar 05 '17 18:03 daniel-j-h

There's another point I totally forgot to mention: Unicode awareness. We should not limit us to ASCII in 2017. The prototype already handles bigrams based on Unicode code points (and not bytes), but we probably should do some NFKC/NFKD normalization before indexing and querying, too.

http://unicode.org/reports/tr15/
https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization

Mar 06 '17 09:03 daniel-j-h

@daniel-j-h if you manage to do emoji look-up, that would be THE addition to the framework :)

🌞 🗼 🚄 🚄 🗻 : how do I get from tokyo by train to mount fuji around noon

Mar 08 '17 10:03 MoKob

Capturing from chat: what I can imagine for a first prototype is having

Stop names
Stop locations
(Normalized) stop "importance" (e.g. degree in the transit network)

then we can build a service for doing fuzzy auto completion:

User request: name string, optional location hint (derived in client from GPS / map view, etc.)
Service response: list of (name, confidence) pairs, can be empty

The results should then be ranked by a linear combination of

Matching grams
Weights
Proximity to location hint

If you check the FindProtobuf CMake module there is a CMake function to generate Python stubs based on the .proto schemas. This should allow us to build a Python3 based service based on the techniques outlined above.

Mar 08 '17 17:03 daniel-j-h

From first impressions of the data. We should be able to handle a set of situations for the look-ups:

street corners: find stops of the form missing and first / mission & first
handle written numbers / non-written numbers: 1st / first

In addition, we should find ways to pre-filter data for unexpected characters. The Berlin feed for example lists (Berlin) on all stops.

Mar 10 '17 14:03 MoKob

Client side example of how this would look and feel like

:train2: https://daniel-j-h.github.io/berlin-u-stops/

Should probably be done on the backend except we know the client's position and can prefetch stops and do it on a constrained area locally. Then it can be done even when there is no connectivity.

Not sure about performance, though. This is only on hundreds of U stops. It will probably not scale to more than some thousands. At least not this unoptimized fuzzy string matching running in the browser.

Mar 19 '17 21:03 daniel-j-h

Implement Station Look-Up

Name To Station

Similarity measures

Selecting stations

Possible Features

Location to Station