Add `URLPatternList`
Closes #30
Mini-explainer:
This adds a new URLPatternList interface that is can be used to match many URLPatterns at once. This new interface can be used to easily create URL routers with a large number of patterns, without compromising on performance.
An example of how a user could use this API:
const patterns = [
new URLPattern("https://example.com/"),
new URLPattern("https://example.com/about"),
new URLPattern("https://example.com/blog/:slug"),
new URLPattern("https://example.com/blog/:slug/comments"),
];
const list = new URLPatternList(patterns);
let res;
res = list.exec("https://example.com/about");
patterns.indexOf(res.pattern); // 1
res = list.exec("https://example.com/blog/abc");
patterns.indexOf(res.pattern); // 2
res.pathname.groups; // { slug: "abc" }
list.test("https://example.com/imprint"); // false
list.test("https://example.com/about"); // true
list.test("https://example.com/blog/abc/comments"); // true
The main motivation for this addition is the performance improvements this can unlock compared to a naive matching algorithm based on the existing URLPattern interface. Existing user-land router implementation using URLPattern slow down linearly with the addition of patterns. URLPatternList is designed to be able to deliver sub linear performance characteristics for routers of any size.
TODOs
- [ ] Consensus on sort order
- [ ] Consensus on dynamic add and remove of patterns
- [ ] Check that an optimized implementation is possible (implement in Deno)
- [ ] Write web platform tests
Also:
check that a radix tree implementation is possible (implement in Deno)
I doubt the API shape here will really dictate whether a radix tree implementation is possible or not. Instead I expect the path-to-regexp style pattern matching will impose some constraints which make a pure radix solution difficult.
That does not mean optimizations will not be possible, though. I think there are clear optimizations a list object like this can enable; e.g. log(n) searching of a fixed string prefix in the patterns.
I have also theorized it might be possible to construct an optimization where the patterns are deconstructed into automata similar to regex compilation. Then something akin to a radix tree could be built from the automata. We could then evaluate the automata against the input in O(n) time where n is the input length.
Yeah, that makes total sense. My first pass was going to just deal with speeding up fixed text prefix matches (actually also for regular URLPattern).
I have also theorized it might be possible to construct an optimization where the patterns are deconstructed into automata similar to regex compilation. Then something akin to a radix tree could be built from the automata. We could then evaluate the automata against the input in O(n) time where n is the input length.
I was also thinking about that today! :)
I was actually planning to do exactly this for a v2 of the URLPattern(List) implementation in Deno. The way I envision it is to put another intermediate layer in between the "parts" and the "regexp" representation of a component. This intermediate format would be a layered "matcher" stack that is flat for URLPattern, and a proper tree for URLPatternList. You can then walk through this tree similar to a radix tree (except that some amount of backtracing is likely necessary). Each node in the tree would be some specific match construct. For example the simplest node would just be a literal string match. A more complex "optional" node would have an inner matcher that it would attempt to match before passing off to the next matcher in the stack. There would be a "regexp" node that just delegates a given match to the ECMA regexp engine (for user provided regexps).
What do you think about including this intermediate format in the spec? Doing that would allow us to make sure future spec additions / changes don't break non-regexp matcher algorithms. The spec would still only specify the naive regexp based matcher, but it'd be much easier for implementations to implement alternative faster matchers based on this intermediate format.
What do you think about including this intermediate format in the spec?
I don't think we should put any intermediate representation in the spec as it then makes future optimizations more difficult in the implementation. I think we want the flexibility to change the intermediate representation in the future. Also, none of this stuff should be observable to sites.
Also, when this is further along it would be nice to include a "mini explainer" in the pull request description. Something like https://github.com/whatwg/dom/pull/1032. And submit to TAG review.
Besides helping to get feedback on the proposal this would also make it easier for chromium to adopt the API more quickly.
@wanderview I simplified the proposed API. There is no more key anymore. Instead, the exec method now returns a pattern field in the result object that users can use to figure out which pattern matched. To do attach a key (or any metadata) to a pattern, users can now use a Map:
const patterns: Array<[URLPattern, string]>;
const keys = new Map(patterns);
const list = new URLPatternList(patterns.map(r => r[0]));
const res = list.exec(...)
if (res !== null) console.log("Matched", map.get(res.pattern));
I think @domenic had previously advocated the map approach to me. If this is ergonomic enough for developers, that works for me.
I have added a mini explainer to the PR description.
@lucacasonato What is the status of this? How did your prototyping in deno go? I'm thinking of prototyping this in chromium using re2's set implementation.
I was also wondering if we should rename this to URLPatternSet. My impression is we don't want to support duplicate entries.
I filed an intent to prototype for chromium:
https://groups.google.com/a/chromium.org/g/blink-dev/c/QrPrveVyFnA/m/PkaPKfJ4AwAJ
@domenic Do you think we should include setlike in the webidl here? (Again, I think we should probably make this URLPatternSet.)
Edit: Note, the URLPatternSet will be ordered, but not insertion order. It will instead be sorted based on a spec defined algorithm.
Also, I wonder if the Map oriented approach to associating data with entries will make it hard to have a constructor that takes dictionaries instead of full URLPatterns. Not sure if the dictionaries shortcut is worth optimizing for at the cost of added complexity elsewhere.
I've always been -1 on the map-oriented approach, myself.
Regarding setlike APIs, I think the main question is whether you want people to introspect and possibly mutate these things after creation. The spec PR here gives an immutable, non-introspectable object, and that seems to accomplish the main use cases. On the other hand, add introspection APIs from setlike (entries, forEach, has, keys, size, values) doesn't seem to hurt. And adding add, clear, and delete also doesn't seem that bad.
Note that if you're not using insertion order, you'll need to override the implementation of add(), in order to put the given value in the right place in the set entries.
I've always been -1 on the map-oriented approach, myself.
Really? I thought you suggested it to me before. I must have been confused. Should we use a maplike API instead?
While read-only is all we technically need, web developers have said making it mutable would be nice. Maybe starting read-only would be safest and add mutators later if needed. (For example I'm unsure if putting a duplicate entry in should overwrite or fail.)
I'm sorry, I misunderstood what you meant. I was -1 on changing the API of URLPatternList to be map-oriented. I am +1 on developers using maps separately as a side table.
Ok, I recall you were positive on passing in an array of init dictionaries to the constructor. But that doesn't seem to work with the external map approach. Does that seem ok to you?
To clarify, if you pass in an init dictionary and the constructor makes the URLPattern for you, then you don't have the pattern as a key in your Map.
Yeah, that seems OK to me. I think people who want to associate data that way can use explicit URLPattern objects.
Does this only do stateless longest matching?
Should it be possible to resume pattern matching from a specified pattern (lastPattern), like RegExp's lastIndex?
@lucacasonato what's the status of this?