[Discussion] Support lateral joins and parent_as in association's joins_query
Now you can create a custom Ecto.Association that uses lateral joins:
defmodule MyAssoc do
@behaviour Ecto.Association
def joins_query(%{owner: owner, owner_key: owner_key, related: related} = assoc, join_parent) do
import Ecto.Query
subquery = from r in related,
where: field(r, ^owner_key) == field(parent_as(^join_parent), :id),
join: p in assoc(r, :profile),
select: %{
data: fragment("jsonb_object_agg(?, ?)", p.slug, r.data)
},
group_by: field(r, ^owner_key)
from o in owner, inner_lateral_join: ^subquery(subquery), on: true
end
# ...
end
Motivation
Today I've been playing around with a custom Ecto.Association (as a module behaviour) to solve a performance issue.
The gist of the performance issue is that we've been using a postgres view which has a schema in elixir, and we define a simple has_one association from another schema to that view, to be able to preload this view or sometimes add joins to it. The problem with this approach is that due to the nature of the view's query, the postgresql planner doesn't filter the rows by the foreign key written in the ON-clause of the join, but enumerates all the records of the view in memory (120k on our machine) and only then applies the filter from the on clause (leaving only one record). Of course, this takes a lot of time, about 1 seconds for this query vs. about 1ms for an optimized query that uses only index scans.
The optimized query uses JOIN LATERAL and applies the filter inside the subquery, so the planner can use indexes on that table.
It would be a very simple fix for us, if custom associations could support lateral joins and parent_as, because we'd only have to write the assoc and leave the rest of the source code untouched.
In my opinion, only 2 changes are necessary, and I've already written some code in ecto that shows promising results.
Both changes are in the query planner, when it calls the joins_query callback of Ecto.Association:
First, the planner needs to merge a desired lateral join from the association with the desired join direction from the query. Only left and inner lateral joins are supported, so if the query wants an outer or right join, while the assoc wants a lateral join, we can raise some error.
Secondly, to support referencing a column from the assoc source, there needs to be some way to tell the joins_query callback the index of the assoc source. I solved this by adding a second argument to the callback, for backwards compatibility, the unary joins_query callback will be called when the binary isn't defined. But maybe there's a better solution for this that doesn't need to change the API so much.
I submit this PR after 3 hours of work, before putting in more effort, to see if it would be an acceptable way of solving this problem first.
Without having too much time to look at it right now, my feeling is we can probably support something like this but in a simpler way.
Since we already generate queries for preloaded associations, we should be able to do wrap it in a subquery when we detect a lateral join. And I believe we already have the binding since this will have to be a single query preload so we should be able to use that in parent_as.
For preloads I'm not so worried about it, for me it's mainly about the join: x in assoc(y, :z) syntax, that we use all over the codebase for this association to the view.
So the idea is that, if someone does lateral_join: x in assoc(x, ...), we automatically write it as lateral join? Or what if we introduce a lateral_assoc helper?
The usage of my custom assoc works like this:
in the schema:
defmodule Book do
import Book.CommentStatsAssoc
schema "book" do
comment_stats :comment_stats
end
end
and in the query:
from b in Book, join: s in assoc(b, :comment_stats)
then the query would be planned like this:
from b in Book,
as: :book, #alias is actually a ref added by the planner
inner_lateral_join: s in subquery(
from c in Comment,
where: c.book_id == parent_as(:book).id,
select: %{count: count(c.id)}
),
on: true
For me, it would be nice if I didn't have to write inner_lateral_join in the query, since it's deeply embedded in our codebase already. However, I can see how having your join "upgraded" to a lateral join could also be confusing for developers.
To be able to define a lateral joined association directly in the schema would be great too!
I also thought about the problem with the parent_as, maybe instead of adding an alias to the parent table all the time, the planner could transform an alias found in the query returned by joins_query into an alias from the assoc parent ix to the same name.
def joins_query(assoc) do
from b in Book, as: :__lateral_assoc_book__,
inner_lateral_join: s in subquery(
from c in Comment,
where: c.book_id == parent_as(:__lateral_assoc_book__).id,
select: %{count: count(c.id)}
)
end
This is what I initially tried to write, but then I saw that the planner only takes the joins from the returned query and discards anything else. If the aliases were transformed as well, it could work (also if the lateral join references another join instead of the root table).
Then we don't need to introduce another callback with more arity.
The issue is that the Ecto.Association API is private, we use it internally, but it isn't public. So I am worried about adding new callbacks that ourselves don't use, it would even be hard to test them. So I would prefer to surface the feature using proper APIs with some docs. :)
Yeah adding a new callback isn't my favorite solution either. I think transforming the aliases is a much nicer and more readable solution.
I can also look into how a lateral helper could work which uses the new capabilities internally.
schema do
lateral_one :comment_stats do
from b in Book,
as: :book,
inner_lateral_join: s in subquery(
from c in Comment,
where: c.book_id == parent_as(:book).id,
select: %{count: count(c.id)}
),
on: true
end
end
and a similar lateral_many macro as well.
It feels off to me to force the join type in the schema. It's so common to want to use the same table for many different joins that I think people will almost always have to duplicate the association just to do what they want.
As soon as you have this association that uses a subquery with a filter that references some column from other queries, you can't use anything but lateral joins in SQL anyway.
It should be clear to users that a lateral association can only be used with lateral joins.
Right, which is why I proposed lateral_assoc(...) (which we can validate as part of a lateral join). :)
For example:
posts has an association on comments
sometimes I want the association to be joined in an inner join
sometimes I want the association to be joined in a lateral join
If I define it in the schema then I need 2 associations. If the join is specified in the query then I only need 1 association and I can have a helper that wraps it in a subquery and uses parent_as when I want the lateral join.
sometimes I want the association to be joined in an inner join
How would you join a subquery that can only be used in a lateral join with an inner join? I feel like we're not on the same page.
Something you might not be aware of is that today Ecto ~~will store~~ has a default query for each association. It's how separate query preloads work when you do something like this
from p in Post, preload: :comments
So what I am saying is that there is a helper that lets you leverage this query when you want to use the association in a lateral join. It should wrap it in a subquery and deal with parent_as.
I think you are very fixed on the subquery being part of the association definition. That is exactly what I am saying shouldn't happen.
@josevalim I started looking at ways to implement this. We could have a lateral_assoc helper like this
from p in Post, as: :post, inner_lateral_join: c in lateral_assoc(p, :comments, :post)
which would get turned into this
from p in Post, as: :post,
inner_lateral_join: c in subquery(from c in Comments,
where: c.post_id == parent_as(:post).id)
But we could not add a preload on it like this
from p in Post, as: :post,
inner_lateral_join: c in subquery(from c in Comments,
where: c.post_id == parent_as(:post).id),
preload: [comments: c]
because we can't preload with subquery sources since we can't find the primary keys.
Do you think it is worth it to have a helper like this? I am on the fence. It is a nice abstraction over a potentially annoying subquery. But it is a decent downside not being able to preload.
We can alternatively (but with the same downsides) keep using assoc but expand it to be assoc/3 and take in a parent binding. Then based on the join qualifier we raise or not if they didn't provide the parent binding. The joins with assoc/2 already don't work with the lateral join qualifiers because they are not wrapped in subqueries. So it might be a good way to round out that helper.
I thought of one way to get a lateral assoc helper to work with preloads. We could tag the subquery as coming from these helpers. Then when extracting the source during the preload we know it's safe to use inner_query.from.source.
I think there are two issues here.
The first one is about abstracting a common piece of code:
from p in Post, as: :post,
inner_lateral_join: c in subquery(from c in Comments,
where: c.post_id == parent_as(:post).id)
into something smaller:
from p in Post, as: :post,
inner_lateral_join: lateral_assoc(p, :comments, :post)
My concern is: what if I want to further modify the subquery? I think this would trap us.
The second one is allowing preloads from subqueries:
from p in Post, as: :post,
inner_lateral_join: c in subquery(from c in Comments,
where: c.post_id == parent_as(:post).id),
preload: [comments: c]
I think this is something we would like to support anyway, if we can.
However, I have to ask, are we gaining anything for doing the lateral joins? Joins are more expensive because they duplicate the "left side". We could achieve similar result using windows. Or what if we could support lateral_preload? That builds the preload internally as a separate query? Are these possible?
For example, we could do this:
Repo.lateral_preload(posts, comments: Comment)
Then we build a query passing the post ids and we still use a lateral join, but we won't be duplicating the returned post rows.
Another idea is to augment preload to say, if a subquery is given, it will be treated as a lateral join (we can compile to windows in other adapters):
preload: [comments: subquery(Comment)]
If we find this too magical, we could introduce a lateral function:
preload: [comments: lateral(Comment)]
If we add such a new function, we could also use lateral in queries:
inner_join: c in lateral(from c in Comments, ...)
inner_join: c in lateral("SELECT * ...")
lateral would return a Ecto.Query.Lateral and it will either wrap a string (representing a fragment) or a query (or a subquery? 🤔 ). This way we could even deprecate inner_lateral, left_lateral, and cross_lateral in the future.
Thanks for the response. I'll try to group my thoughts into big buckets
Is it worth it to support lateral join preloads
I believe it is. Even though the same end result can be obtained using window functions, the way the results are grabbed are quite different and lateral joins will be much more performant in some cases. This is the example I tend to refer to for this.
Duplicating the left side of the join
There will be cases where lateral join can be used without duplicating. For example if you join Company onto Movie and want to find the most recent movie per company. I also think it has parity with the other joins, where people can choose if the duplication is worth it on a case by case basis.
Preloading a subquery in a join
I actually tried this route first and where I got stuck is we are using primary keys to match the parent/child records. And we have no universal way to get the primary key for a subquery. It would have to be marked as a special case somehow, which is where I really like your idea to have something like from s in Source, inner_join: a in lateral(Assoc), preload: [assoc: a]. That will let us mark it as special and we will know how to extract the source from the subquery.
Whether we need a separate query lateral preload
I have to look at this closer. My initial thought is I don't know if we need it. I'm pretty sure the separate query preload is already flexible enough to let the user do this through a custom query. So then it's whether or not adding the shortcut is worth it. And I'm not too sure atm.
Conlusion
I would like to investigate using a new lateral function like you suggested for join queries only atm.
Thanks for continuing to work on this.
My original intention to go the route with the custom assoc was to avoid having to rewrite every query in our app that uses joins the association backed with a view, which is quite slow.
But now I see that it's not really feasible, and I will look into rewriting them all using the join_lateral keyword we have today (not as simple as it sounds, since the queries are generated quite dynamically).
However I'm excited for the lateral preloads, this could also give us a nice performance boost in the same area.
Thanks @greg-rychlewski. I understand that lateral joins may be valid in the same query, when they return a single result, but that's already possible today and then we need to ask if that's the pattern we should promote. I would rather promote something that will work better for most cases. :)
I actually tried this route first and where I got stuck is we are using primary keys to match the parent/child records. And we have no universal way to get the primary key for a subquery.
What if we support assoc(p, :comments, subquery)?
I'm pretty sure the separate query preload is already flexible enough to let the user do this through a custom query.
Unfortunately it is not, unless we use windows or something. It is possible with a function, if we use unnest(ids) and do a lateral join, which is what I would propose for us to encapsulate. We could even introduce a Ecto.Preloader protocol, so we could pass anything as the preload value, and it would invoke the protocol. So we could do:
preload(comments: any_value)
And we call the protocol with the any_value, the IDs, and the relation struct. And then anyone could implement preload(comments: lateral(...)).
For the separate preload query, I think I'm probably not seeing something you are. The biggest blocker for me mentally is that the behaviour of lateral joins depends on it being part of the same query. For instance this query:
from p in Post, lateral_join: s in subquery(...)
It's essentially a for loop like this
for row in Post
evaluate subquery with current values of row
I'm having a hard time picturing how we would do this with a separate query without it being a duplicate of the main query. Because it wouldn't be a single evaluation of a query with a list of ids. It would need to evaluate it once for each id and also need access to the other values in the row. I'm not sure if you had a particular example in mind?
I was thinking you would write it as select * from unnest($1) as id left lateral join (select … from comments where c.post_id == id …). Wouldn’t that work?
Oh unnest will only work on array types. So databases like MySQL won't be able to do that. I think I get what you mean though. Basically the preload query will still do the join but not send back the entire row duplicated. Only the primary key.
I'm not too sure if there is a consistent way to generate a table structure from a list of ids across all databases. We can maybe do the join on the parent itself selecting only its id: select parent.id, child.* from parent inner lateral join (select * from child where child.parent_id = parent_id) on true where parent.id in (...). This way we would also have access to the other fields in the parent that can be referenced by the subquery. But we have to be sure the planner won't do something weird like join everything and then filter the ids after. It also feels a bit weird to traverse the parent in the separate query because it's not free. So whether or not creating the extra query is worth it might be a bit more uncertain than a normal join.
I think if we do separate preload query for lateral we also want to enforce they are providing a subquery. Because let's say they just provide the association name or struct or something like that. That would possibly hurt their performance because there's no point to doing a lateral join unless you are doing some kind of windowing calculations. If there is nothing like that then you are essentially forcing the db to do a regular join in a nested loop fashion instead of letting it choose the best strategy.
Agreed on everything you said. Perhaps the smart choice for now is to allow a 2-arity function in preloader that receives the IDs and the association metadata, so people can build their own lateral preloaders?
That sounds good to me. Would you mind double checking I am understanding everything correctly:
Current Situation
- Users cannot use a custom preload query because they don't have control over where the parent ids go. Ecto injects
where: child.parent_id in parent_idsinto the preload query automatically - Users can use a custom preload function, but it cannot be general because it's only passed the parent ids and not the information about the association.
Suggested Change
- If we add a 2-arity custom preload function then we can pass the association information and users can have one function to handle all lateral preloads
And as an aside, would you still be interested in adding this for join preloads:
What if we support assoc(p, :comments, subquery)?
If we add a 2-arity custom preload function then we can pass the association information and users can have one function to handle all lateral preloads
Yes!
And as an aside, would you still be interested in adding this for join preloads:
Yes!