ecto icon indicating copy to clipboard operation
ecto copied to clipboard

[Discussion] Support lateral joins and parent_as in association's joins_query

Open narrowtux opened this issue 1 year ago • 13 comments

Now you can create a custom Ecto.Association that uses lateral joins:

defmodule MyAssoc do
  @behaviour Ecto.Association

  def joins_query(%{owner: owner, owner_key: owner_key, related: related} = assoc, join_parent) do
    import Ecto.Query
    subquery = from r in related,
      where: field(r, ^owner_key) == field(parent_as(^join_parent), :id),
      join: p in assoc(r, :profile),
      select: %{
        data: fragment("jsonb_object_agg(?, ?)", p.slug, r.data)
      },
      group_by: field(r, ^owner_key)

    from o in owner, inner_lateral_join: ^subquery(subquery), on: true
  end

  # ...
end

Motivation

Today I've been playing around with a custom Ecto.Association (as a module behaviour) to solve a performance issue.

The gist of the performance issue is that we've been using a postgres view which has a schema in elixir, and we define a simple has_one association from another schema to that view, to be able to preload this view or sometimes add joins to it. The problem with this approach is that due to the nature of the view's query, the postgresql planner doesn't filter the rows by the foreign key written in the ON-clause of the join, but enumerates all the records of the view in memory (120k on our machine) and only then applies the filter from the on clause (leaving only one record). Of course, this takes a lot of time, about 1 seconds for this query vs. about 1ms for an optimized query that uses only index scans.

The optimized query uses JOIN LATERAL and applies the filter inside the subquery, so the planner can use indexes on that table.

It would be a very simple fix for us, if custom associations could support lateral joins and parent_as, because we'd only have to write the assoc and leave the rest of the source code untouched.

In my opinion, only 2 changes are necessary, and I've already written some code in ecto that shows promising results.

Both changes are in the query planner, when it calls the joins_query callback of Ecto.Association:

First, the planner needs to merge a desired lateral join from the association with the desired join direction from the query. Only left and inner lateral joins are supported, so if the query wants an outer or right join, while the assoc wants a lateral join, we can raise some error.

Secondly, to support referencing a column from the assoc source, there needs to be some way to tell the joins_query callback the index of the assoc source. I solved this by adding a second argument to the callback, for backwards compatibility, the unary joins_query callback will be called when the binary isn't defined. But maybe there's a better solution for this that doesn't need to change the API so much.

I submit this PR after 3 hours of work, before putting in more effort, to see if it would be an acceptable way of solving this problem first.

narrowtux avatar Apr 10 '24 14:04 narrowtux

Without having too much time to look at it right now, my feeling is we can probably support something like this but in a simpler way.

Since we already generate queries for preloaded associations, we should be able to do wrap it in a subquery when we detect a lateral join. And I believe we already have the binding since this will have to be a single query preload so we should be able to use that in parent_as.

greg-rychlewski avatar Apr 11 '24 01:04 greg-rychlewski

For preloads I'm not so worried about it, for me it's mainly about the join: x in assoc(y, :z) syntax, that we use all over the codebase for this association to the view.

narrowtux avatar Apr 11 '24 07:04 narrowtux

So the idea is that, if someone does lateral_join: x in assoc(x, ...), we automatically write it as lateral join? Or what if we introduce a lateral_assoc helper?

josevalim avatar Apr 11 '24 07:04 josevalim

The usage of my custom assoc works like this:

in the schema:

defmodule Book do
  import Book.CommentStatsAssoc 

  schema "book" do
    comment_stats :comment_stats    
  end 
end

and in the query:

from b in Book, join: s in assoc(b, :comment_stats)

then the query would be planned like this:

from b in Book, 
  as: :book, #alias is actually a ref added by the planner
  inner_lateral_join: s in subquery(
    from c in Comment, 
      where: c.book_id == parent_as(:book).id, 
      select: %{count: count(c.id)}
  ),
  on: true

For me, it would be nice if I didn't have to write inner_lateral_join in the query, since it's deeply embedded in our codebase already. However, I can see how having your join "upgraded" to a lateral join could also be confusing for developers.

To be able to define a lateral joined association directly in the schema would be great too!

narrowtux avatar Apr 11 '24 08:04 narrowtux

I also thought about the problem with the parent_as, maybe instead of adding an alias to the parent table all the time, the planner could transform an alias found in the query returned by joins_query into an alias from the assoc parent ix to the same name.

def joins_query(assoc) do
  from b in Book, as: :__lateral_assoc_book__,
    inner_lateral_join: s in subquery(
      from c in Comment, 
        where: c.book_id == parent_as(:__lateral_assoc_book__).id, 
        select: %{count: count(c.id)}
    )
end

This is what I initially tried to write, but then I saw that the planner only takes the joins from the returned query and discards anything else. If the aliases were transformed as well, it could work (also if the lateral join references another join instead of the root table).

Then we don't need to introduce another callback with more arity.

narrowtux avatar Apr 11 '24 08:04 narrowtux

The issue is that the Ecto.Association API is private, we use it internally, but it isn't public. So I am worried about adding new callbacks that ourselves don't use, it would even be hard to test them. So I would prefer to surface the feature using proper APIs with some docs. :)

josevalim avatar Apr 11 '24 08:04 josevalim

Yeah adding a new callback isn't my favorite solution either. I think transforming the aliases is a much nicer and more readable solution.

I can also look into how a lateral helper could work which uses the new capabilities internally.

schema do
  lateral_one :comment_stats do
    from b in Book, 
      as: :book,
      inner_lateral_join: s in subquery(
        from c in Comment, 
          where: c.book_id == parent_as(:book).id, 
          select: %{count: count(c.id)}
      ),
      on: true
  end
end

and a similar lateral_many macro as well.

narrowtux avatar Apr 11 '24 08:04 narrowtux

It feels off to me to force the join type in the schema. It's so common to want to use the same table for many different joins that I think people will almost always have to duplicate the association just to do what they want.

greg-rychlewski avatar Apr 11 '24 11:04 greg-rychlewski

As soon as you have this association that uses a subquery with a filter that references some column from other queries, you can't use anything but lateral joins in SQL anyway.

It should be clear to users that a lateral association can only be used with lateral joins.

narrowtux avatar Apr 11 '24 12:04 narrowtux

Right, which is why I proposed lateral_assoc(...) (which we can validate as part of a lateral join). :)

josevalim avatar Apr 11 '24 12:04 josevalim

For example:

posts has an association on comments

sometimes I want the association to be joined in an inner join

sometimes I want the association to be joined in a lateral join

If I define it in the schema then I need 2 associations. If the join is specified in the query then I only need 1 association and I can have a helper that wraps it in a subquery and uses parent_as when I want the lateral join.

greg-rychlewski avatar Apr 11 '24 12:04 greg-rychlewski

sometimes I want the association to be joined in an inner join

How would you join a subquery that can only be used in a lateral join with an inner join? I feel like we're not on the same page.

narrowtux avatar Apr 11 '24 13:04 narrowtux

Something you might not be aware of is that today Ecto ~~will store~~ has a default query for each association. It's how separate query preloads work when you do something like this

from p in Post, preload: :comments

So what I am saying is that there is a helper that lets you leverage this query when you want to use the association in a lateral join. It should wrap it in a subquery and deal with parent_as.

I think you are very fixed on the subquery being part of the association definition. That is exactly what I am saying shouldn't happen.

greg-rychlewski avatar Apr 11 '24 13:04 greg-rychlewski

@josevalim I started looking at ways to implement this. We could have a lateral_assoc helper like this

from p in Post, as: :post, inner_lateral_join: c in lateral_assoc(p, :comments, :post)

which would get turned into this

from p in Post, as: :post,
inner_lateral_join: c in subquery(from c in Comments, 
where: c.post_id == parent_as(:post).id)

But we could not add a preload on it like this

from p in Post, as: :post,
inner_lateral_join: c in subquery(from c in Comments, 
where: c.post_id == parent_as(:post).id),
preload: [comments: c]

because we can't preload with subquery sources since we can't find the primary keys.

Do you think it is worth it to have a helper like this? I am on the fence. It is a nice abstraction over a potentially annoying subquery. But it is a decent downside not being able to preload.

greg-rychlewski avatar Jun 20 '24 21:06 greg-rychlewski

We can alternatively (but with the same downsides) keep using assoc but expand it to be assoc/3 and take in a parent binding. Then based on the join qualifier we raise or not if they didn't provide the parent binding. The joins with assoc/2 already don't work with the lateral join qualifiers because they are not wrapped in subqueries. So it might be a good way to round out that helper.

greg-rychlewski avatar Jun 20 '24 21:06 greg-rychlewski

I thought of one way to get a lateral assoc helper to work with preloads. We could tag the subquery as coming from these helpers. Then when extracting the source during the preload we know it's safe to use inner_query.from.source.

greg-rychlewski avatar Jun 20 '24 22:06 greg-rychlewski

I think there are two issues here.

The first one is about abstracting a common piece of code:

from p in Post, as: :post,
  inner_lateral_join: c in subquery(from c in Comments, 
  where: c.post_id == parent_as(:post).id)

into something smaller:

from p in Post, as: :post,
  inner_lateral_join: lateral_assoc(p, :comments, :post)

My concern is: what if I want to further modify the subquery? I think this would trap us.

The second one is allowing preloads from subqueries:

from p in Post, as: :post,
  inner_lateral_join: c in subquery(from c in Comments, 
  where: c.post_id == parent_as(:post).id),
  preload: [comments: c]

I think this is something we would like to support anyway, if we can.


However, I have to ask, are we gaining anything for doing the lateral joins? Joins are more expensive because they duplicate the "left side". We could achieve similar result using windows. Or what if we could support lateral_preload? That builds the preload internally as a separate query? Are these possible?

josevalim avatar Jun 21 '24 07:06 josevalim

For example, we could do this:

Repo.lateral_preload(posts, comments: Comment)

Then we build a query passing the post ids and we still use a lateral join, but we won't be duplicating the returned post rows.

Another idea is to augment preload to say, if a subquery is given, it will be treated as a lateral join (we can compile to windows in other adapters):

preload: [comments: subquery(Comment)]

If we find this too magical, we could introduce a lateral function:

preload: [comments: lateral(Comment)]

If we add such a new function, we could also use lateral in queries:

inner_join: c in lateral(from c in Comments, ...)
inner_join: c in lateral("SELECT * ...")

lateral would return a Ecto.Query.Lateral and it will either wrap a string (representing a fragment) or a query (or a subquery? 🤔 ). This way we could even deprecate inner_lateral, left_lateral, and cross_lateral in the future.

josevalim avatar Jun 21 '24 08:06 josevalim

Thanks for the response. I'll try to group my thoughts into big buckets

Is it worth it to support lateral join preloads

I believe it is. Even though the same end result can be obtained using window functions, the way the results are grabbed are quite different and lateral joins will be much more performant in some cases. This is the example I tend to refer to for this.

Duplicating the left side of the join

There will be cases where lateral join can be used without duplicating. For example if you join Company onto Movie and want to find the most recent movie per company. I also think it has parity with the other joins, where people can choose if the duplication is worth it on a case by case basis.

Preloading a subquery in a join

I actually tried this route first and where I got stuck is we are using primary keys to match the parent/child records. And we have no universal way to get the primary key for a subquery. It would have to be marked as a special case somehow, which is where I really like your idea to have something like from s in Source, inner_join: a in lateral(Assoc), preload: [assoc: a]. That will let us mark it as special and we will know how to extract the source from the subquery.

Whether we need a separate query lateral preload

I have to look at this closer. My initial thought is I don't know if we need it. I'm pretty sure the separate query preload is already flexible enough to let the user do this through a custom query. So then it's whether or not adding the shortcut is worth it. And I'm not too sure atm.

Conlusion

I would like to investigate using a new lateral function like you suggested for join queries only atm.

greg-rychlewski avatar Jun 21 '24 14:06 greg-rychlewski

Thanks for continuing to work on this.

My original intention to go the route with the custom assoc was to avoid having to rewrite every query in our app that uses joins the association backed with a view, which is quite slow.

But now I see that it's not really feasible, and I will look into rewriting them all using the join_lateral keyword we have today (not as simple as it sounds, since the queries are generated quite dynamically).

However I'm excited for the lateral preloads, this could also give us a nice performance boost in the same area.

narrowtux avatar Jun 21 '24 14:06 narrowtux

Thanks @greg-rychlewski. I understand that lateral joins may be valid in the same query, when they return a single result, but that's already possible today and then we need to ask if that's the pattern we should promote. I would rather promote something that will work better for most cases. :)


I actually tried this route first and where I got stuck is we are using primary keys to match the parent/child records. And we have no universal way to get the primary key for a subquery.

What if we support assoc(p, :comments, subquery)?

I'm pretty sure the separate query preload is already flexible enough to let the user do this through a custom query.

Unfortunately it is not, unless we use windows or something. It is possible with a function, if we use unnest(ids) and do a lateral join, which is what I would propose for us to encapsulate. We could even introduce a Ecto.Preloader protocol, so we could pass anything as the preload value, and it would invoke the protocol. So we could do:

preload(comments: any_value)

And we call the protocol with the any_value, the IDs, and the relation struct. And then anyone could implement preload(comments: lateral(...)).

josevalim avatar Jun 22 '24 07:06 josevalim

For the separate preload query, I think I'm probably not seeing something you are. The biggest blocker for me mentally is that the behaviour of lateral joins depends on it being part of the same query. For instance this query:

from p in Post, lateral_join: s in subquery(...)

It's essentially a for loop like this

for row in Post
  evaluate subquery with current values of row

I'm having a hard time picturing how we would do this with a separate query without it being a duplicate of the main query. Because it wouldn't be a single evaluation of a query with a list of ids. It would need to evaluate it once for each id and also need access to the other values in the row. I'm not sure if you had a particular example in mind?

greg-rychlewski avatar Jun 22 '24 13:06 greg-rychlewski

I was thinking you would write it as select * from unnest($1) as id left lateral join (select … from comments where c.post_id == id …). Wouldn’t that work?

josevalim avatar Jun 22 '24 14:06 josevalim

Oh unnest will only work on array types. So databases like MySQL won't be able to do that. I think I get what you mean though. Basically the preload query will still do the join but not send back the entire row duplicated. Only the primary key.

I'm not too sure if there is a consistent way to generate a table structure from a list of ids across all databases. We can maybe do the join on the parent itself selecting only its id: select parent.id, child.* from parent inner lateral join (select * from child where child.parent_id = parent_id) on true where parent.id in (...). This way we would also have access to the other fields in the parent that can be referenced by the subquery. But we have to be sure the planner won't do something weird like join everything and then filter the ids after. It also feels a bit weird to traverse the parent in the separate query because it's not free. So whether or not creating the extra query is worth it might be a bit more uncertain than a normal join.

I think if we do separate preload query for lateral we also want to enforce they are providing a subquery. Because let's say they just provide the association name or struct or something like that. That would possibly hurt their performance because there's no point to doing a lateral join unless you are doing some kind of windowing calculations. If there is nothing like that then you are essentially forcing the db to do a regular join in a nested loop fashion instead of letting it choose the best strategy.

greg-rychlewski avatar Jun 22 '24 15:06 greg-rychlewski

Agreed on everything you said. Perhaps the smart choice for now is to allow a 2-arity function in preloader that receives the IDs and the association metadata, so people can build their own lateral preloaders?

josevalim avatar Jun 22 '24 16:06 josevalim

That sounds good to me. Would you mind double checking I am understanding everything correctly:

Current Situation

  • Users cannot use a custom preload query because they don't have control over where the parent ids go. Ecto injects where: child.parent_id in parent_ids into the preload query automatically
  • Users can use a custom preload function, but it cannot be general because it's only passed the parent ids and not the information about the association.

Suggested Change

  • If we add a 2-arity custom preload function then we can pass the association information and users can have one function to handle all lateral preloads

And as an aside, would you still be interested in adding this for join preloads:

What if we support assoc(p, :comments, subquery)?

greg-rychlewski avatar Jun 22 '24 23:06 greg-rychlewski

If we add a 2-arity custom preload function then we can pass the association information and users can have one function to handle all lateral preloads

Yes!

And as an aside, would you still be interested in adding this for join preloads:

Yes!

josevalim avatar Jun 23 '24 08:06 josevalim