
WIP Fix issue #7674 about UPDATE SET(..), with indirection

Open c2main opened this issue 1 year ago • 26 comments

Issue reported to Data Bene support. See #7674

The fix looks good: at least it breaks nothing and the bug is no longer visible.

I am not sure about adding all the regression tests where I did; maybe they should go in their own file?

Also, the PR is not yet ready regarding citus_indent, but you can already provide some feedback.

c2main avatar Aug 23 '24 14:08 c2main

@naisila just a message to help you find this PR when you get a chance to look at it...

c2main avatar Nov 19 '24 20:11 c2main

#5692 is also trying to fix quite a similar problem. Reviewers had objected to that PR because we want to avoid modifying ruleutils as much as possible and try to fix such issues through the distributed planner. Have you tried to investigate this issue through the lens of the distributed planner hook?

Thank you for the feedback. No, I didn't look at the distributed planner hook; I will evaluate this approach.

c2main avatar Nov 21 '24 09:11 c2main

I just looked:

  • apparently, by doing it at this level, we'll have to test for the conditions sooner, thus impacting all queries (a bit like HasUnresolvedExternParamsWalker() does).
  • and it's not clear to me yet whether patching just a single "planner" will be enough...

I understand the motivation not to touch ruleutils too much, but my understanding is also that ruleutils was imported precisely to achieve a goal that is not being achieved, because it has several limitations for the reasons well explained in Onur's PR (the tree is rewritten and optimized by Postgres before it reaches the hands of Citus).

By touching only the distributed planner, I fear the bug will just reappear in some corner-case situation, but I am still exploring. I'm sharing here as I'm sure these points have been discussed in the past, though I didn't search the history...

c2main avatar Nov 21 '24 12:11 c2main

The pg_update_query_def..() functions are internal to ruleutils; the entry point for Citus is pg_get_query_def. And that last one handles all the cases down to the function I patched.

If the patch goes elsewhere, then I suppose it will be necessary to do essentially what pg_query_def() is doing and more or less extract this code (and there is more than one function involved) into citus_ruleutils.c or similar. Even with a very precise walker, I think a performance impact may also appear if we need to check on each iteration, or have the code do its own recursion.

IMHO, this is a "codepath" expected to diverge a lot from upstream because the two serve distinct purposes; for many cases the solution is just to not allow that in Citus, and maybe that's an alternative to consider here, as with some other similar queries.

Maybe I missed some alternatives?

c2main avatar Nov 21 '24 18:11 c2main

Thanks for your investigations. The team will look at this in more detail as soon as we get a chance.

naisila avatar Nov 22 '24 07:11 naisila

I've tried to patch at a lower layer; it's possible, but less convenient and more error-prone. So I've pushed it up into the distribute_planner hook function, applied only when needsDistributedPlanning is true.

If this works for you, I can squash the commits and maybe improve names/comments?

klando avatar Nov 22 '24 21:11 klando

@naisila @onurctirtir is this aligned with what you had in mind?

c2main avatar Dec 02 '24 10:12 c2main

This PR has been open for several months already, and it fixes a silent bug that alters data. What is required to increase its priority or get it some more attention?

c2main avatar Dec 13 '24 06:12 c2main

@c2main Thanks for your work on this, we appreciate your reworking the fix to apply it in the distributed planner hook. We would like the fix to be detected and applied at a lower level than in the current PR, for example in CreateModifyPlan() (or at a lower level, in RouterJob()). If this proves to be infeasible then we'd prefer to go with the first approach of applying the fix in citus_ruleutils.c. Thanks again for your efforts, we can work with you in progressing this.

colm-mchugh avatar Dec 20 '24 12:12 colm-mchugh

When patching in CreateModifyPlan(), the following query on a distributed table fails (only this one); apparently it does not enter that function at all (I haven't checked why):

with qq3 as (
    update test_dist_indirection
    SET (col_text, col_bool)
      = (SELECT 'full', true)
    where id = 3
    returning *
),
qq4 as (
    update test_dist_indirection
    SET (col_text, col_bool)
      = (SELECT 'fully', true)
    where id = 4
    returning *
)
select * from qq3 union all select * from qq4;

However, adding my rewrite function just before PlanRouterQuery() in RouterJob() does the job.

Maybe it's possible to push down a bit more in PlanRouterQuery() but then there are those if/else cases to manage precisely:

if (fastPathRouterQuery)
...
else
...
if (isMultiShardQuery)
...
else
...

Is it worth pursuing in this direction ?

In the meantime, I have updated the PR with the RouterJob() patch.

klando avatar Dec 23 '24 15:12 klando

Codecov Report

Attention: Patch coverage is 20.58824% with 54 lines in your changes missing coverage. Please review.

Project coverage is 48.47%. Comparing base (0e6127c) to head (c5e1c1e). Report is 34 commits behind head on main.

:x: Your patch check has failed because the patch coverage (20.58%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage. :x: Your project check has failed because the head coverage (48.47%) is below the target coverage (87.50%). You can increase the head coverage or adjust the target coverage.

:exclamation: There is a different number of reports uploaded between BASE (0e6127c) and HEAD (c5e1c1e).

HEAD has 81 uploads less than BASE
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #7675       +/-   ##
===========================================
- Coverage   89.19%   48.47%   -40.73%     
===========================================
  Files         283      284        +1     
  Lines       61050    60674      -376     
  Branches     7626     7489      -137     
===========================================
- Hits        54453    29410    -25043     
- Misses       4412    28606    +24194     
- Partials     2185     2658      +473     

codecov[bot] avatar Dec 23 '24 15:12 codecov[bot]

However, adding my rewrite function just before PlanRouterQuery() in RouterJob() does the job.

Nice.

Maybe it's possible to push down a bit more in PlanRouterQuery() but then there are those if/else cases to manage precisely:

if (fastPathRouterQuery)
...
else
...
if (isMultiShardQuery)
...
else
...

Is it worth pursuing in this direction ?

Yes, but if it proves too cumbersome because of the different query types, it's probably not worth pursuing; for example, if special-case code is required for each query type and PlanRouterQuery() gets cluttered. But please investigate at least; we'll leave it to your judgement whether it's worth adjusting the PR as a consequence.

colm-mchugh avatar Dec 24 '24 01:12 colm-mchugh

Pushing down a little bit more, two call sites are required:

diff --git a/src/backend/distributed/planner/multi_router_planner.c b/src/backend/distributed/planner/multi_router_planner.c
index 84352415f..36aaeeecd 100644
--- a/src/backend/distributed/planner/multi_router_planner.c
+++ b/src/backend/distributed/planner/multi_router_planner.c
@@ -1869,13 +1869,6 @@ RouterJob(Query *originalQuery, PlannerRestrictionContext *plannerRestrictionCon
        }
        else
        {
-               /*
-                * We may need to reorder parts of the planner tree we are receiving here.
-                * We expect to produce an SQL query text but our tree has been optimized by
-                * PostgreSL rewriter already...
-                * FIXME is there conditions to reduce the number of calls ?
-                */
-               RebuildParserTreeFromPlannerTree(originalQuery);
                (*planningError) = PlanRouterQuery(originalQuery, plannerRestrictionContext,
                                                                                   &placementList, &shardId, &relationShardList,
                                                                                   &prunedShardIntervalListList,
@@ -2364,6 +2357,8 @@ PlanRouterQuery(Query *originalQuery,
 
                if (!IsMergeQuery(originalQuery))
                {
+                       /* OK */
+                       RebuildParserTreeFromPlannerTree(originalQuery);
                        planningError = ModifyQuerySupported(originalQuery, originalQuery,
                                                                                                 isMultiShardQuery,
                                                                                                 plannerRestrictionContext);
@@ -2450,6 +2445,8 @@ PlanRouterQuery(Query *originalQuery,
        bool isUpdateOrDelete = UpdateOrDeleteOrMergeQuery(originalQuery);
        if (!(isUpdateOrDelete && RequiresCoordinatorEvaluation(originalQuery)))
        {
+               /* OK */
+               RebuildParserTreeFromPlannerTree(originalQuery);
                UpdateRelationToShardNames((Node *) originalQuery, *relationShardList);
        }
 
  1. the first call, before ModifyQuerySupported(), is not "enough" on its own.
  2. it's counterintuitive (for me) to have to call it before UpdateRelationToShardNames() (I tried to push down into that one, but it leads to more calls to RebuildParserTreeFromPlannerTree()). I was wondering whether some function names are misleading or whether I'm just facing a side effect; either way, it looks a bit wrong to go further in this direction: eventually we'll get back to ruleutils being called only when appropriate.

At the moment, RebuildParserTreeFromPlannerTree() can be called several times; the most complex case is the failing query mentioned earlier (with a double CTE on UPDATE, followed by a SELECT on the CTEs), which shows that each CTE plus the SELECT is handled separately (3 calls to completely rewrite the query).

So what are we trying to solve here?

  1. fewer calls to the rewrite: it has to be done very early, on a larger query tree
  2. fewer calls to the rewrite: it has to be done very late, only when rewriting the query with ruleutils and actively entering the path of interest.
  3. in-between: try to figure out where there are few calls, where it is not called when not required and not re-called for nothing (while pushing down, I also got cases with 5 "rewrites" instead of the current 3 for the "complex query").

There is very probably a design choice to be made, if possible not by me but by the Citus committers, about what contract Citus wants to have between the parser tree, the rewriter tree and the planner tree:

  1. fix ruleutils as we go, to work around the side effects of the rewriter on the planner tree we receive
  2. assume the planner tree is not good for all situations, and fix early in the process in order to revert the planner tree as close as possible to the parser tree (before "rewrite").
  3. patch here and there and accept the associated maintenance cost.

My understanding is that the current contract is the 3rd (which includes a bit of contract 1 and contract 2). My design preference is 2 (it may allow Citus to handle more queries), then 1 for workarounds (and evaluating fixes upstream).

From a PostgreSQL point of view, we also have problems around this topic (showing a VIEW definition IIRC, and for ANALYZE text output). Having the parser tree available would be very handy. The other direction is maybe to push ruleutils fixes back to PostgreSQL, if at all possible, to reduce the diff with upstream.

c2main avatar Dec 24 '24 11:12 c2main

Nicely put, thanks for laying out the options. We are currently weighing up 1 and 2. There may be unintended consequences to 2 (and 3), because Citus is passing a partially un-rewritten Query tree to standard_planner() and the Postgres planner assumes that the given query has been produced by the parser and rewriter (per the comment on subquery_planner() in planner.c). The hazard (or not) of doing this is something I want to look into, but if you have any thoughts please share.

You mentioned the performance impact of 2 in some earlier comments; perhaps this is something that could be assessed with benchmarking. As with any benchmarking, coming up with a set of queries that push the feature (or patch) is the main challenge.

There is potentially another way to achieve 2, which is to use Postgres' post_parse_analyze_hook. This could give Citus the opportunity to capture the pre-rewrite query (or aspects of it, such as the target list of an UPDATE query) and then use that information as needed. It could reduce the need for the un-rewrite code here (and also #5692). But defining a new hook for Citus is probably beyond the scope of this PR, and should be done holistically.
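To illustrate that idea (this is not something the PR does), installing such a hook chain from the extension's _PG_init() would look roughly as follows; the hook type and chaining pattern are standard Postgres (PG 14+), while citus_post_parse_analyze() and CaptureOriginalTargetList() are hypothetical names:

```c
#include "postgres.h"
#include "parser/analyze.h"		/* post_parse_analyze_hook */

/* hypothetical helper that would stash pre-rewrite information somewhere */
extern void CaptureOriginalTargetList(Query *query);

/* previously installed hook, so we can chain to it */
static post_parse_analyze_hook_type prev_post_parse_analyze_hook = NULL;

static void
citus_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
{
	/* capture the pre-rewrite UPDATE target list before the rewriter runs */
	if (query->commandType == CMD_UPDATE)
		CaptureOriginalTargetList(query);

	if (prev_post_parse_analyze_hook)
		prev_post_parse_analyze_hook(pstate, query, jstate);
}

void
_PG_init(void)
{
	/* simplified: the extension's real _PG_init() does much more than this */
	prev_post_parse_analyze_hook = post_parse_analyze_hook;
	post_parse_analyze_hook = citus_post_parse_analyze;
}
```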

I verified that your PR fixes #4092, a similar issue. Could you update the test cases to cover that? Also, it would be great if this PR could assess and include the logic and test cases from #5692, as that also fixes a problematic UPDATE statement (the Postgres rewrite merges the UPDATE targets into one single target). What's your take on including #5692?

colm-mchugh avatar Jan 09 '25 22:01 colm-mchugh

Thank you for the feedback @colm-mchugh.

FYI I am still studying alternative ways to achieve the goal, including the hook mentioned. I am also taking a step back to look at the picture more globally (Postgres FDW does not support those statements because of the hard work required to undo those "optimizations" from the rewriter... so it looks interesting to have a patch for Postgres FDW and a similar one for Citus, if at all possible).

c2main avatar Jan 15 '25 09:01 c2main

Sure, thanks for the update. We are leaning toward applying the fix in ruleutils, or at least avoiding passing an un-rewritten query tree to the Postgres planner in the distributed planner hook. The targetlist un-rewriting could inhibit Postgres' use of the physical tlist optimization when creating a scan node, which is not a concern for Citus distributed planning, but there could be potential problems in the future; we'd like to keep to the Postgres planner's implicit expectation of receiving a rewritten query tree. But please continue to investigate and propose the solution according to your judgement.

Postgres FDW is interesting: it does not include ruleutils, but its deparse.c is “annoyingly duplicative of ruleutils.c”, to quote its comment. This duplication seems to be in some functions for deparsing expressions (deparseConst(), deparseFuncExpr(), deparseAggreg()). Of the extensions that come with Postgres, only one (pageinspect) uses ruleutils (specifically, the function pg_get_indexdef_extended()), so Citus seems somewhat isolated in the extent of its dependency on ruleutils.

colm-mchugh avatar Jan 15 '25 16:01 colm-mchugh

I completely agree with the points you're raising here; I was concerned about this angle too (and I confirm the planner does expect the ordered tlist, at least per a comment in rewriteHandler.c, IIRC).

Yeah, I appreciated the preliminary comment in that deparse.c file. You may have seen this one, which looks like a good description of the problem:

				/*
				 * If it's a MULTIEXPR Param, punt.  We can't tell from here
				 * whether the referenced sublink/subplan contains any remote
				 * Vars; if it does, handling that is too complicated to
				 * consider supporting at present.  Fortunately, MULTIEXPR
				 * Params are not reduced to plain PARAM_EXEC until the end of
				 * planning, so we can easily detect this case.  (Normal
				 * PARAM_EXEC Params are safe to ship because their values
				 * come from somewhere else in the plan tree; but a MULTIEXPR
				 * references a sub-select elsewhere in the same targetlist,
				 * so we'd be on the hook to evaluate it somehow if we wanted
				 * to handle such cases as direct foreign updates.)
				 */

I also have a better understanding of the Citus team's motivation to patch somewhere other than ruleutils. Along the way, I was (and still am) also considering adding barriers to simply prevent the known bad cases from happening (for example, it looks easy, safe and free to check whether the tlist has been modified and reordered, and if so, ERROR with an informative message). Maybe this is the first thing to do, as it is hopefully a less intrusive approach, and it buys us more time to think about it.
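As an illustration of that barrier idea, a minimal sketch (not code from this PR) that errors out when the rewritten target list carries MULTIEXPR Params that are no longer in ascending paramid order; the function name and the exact condition are assumptions:

```c
#include "postgres.h"
#include "nodes/pg_list.h"
#include "nodes/primnodes.h"

/* hypothetical barrier: refuse to deparse a reordered multi-column SET target list */
static void
ErrorIfMultiExprTargetListReordered(List *targetList)
{
	int			lastParamId = -1;
	ListCell   *lc;

	foreach(lc, targetList)
	{
		TargetEntry *tle = lfirst_node(TargetEntry, lc);

		if (tle->resjunk || !IsA(tle->expr, Param))
			continue;

		Param	   *param = (Param *) tle->expr;

		if (param->paramkind != PARAM_MULTIEXPR)
			continue;

		if (param->paramid < lastParamId)
			ereport(ERROR,
					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
					 errmsg("UPDATE ... SET (...) = (SELECT ...) with reordered "
							"target columns is not supported")));

		lastParamId = param->paramid;
	}
}
```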

c2main avatar Jan 15 '25 17:01 c2main

That comment in deparse.c on MULTI_EXPR param nodes reminds me of #7676, the other UPDATE issue you found, where the target expression passed to Postgres execution initialization has a MULTI_EXPR param node, and the Postgres executor says, in effect, "don't know what to do with this". Citus should put expressions through a setrefs.c-like pass before handing them off to the Postgres executor, but we haven't yet put time into figuring out how that should be done.

Returning to this issue, we (Citus committers) have internally agreed that ruleutils is the best place, or the most appropriate place given the constraints, for the fix. However, you make a good point about preventing the known problematic cases as an interim measure; it is less intrusive and lower risk, and affords more time to land the solution, as you say. Please explore, and propose in this direction as you see fit.

colm-mchugh avatar Jan 17 '25 22:01 colm-mchugh

I have added more tests related to this PR and found another problem in the insert/select planner, fixed in https://github.com/citusdata/citus/pull/7912

c2main avatar Feb 27 '25 05:02 c2main

Just pushed some more tests (see indirections.sql), plus an ERROR on the cases not well handled so far.

c2main avatar Feb 27 '25 09:02 c2main

OK, I have added a list sort function to order the target list by paramid value: those always increase, with a large increment when switching sublinks. It looks simple enough, though I may rework the need_to_sort_target_list part.

As such it fixes all the related issues, I think, except the one with a now() call:

+update test_ref_multiexpr
SET (col_timestamp)
  = (SELECT now())
returning id, col_int, col_bool;
DEBUG:  Creating router plan
ERROR:  unrecognized paramkind: 3

I will update this PR accordingly, at least with an identified ERROR message for this case, and will fix it in another PR (related to this other issue: https://github.com/citusdata/citus/issues/7676).
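For reference, paramkind 3 in that error corresponds to PARAM_MULTIEXPR in Postgres' ParamKind enum (nodes/primnodes.h), the kind of Param that SET (...) = (SELECT ...) targets carry after rewriting:

```c
/* from Postgres' nodes/primnodes.h */
typedef enum ParamKind
{
	PARAM_EXTERN,				/* 0 */
	PARAM_EXEC,					/* 1 */
	PARAM_SUBLINK,				/* 2 */
	PARAM_MULTIEXPR				/* 3: the kind reported as "unrecognized" above */
} ParamKind;
```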

c2main avatar Mar 02 '25 19:03 c2main

@c2main we'd like to get this into the upcoming Citus 13.1 release, so we just want to check whether it is possible for you to rebase onto the current main branch, taking into account the following details, within the next couple of weeks or by the end of this month (Apr '25):

  • Fixes for #7676 and #7912 are in Citus main branch (thanks!).
  • Supported PG versions are 17, 16, 15
  • Can the logic be moved into citus_ruleutils.c and called from each supported PG's ruleutils_xx.c? For example, citus_ruleutils.c defines and exports a function that checks whether the targetlist is in paramid order and sorts it if not, and this function is called by get_update_query_targetlist_def(), maybe after the source sublinks have been collected (a rough sketch follows at the end of this comment). It looks like ruleutils_15.c errors out instead of re-ordering by paramid? But can all supported PG versions use the re-order-when-necessary approach?

Let us know how that sounds, and whether it's reasonable to target 13.1. And thanks again for your efforts.
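To make the suggestion in the last bullet concrete, here is a rough sketch of what such an exported helper could look like, assuming the list to be sorted consists solely of non-junk PARAM_MULTIEXPR entries; the function name and the guard are assumptions, not the PR's actual code:

```c
#include "postgres.h"
#include "nodes/pg_list.h"
#include "nodes/primnodes.h"

/* comparator for list_sort(): order target entries by their Param's paramid */
static int
CompareTargetEntriesByParamId(const ListCell *a, const ListCell *b)
{
	Param	   *paramA = (Param *) ((TargetEntry *) lfirst(a))->expr;
	Param	   *paramB = (Param *) ((TargetEntry *) lfirst(b))->expr;

	return paramA->paramid - paramB->paramid;
}

/*
 * Hypothetical helper exported from citus_ruleutils.c: restore the original
 * SET (...) column order by sorting on paramid, but only if every entry
 * really is a non-junk MULTIEXPR Param.
 */
void
SortTargetListByMultiExprParamId(List *targetList)
{
	ListCell   *lc;

	foreach(lc, targetList)
	{
		TargetEntry *tle = lfirst_node(TargetEntry, lc);

		if (tle->resjunk || !IsA(tle->expr, Param) ||
			((Param *) tle->expr)->paramkind != PARAM_MULTIEXPR)
			return;				/* anything else: leave the list untouched */
	}

	list_sort(targetList, CompareTargetEntriesByParamId);
}
```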

colm-mchugh avatar Apr 15 '25 15:04 colm-mchugh

:+1:

That sounds good. There is also the point that we need to WARN/ERROR, or have an option to WARN/ERROR, when we detect a query that may have corrupted data previously.

c2main avatar Apr 16 '25 03:04 c2main

Yes, I will discuss this with the Citus committers and get back to you on this specific point. In the meantime, I'll leave some comments on the PR.

colm-mchugh avatar Apr 16 '25 16:04 colm-mchugh

So, after discussing this with the team, the preferred approach is to highlight the issue in the release notes and changelog. This is how such issues have been dealt with in the past. Do you have any suggestions/thoughts as to how the condition can be detected?

colm-mchugh avatar Apr 17 '25 14:04 colm-mchugh

How are such issues identified in the release notes? I searched for "corruption" in https://github.com/citusdata/citus/blob/main/CHANGELOG.md but didn't find anything.

Honestly, keeping that only in the messages of a minor release is not very user-friendly behavior, and it will probably be complex for users to evaluate whether they are affected or not.

c2main avatar Apr 24 '25 09:04 c2main

LGTM, thank you @colm-mchugh for the updates! It's clean.

It would be super nice if we still have room to add the GUC I proposed, to get log messages for behavior that is not stable between releases (at least minor releases!).

Sure, thanks again @c2main for your efforts on this. I hear you on the warning behavior; will take another look for feasibility, and look at Postgres commit history and other extensions for any precedents.
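For context, registering such a GUC would look roughly like the following; the GUC name, description and wrapper function are purely hypothetical, not part of this PR:

```c
#include "postgres.h"
#include "utils/guc.h"

/* hypothetical flag, checked wherever a previously mis-deparsed UPDATE shape is detected */
static bool WarnOnRewrittenMultiExprUpdate = false;

void
RegisterMultiExprUpdateWarningGuc(void)
{
	DefineCustomBoolVariable("citus.warn_on_rewritten_multiexpr_update",
							 "Emit a WARNING when planning an UPDATE ... SET (...) = (SELECT ...) "
							 "whose target list was reordered by the rewriter.",
							 NULL,
							 &WarnOnRewrittenMultiExprUpdate,
							 false,
							 PGC_USERSET,
							 0,
							 NULL, NULL, NULL);
}
```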

colm-mchugh avatar Jul 18 '25 11:07 colm-mchugh

As long as we incorporate the tests from and the broken cases noted in #5692, I'm okay with merging this.

Yes, the new regression test multi_update_select.sql has test cases that were identified in #5692; in particular, it tests both router (fast path) and push-down plans on a distributed table.

Also, does that also fix #5621? If so, we can also mention this in the PR description to automatically close that issue as well:

Closes #5621.

Actually no, we didn't get #5621 in. Let's merge that separately ?

Also, let's add a line that starts with DESCRIPTION: Fixes .. to the PR description so that this appears in the changelog.

Done.

colm-mchugh avatar Jul 18 '25 11:07 colm-mchugh

~~Actually no, we didn't get #5621 in. Let's merge that separately?~~

@onurctirtir this PR has been updated to fix #5621 - incorporating your fix #5692 makes sense because it impacts the same codepath - can you take a look to check if it looks sane?

colm-mchugh avatar Jul 21 '25 14:07 colm-mchugh