[SPARK-39624][SQL] Support coalesce partition through CartesianProduct

Open lsm1 opened this issue 3 years ago • 3 comments

What changes were proposed in this pull request?

Coalesce partitions for every group.

Why are the changes needed?

CoalesceShufflePartitions cannot optimize a plan that contains a CartesianProduct.

For example, in a query like the one below, if CoalesceShufflePartitions cannot be applied, t1 join t2 produces a large number of partitions. The number of result partitions is then the left partition count multiplied by the right partition count, which can be quite large.

SELECT * FROM (SELECT * FROM t3) t3
    JOIN (
        SELECT t1.key, t2.value FROM
            (SELECT * FROM t1) t1
            JOIN
            (SELECT * FROM t2) t2
            ON t1.value = t2.value
    ) t ON t3.key = t.key OR t3.value = t.value

It would be better to support partial optimization of plans that contain CartesianProductExec.
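
To make the partition arithmetic concrete, here is a minimal sketch in plain Python. It is a simplified model, not Spark's implementation: the `coalesce` helper, the partition sizes, and the 64 MB target are assumptions chosen for illustration, and coalescing each input side independently is one reading of "every group". It only shows how shrinking each side's partition count shrinks the product.

```python
from typing import List

MB = 1024 * 1024


def cartesian_partition_count(left: int, right: int) -> int:
    """A cartesian product yields one output partition per (left, right) pair."""
    return left * right


def coalesce(sizes: List[int], target: int) -> List[int]:
    """Greedily merge adjacent partitions while the merged size stays <= target.

    Hypothetical helper: a simplified stand-in for shuffle partition coalescing.
    """
    merged: List[int] = []
    current = 0
    for size in sizes:
        # Close the current group if adding this partition would exceed the target.
        if current > 0 and current + size > target:
            merged.append(current)
            current = 0
        current += size
    if current > 0:
        merged.append(current)
    return merged


# Assumed inputs: 200 small shuffle partitions on each side.
left_sizes = [1 * MB] * 200
right_sizes = [2 * MB] * 200

before = cartesian_partition_count(len(left_sizes), len(right_sizes))

left_coalesced = coalesce(left_sizes, 64 * MB)
right_coalesced = coalesce(right_sizes, 64 * MB)
after = cartesian_partition_count(len(left_coalesced), len(right_coalesced))

print(f"partitions without coalescing: {before}")  # 200 * 200 = 40000
print(f"partitions with coalescing:    {after}")   # 4 * 7 = 28
```

With these assumed sizes, the left side collapses from 200 to 4 partitions and the right side from 200 to 7, so the product drops from 40000 to 28 output partitions.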

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a test.

lsm1 avatar Jun 28 '22 10:06 lsm1

Can one of the admins verify this patch?

AmplabJenkins avatar Jun 28 '22 17:06 AmplabJenkins

Can you please rework your PR description, by stating what's the behavior before and after your change? That screenshot doesn't say anything here. You can remove it. Maybe use some before/after examples instead.

maryannxue avatar Jul 05 '22 15:07 maryannxue

+1 to @maryannxue's comment. Otherwise the idea looks good. Also cc @cloud-fan

ulysses-you avatar Jul 08 '22 01:07 ulysses-you

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions[bot] avatar Nov 26 '22 00:11 github-actions[bot]