sql icon indicating copy to clipboard operation
sql copied to clipboard

[RFC] Support JOIN command in PPL

Open LantaoJin opened this issue 1 year ago • 8 comments

Is your feature request related to a problem? Current JOIN syntax is only available in SQL query due to the implementation of JOIN is in v1 engine only. To support JOIN command in PPL, we should implement it in v2 engine. One proposal is reusing legacy JOIN implementation in v2 engine, but the legacy JOIN implementation lacks scalability and hardly to enhance, https://github.com/opensearch-project/sql/issues/49 listed many requirements of enhancement.

What solution would you like? To mitigate the impacts of customer SQL experience, an ideal process to support JOIN will be as following:

  • [ ] 1 Re-implement the JOIN operator in SQL v2 engine and provide a the JOIN command in PPL only.
  • [ ] #2920
  • [ ] #2921
  • [ ] #2922
  • [ ] #2923
  • [ ] #2924
  • [ ] #2919
  • [ ] #2925
  • [ ] 3 All enhancement work should base on v2 implementation except bug fixes.

PPL Join Syntax Options: (Updated at Sep 14)

~~#### Option 1: Join clause as a source extension existing in Search command and should be at the beginning of PPL.~~

SEARCH source=<left-table> AS <leftAlias> [joinType] JOIN <right-table> AS <rightAlias> ON joinCriteria 
| <other piped command>
  • Join clause is not a new PPL command, it's a source extension.
  • Since Join clause is a part of source (search command), it must fill at the beginning of ppl command.
  • Only tableSource could be used in Join sides.
  • The table aliases can be used directly.
  • Join reorder optimization could be applied in future
  • This idea is from this paper

Here is an example: Original SQL (TPC-H Q13 with table alias):

SELECT c_count, COUNT(*) AS custdist
FROM
  ( SELECT c.c_custkey, COUNT(o.o_orderkey) c_count
    FROM customer AS c
    LEFT OUTER JOIN orders AS o ON c.c_custkey = o.o_custkey
         AND o.o_comment NOT LIKE '%unusual%packages%'
    GROUP BY c.c_custkey
  ) AS c_orders
GROUP BY c_count
ORDER BY custdist DESC, c_count DESC;

The rewriting PPL of Option 1:

SEARCH source=customer AS c LEFT OUTER JOIN orders AS o ON c.c_custkey = o.o_custkey AND o.o_comment NOT LIKE '%unusual%packages%'
| STATS count(o.o_orderkey) AS c_count BY c.c_custkey
| STATS count(*) AS custdist BY c_count
| SORT - custdist, - c_count

(Winner) Option 2: Join clause as a PPL command could be existing anywhere.

In Option 2, Join sides could be tableSource:

SEARCH source=<left-table>
| <other piped command>
| [joinType] JOIN hint.left = <leftAlias>, hint.right = <rightAlias> ON joinCriteria <right-table>
| <other piped command>

Also, in Option 2, Join sides can be sub-search (sub-search itself unsupported yet)

SEARCH source=<left-table>
| <other piped command>
| [joinType] JOIN hint.left = <leftAlias>, hint.right = <rightAlias> ON joinCriteria
  [ 
    SEARCH source=<right-table>
    | <other piped command>
  ]
| <other piped command>
  • Join clause is a new PPL command
  • Join command should start with | and exist anywhere.
  • Not only support tableSource but also support sub-search in Join sides.
  • The table aliases cannot be used directly. Aliases must be assigned in specific syntax (here we reuse hint syntax).
  • Hardly to apply Join reorder optimization if more than two sub-searches.
  • The syntax will be looks similar to SPL2 Join. Note: not SPL1 Join

The rewriting PPL of Option 2 for previous example:

SEARCH source=customer
| FIELDS c_custkey
| LEFT OUTER JOIN hint.left = c, hint.right = o ON c.c_custkey = o.o_custkey
   [
      SEARCH source=orders
      | WHERE o_comment NOT LIKE '%unusual%packages%'
      | FIELDS o_orderkey, o_custkey
   ]
| STATS count(o.o_orderkey) AS c_count BY c.c_custkey
| STATS count(*) AS custdist BY c_count
| SORT - custdist, - c_count

Syntax Milestones:

1.1 Support two tables join

SEARCH source=<left-table>
| <other piped command>
| [joinType] JOIN left = <leftAlias> right = <rightAlias> ON joinCriteria <relation>
| <other piped command>

joinType: INNER | LEFT [OUTER] | [LEFT] SEMI | [LEFT] ANTI joinCriteria: <left-table>.<field>=<right-table>.<field> [AND <left-table>.<field>=<right-table>.<field>] or <leftAlias>.<field>=<rightAlias>.<field> [AND <leftAlias>.<field>=<rightAlias>.<field>]

<relation> could be sub-search, but sub-search is unsupported yet.

1.2 Support multiple tables join

SEARCH source=<left-table>
| <other piped command>
| [joinType] JOIN left = <leftAlias> right = <rightAlias> ON joinCriteria1 <relation1>
| <other piped command>
| [joinType] JOIN left = <leftAlias> right = <rightAlias> ON joinCriteria2 <relation2>
| <other piped command>
| [joinType] JOIN left = <leftAlias> right = <rightAlias> ON joinCriteria3 <relation3>
| <other piped command>

<relationX> could be sub-search, but sub-search is unsupported yet.

1.5 Support join hint

SEARCH source=<left-table>
| <other piped command>
| [joinType] JOIN left = <leftAlias> right = <rightAlias>
   [hint.left.key1=value1[ hint.right.key2=value2]]
   ON joinCriteria <relation>
| <other piped command>

<relation> could be sub-search, but sub-search is unsupported yet.

1.6 Support more join types

joinType: INNER | LEFT [OUTER] | [LEFT] SEMI | [LEFT] ANTI | FULL [OUTER] | CROSS | ...

What alternatives have you considered? No

Do you have any additional context? Add any other context or screenshots about the feature request here.

LantaoJin avatar Aug 08 '24 07:08 LantaoJin

[Catch All Triage - 1, 2, 3, 4, 5]

dblock avatar Aug 26 '24 16:08 dblock

  • +1 on 1 Re-implement the JOIN operator in SQL v2 engine and provide a the JOIN command in PPL only.
  • syntax looks good!
  • Do we consider different JOIN strategy? for instance, P0: support coordination-node join. P1: support broadcast join.
  • Can we colleberate with https://github.com/opensearch-project/OpenSearch/issues/15185 to support broadcast join?

penghuo avatar Sep 06 '24 20:09 penghuo

  • Do we consider different JOIN strategy? for instance, P0: support coordination-node join. P1: support broadcast join.
  • Can we colleberate with [RFC] Join support in OpenSearch OpenSearch to support broadcast join?

Thanks for bring the RFC of Core on table. All items are coordination-node joins. I will update this RFC once the https://github.com/opensearch-project/OpenSearch/issues/15185 could provide (or mention in milestones) an API to leverage here to support BCJ.

LantaoJin avatar Sep 09 '24 03:09 LantaoJin

@LantaoJin this looks great ! Are you also thinking about the planning step of the join order (in more than 2 tables) ? If this is the case the query planner should employ a statistical based cost estimator, can you please add this into the design ?

YANG-DB avatar Sep 09 '24 23:09 YANG-DB

I added two options of Join PPL syntax. Would you provide any comment based on the new updates? @penghuo @YANG-DB @dai-chen

LantaoJin avatar Sep 10 '24 03:09 LantaoJin

@LantaoJin this looks great ! Are you also thinking about the planning step of the join order (in more than 2 tables) ? If this is the case the query planner should employ a statistical based cost estimator, can you please add this into the design ?

Join reorder maybe depend on which syntax option do we choose as well. See my latest updates (two options).

LantaoJin avatar Sep 10 '24 03:09 LantaoJin

@LantaoJin thanks for the excellent review and suggestions - IMO option 2 if more corresponding to the pipeline style of syntax and also does look familiar from other pipe based languages... My vote if with option #2 P.S - reordering the join is a separate aspect of the join syntax style and should allow query rewrite (performance driven) in any syntax option we select...

YANG-DB avatar Sep 10 '24 16:09 YANG-DB

We support INNER, OUTER LEFT, and CROSS joins in SQL so that should be our baseline in PPL as well. Option 2 it sounds like it will give us more flexibility to grow as well in the future which will be better for the community.

brijos avatar Sep 11 '24 15:09 brijos

The implementation of PPL join command will be addressed as a part of https://github.com/opensearch-project/sql/issues/3229

LantaoJin avatar Jan 16 '25 06:01 LantaoJin

resolved in calcite engine since 3.0.0

LantaoJin avatar Jun 25 '25 10:06 LantaoJin