spark icon indicating copy to clipboard operation
spark copied to clipboard

[SPARK-39930][SQL] Introduce Cache Hints

Open yaooqinn opened this issue 3 years ago • 2 comments

What changes were proposed in this pull request?

This ticket intends to add query hints for cache behaviors, users can perform actions like the use/skip/cache/uncache, etc on the cached results by the following hints.

  • RESULT_CACHE: spark will use cache by default, this hint keeps this behavior but will cache the child plan if uncached yet.

  • NO_RESULT_CACHE:with this hint, we can skip the default behavior to bypass the cached data if it is cached

  • RESULT_UNCACHE: with this hint, we can remove the cached data.

RESULT_CACHE and NO_RESULT_CACHE are borrowed from oracle, FYI,https://docs.oracle.com/database/121/TGDBA/tune_result_cache.htm#TGDBA647., while RESULT_UNCACHE's added accordingly via the naming rule. Similar ideas can also be found in MySQL, FYI https://dev.mysql.com/doc/refman/5.7/en/query-cache-in-select.html

Maybe we can add more cache hints like RESULT_RECACHE to refresh the cached data, or maybe some cache behaviors that relate to metadata.

Why are the changes needed?

As a supplement for CACHE/UNCACHE commands, pure SQL users can operate the cache in their queries directly without extra definitions. Also, cache skipping is supported by hints.

Does this PR introduce any user-facing change?

yes, this PR brings some new hints

How was this patch tested?

  • new tests for ut

  • test dist by hand

image

yaooqinn avatar Aug 01 '22 08:08 yaooqinn

cc @cloud-fan @MaxGekk @HyukjinKwon @wangyum, PTAL when you have time.

yaooqinn avatar Aug 01 '22 08:08 yaooqinn

I like the idea.

HyukjinKwon avatar Aug 02 '22 03:08 HyukjinKwon

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions[bot] avatar Nov 11 '22 00:11 github-actions[bot]

To close the loop: CACHE TABLE abc AS SELECT id from range(0,1) should be sufficient. If it fails with view already exists, we can either rerun it with a different name, or drop the temp view first.

If we want to combine multiple SQL statements, SQL code block with control flow ability seems a better feature to build.

cloud-fan avatar Nov 14 '22 03:11 cloud-fan

For a multi-tenant scenario like Thrift Server, if we explicitly cache/uncache some plans or relations, it will affect other's request

yaooqinn avatar Nov 15 '22 01:11 yaooqinn