sql,cloud: cloud SQL shell can crash a node

Open DrewKimball opened this issue 1 year ago • 1 comments

It's possible to crash a node in a CC cluster through the SQL shell by attempting to run COMMIT or ROLLBACK. This happens because requests through the cloud SQL shell run through an internal executor, which does not support ending the transaction from within a statement, so a panic results. A crash occurs for similar reasons after SHOW COMMIT TIMESTAMP. There may be other statements that are incompatible with the internal executor as well.

The following is a screenshot I took of a graph of kubernetes node restarts after running COMMIT and ROLLBACK through the shell on a CC dedicated test cluster:

[image: Screenshot 2024-05-23 at 5 47 04 PM]

We probably need to set up a list of disallowed statements for the cloud shell. For reference, we recently introduced a crdb_internal.execute_internally builtin function that has to do something similar:

https://github.com/cockroachdb/cockroach/blob/5d90eb7f6d58aa23882c3dd7e5649cabc064b3e4/pkg/sql/sem/builtins/generator_builtins.go#L3595-L3602

However, we may want to relax the restriction somewhat, since it probably prohibits some safe statements.
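
To make the deny-list idea concrete, here is a minimal sketch in Go. It is not the CockroachDB implementation and does not reproduce the check used by crdb_internal.execute_internally; the names `disallowedPrefixes`, `checkStatement`, and `hasPrefix` are hypothetical, and the list only contains the three statements mentioned in this issue. It matches on leading keywords of the raw statement text, which is an assumption made to keep the example self-contained; a real check would more likely inspect the parsed statement.

```go
package main

import (
	"fmt"
	"strings"
)

// disallowedPrefixes is a hypothetical deny-list of leading keywords.
// It only covers the statements named in this issue.
var disallowedPrefixes = [][]string{
	{"COMMIT"},
	{"ROLLBACK"},
	{"SHOW", "COMMIT", "TIMESTAMP"},
}

// hasPrefix reports whether words starts with every keyword in prefix.
func hasPrefix(words, prefix []string) bool {
	if len(words) < len(prefix) {
		return false
	}
	for i, p := range prefix {
		if words[i] != p {
			return false
		}
	}
	return true
}

// checkStatement returns an error if stmt begins with a disallowed prefix.
// Matching is case-insensitive and ignores trailing semicolons.
func checkStatement(stmt string) error {
	trimmed := strings.TrimRight(strings.TrimSpace(stmt), ";")
	words := strings.Fields(strings.ToUpper(trimmed))
	for _, prefix := range disallowedPrefixes {
		if hasPrefix(words, prefix) {
			return fmt.Errorf("%s is not supported in the cloud SQL shell",
				strings.Join(prefix, " "))
		}
	}
	return nil
}

func main() {
	for _, stmt := range []string{"SELECT 1;", "commit;", "SHOW COMMIT TIMESTAMP;"} {
		if err := checkStatement(stmt); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("allowed:", stmt)
	}
}
```

Prefix matching like this is deliberately coarse: it would also reject anything else that starts with ROLLBACK, which is the kind of over-restriction mentioned above.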

Jira issue: CRDB-38983

DrewKimball, May 24 '24 00:05

Hi @DrewKimball, please add branch-* labels to identify which branch(es) this C-bug affects.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl[bot], May 24 '24 00:05

We already add some context to any error returned by the internal executor in the run-query-via-api interface. We should consider also adding a panic-catcher with some logging in case we run into another assertion. We should make sure to log this to stderr in particular.

We might also consider turning all panics into explicitly returned errors, although we'd have to make sure this is safe.

DrewKimball, May 28 '24 16:05

[quoting @DrewKimball during postmortem] steps to reproduce:

  1. start cloud dedicated cluster (serverless would probably also work, but we don't have observability into killing a node in a serverless cluster)
  2. open the cloud SQL console
  3. send either COMMIT; or ROLLBACK; or SHOW COMMIT TIMESTAMP;

michae2, May 29 '24 18:05