
Add `jobs run-script` command

nfx opened this issue 5 years ago • 6 comments

Use case: databricks jobs run-script metastore-export.py as an equivalent for:

SUFFIX=$RANDOM  # capture once; $RANDOM yields a new value on every expansion
databricks workspace import metastore-export.py /tmp/metastore-export_$SUFFIX --language python --overwrite
# create a Spark Submit job - https://docs.databricks.com/dev-tools/api/latest/jobs.html#runs-submit
# wait for the job to complete, perhaps showing a status update every 30 seconds
databricks fs rm dbfs:/tmp/metastore-export_$SUFFIX

TBD: cluster parameters
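The three-step workaround above could be wrapped in a single script. A hedged sketch (untested against a live workspace, so it defaults to a dry run that only prints the commands; the job-submit/poll step is left as a TODO):

```shell
#!/usr/bin/env bash
# Sketch of the proposed `jobs run-script` behavior as a wrapper script.
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}
run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "$@"        # dry run: show the command
  else
    "$@"             # real run: execute the databricks CLI
  fi
}

SCRIPT=metastore-export.py
SUFFIX=$RANDOM                      # capture once so upload and cleanup match
REMOTE=/tmp/metastore-export_$SUFFIX

run databricks workspace import "$SCRIPT" "$REMOTE" --language python --overwrite
# TODO: create a Spark Submit run against $REMOTE and poll runs/get until done
run databricks fs rm "dbfs:/tmp/metastore-export_$SUFFIX"
```

Set DRY_RUN=0 to actually execute against a workspace.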

nfx avatar Oct 13 '20 10:10 nfx

Could be fixed by #455

nfx avatar May 04 '22 17:05 nfx

With #455 you could do this:

databricks execution-context command-execute-once --cluster-id <CLUSTER_ID> --command "$(cat metastore-export.py)" --wait

fjakobs avatar May 05 '22 07:05 fjakobs

@fjakobs how can we make that invocation simpler?

nfx avatar May 05 '22 08:05 nfx

Just sketching:

We could

  • move the command under the cluster group
  • make --wait=True the default
  • add an argument that reads the command from a file
  • add an option to reference a cluster by name

databricks cluster execute --cluster-name <CLUSTER_NAME> --command-file metastore-export.py
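The --cluster-name option implies a name-to-id lookup. A minimal sketch of that lookup, assuming the JSON from `databricks clusters list --output JSON` follows the Clusters API 2.0 response shape (top-level "clusters" list with cluster_id and cluster_name fields); treat the exact shape as an assumption:

```shell
# Resolve a cluster name to its cluster_id, reading the clusters-list
# JSON on stdin and taking the wanted name as the first argument.
resolve_cluster_id() {
  python3 -c '
import json, sys
wanted = sys.argv[1]
for c in json.load(sys.stdin).get("clusters", []):
    if c.get("cluster_name") == wanted:
        print(c["cluster_id"])
        break
' "$1"
}

# Usage against a live workspace (not run here):
# databricks clusters list --output JSON | resolve_cluster_id my-cluster
```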

fjakobs avatar May 05 '22 08:05 fjakobs

@fjakobs that looks a lot simpler!

How would the results be exported, e.g. if there is a return at the end of the script?

nfx avatar May 05 '22 10:05 nfx

We could have different modes. The default could just be informative, text-based output like I have already implemented:

$ python -m databricks_cli.cli execution-context command-execute-once --cluster-id <CLUSTER_ID> --command "$(cat spark.py)" --wait=True
Status: Queued
Status: Running
Status: Finished


Command ID: c98edf8a-418a-4fa0-b69c-dcfae9db917e

output > +---------+----------+--------+----------+------+------+
output > |firstname|middlename|lastname|       dob|gender|salary|
output > +---------+----------+--------+----------+------+------+
output > |    James|          |   Smith|1991-04-01|     M|  3000|
output > |  Michael|      Rose|        |2000-05-19|     M|  4000|
output > |   Robert|          |Williams|1978-09-05|     M|  4000|
output > |    Maria|      Anne|   Jones|1967-12-01|     F|  4000|
output > |      Jen|      Mary|   Brown|1980-02-17|     F|    -1|
output > +---------+----------+--------+----------+------+------+

For use in shell scripts we can also just return the last status result with the embedded data as JSON:

$ python -m databricks_cli.cli execution-context command-execute-once --cluster-id <CLUSTER_ID> --command "$(cat spark.py)" --wait=True --output=json

{
  "id": "b976fcff-8a32-4278-a3ff-cb684945e238",
  "status": "Finished",
  "results": {
    "resultType": "text",
    "data": "+---------+----------+--------+----------+------+------+\n|firstname|middlename|lastname|       dob|gender|salary|\n+---------+----------+--------+----------+------+------+\n|    James|          |   Smith|1991-04-01|     M|  3000|\n|  Michael|      Rose|        |2000-05-19|     M|  4000|\n|   Robert|          |Williams|1978-09-05|     M|  4000|\n|    Maria|      Anne|   Jones|1967-12-01|     F|  4000|\n|      Jen|      Mary|   Brown|1980-02-17|     F|    -1|\n+---------+----------+--------+----------+------+------+"
  }
}

For tabular data, as returned from SQL commands, a CSV output would also be nice.
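Until such a mode exists, the boxed ASCII table carried in results.data could be flattened to CSV with a small filter. A naive sketch, assuming the JSON shape from the example above, "|"-delimited cells, "+---+" border rows, and cell values free of commas and pipes:

```shell
# Read the --output=json result on stdin and emit the table as CSV.
json_table_to_csv() {
  python3 -c '
import json, sys
data = json.load(sys.stdin)["results"]["data"]
for line in data.splitlines():
    if line.startswith("+"):
        continue  # skip +---+ border rows
    cells = [c.strip() for c in line.strip("|").split("|")]
    print(",".join(cells))
'
}
```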

fjakobs avatar May 05 '22 11:05 fjakobs