
DAB deployment fails with `Error: cannot create job: NumWorkers could be 0 only for SingleNode clusters`


Describe the issue

When running databricks bundle deploy I get the following error

Potentially sensitive info replaced with <internal ...>

Updating deployment state...
Error: terraform apply: exit status 1

Error: cannot create job: NumWorkers could be 0 only for SingleNode clusters. See https://docs.databricks.com/clusters/single-node.html for more details

  with <internal key>,
  on bundle.tf.json line 87, in resource.databricks_job.<internal key>:
  87:       },
...

The cluster definition at that point is:

{
  "job_cluster_key": "<internal key>",
  "new_cluster": {
    "aws_attributes": {
      "first_on_demand": 1,
      "instance_profile_arn": "<internal arn>"
    },
    "custom_tags": {
      "env": "dev",
      "owner": "datascience",
      "role": "databricks",
      "vertical": "datascience"
    },
    "data_security_mode": "SINGLE_USER",
    "node_type_id": "m6i.2xlarge",
    "num_workers": 0,
    "policy_id": "<internal policy id>",
    "spark_conf": {
      "spark.databricks.cluster.profile": "singleNode",
      "spark.databricks.delta.schema.autoMerge.enabled": "true",
      "spark.databricks.sql.initial.catalog.name": "<internal catalog name>",
      "spark.master": "local[*, 4]"
    },
    "spark_version": "13.2.x-cpu-ml-scala2.12"
  }
}

Steps to reproduce the behavior

  1. Install databricks CLI @ v0.222.0
  2. Run databricks bundle deploy
  3. See error
  4. Downgrade to v0.221.1
  5. Run databricks bundle deploy
  6. See no error

Expected Behavior

Bundle should have deployed successfully

Actual Behavior

Bundle failed to deploy at the Updating deployment state... step

OS and CLI version

Databricks CLI: v0.222.0

OS:

PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

Is this a regression?

Yes. Both v0.221.1 and v0.217.0 work.

Maybe related to https://github.com/databricks/cli/issues/592

m-o-leary avatar Jul 01 '24 10:07 m-o-leary

Thanks for reporting the issue. I'm investigating it.

pietern avatar Jul 01 '24 12:07 pietern

The following cluster definition works:

          new_cluster:
            node_type_id: i3.xlarge
            num_workers: 0
            spark_version: 14.3.x-scala2.12
            spark_conf:
                "spark.databricks.cluster.profile": "singleNode"
                "spark.master": "local[*, 4]"
            custom_tags:
                "ResourceClass": "SingleNode"

Note the presence of "ResourceClass": "SingleNode".

This may be able to get you unblocked while we figure out the underlying cause of this issue.

pietern avatar Jul 01 '24 12:07 pietern

A change in the Terraform provider (PR, released as part of v1.48.0) caused additional validation to run for job clusters. This includes a check for the ResourceClass field under custom_tags and that's why this error shows up if it isn't specified.

You can mitigate by including the following stanza in your job cluster definition:

custom_tags:
    "ResourceClass": "SingleNode"

Meanwhile, we're figuring out if this is something we should include transparently or not.

pietern avatar Jul 01 '24 13:07 pietern

The inclusion of the custom tag has helped resolve our issue - appreciate the support here.

georgealexanderday avatar Jul 01 '24 16:07 georgealexanderday

I am seeing the same issue after running my ci pipeline. My job configuration has the below lines, but still seeing the issue. Can you please help?

spark_conf:
                "spark.databricks.cluster.profile": "singleNode"
                "spark.master": "local[*, 4]"
            custom_tags:
                "ResourceClass": "SingleNode"

drelias15 avatar Jul 01 '24 20:07 drelias15

@drelias15 Are you sure you have the indentation right?

custom_tags needs to be at the same level as spark_conf.
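
For reference, both keys should sit directly under new_cluster, e.g.:

new_cluster:
  spark_conf:
    "spark.databricks.cluster.profile": "singleNode"
    "spark.master": "local[*, 4]"
  custom_tags:
    "ResourceClass": "SingleNode"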

pietern avatar Jul 02 '24 07:07 pietern

I believe the indentation is good. This snapshot is from the YAML of the deployed job. One point: spark.databricks.cluster.profile is defined in the cluster policy, so it is not defined in the workflows. Do you think it needs to be present in the workflow definition? [screenshot]

drelias15 avatar Jul 02 '24 17:07 drelias15

Hello, I've got the same issue and tried adding custom_tags with the key/value "ResourceClass": "SingleNode", and nothing changed.

> databricks -v
Databricks CLI v0.222.0

Here's my bundle cluster conf after running databricks bundle validate -o json:

"job_clusters": [
{
  "job_cluster_key": "job_cluster_single_node",
  "new_cluster": {
    "custom_tags": {
      "ResourceClass": "SingleNode"
    },
    "data_security_mode": "SINGLE_USER",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 0,
    "policy_id": "xxxxxxxxxxxxxxx",
    "spark_version": "13.3.x-cpu-ml-scala2.12"
  }
}
]

Thank you for the help!

Philippe-Neveux avatar Jul 04 '24 14:07 Philippe-Neveux

@drelias15 @Philippe-Neveux it's very likely that this is caused by (part of) spark_conf being defined in the policy rather than explicitly. What if you specify apply_policy_default_values: true in your new_cluster configuration? Does that help?
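
For example (node type and Spark version taken from the validate output above; substitute your own policy ID):

new_cluster:
  policy_id: <your policy id>
  apply_policy_default_values: true   # use the policy's default values for unset fields
  data_security_mode: SINGLE_USER
  node_type_id: Standard_DS3_v2
  num_workers: 0
  spark_version: 13.3.x-cpu-ml-scala2.12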

andrewnester avatar Jul 05 '24 09:07 andrewnester

Hi @andrewnester, thanks for the help! Unfortunately, it doesn't help... I'm still getting the same error. Here's the output of databricks bundle validate -o json:

"job_clusters": [
  {
    "job_cluster_key": "job_cluster_single_node",
    "new_cluster": {
      "apply_policy_default_values": true,
      "data_security_mode": "SINGLE_USER",
      "node_type_id": "Standard_DS3_v2",
      "num_workers": 0,
      "policy_id": "xxxxxxxxxxxxxxx",
      "spark_version": "13.3.x-cpu-ml-scala2.12"
    }
  }
]

Another thing: when I choose a multi-node cluster, it works fine. Here's the databricks bundle validate -o json output in the case of a multi-node cluster:

"job_clusters": [
  {
    "job_cluster_key": "job_cluster_single_node",
    "new_cluster": {
      "autoscale": {
        "max_workers": 5,
        "min_workers": 1
      },
      "data_security_mode": "SINGLE_USER",
      "node_type_id": "Standard_DS3_v2",
      "policy_id": "yyyyyyyyyyyyy",
      "spark_version": "13.3.x-cpu-ml-scala2.12"
    }
  }
]

Philippe-Neveux avatar Jul 05 '24 12:07 Philippe-Neveux

@Philippe-Neveux just to be clear, when you say it doesn't work, you mean you receive this error, correct?

NumWorkers could be 0 only for SingleNode clusters.

andrewnester avatar Jul 05 '24 12:07 andrewnester

Exactly,

[screenshot of the error]

Philippe-Neveux avatar Jul 05 '24 13:07 Philippe-Neveux

Adding this to my job cluster specification solved it. It should be added to the spark configuration. [screenshot]

See also the documentation of the Terraform provider. [screenshot]

rebot avatar Jul 10 '24 10:07 rebot

@Philippe-Neveux @andrewnester I have exactly the same issue. We use policies, so I'll put our policy example below. The deployment works with v0.221.1 and does not work with later versions; tested with the current latest, v0.225.0.

{ "instance_pool_id": { "type": "fixed", "value": "..." }, "data_security_mode": { "type": "fixed", "value": "SINGLE_USER" }, "spark_version": { "type": "fixed", "value": "14.3.x-scala2.12", "hidden": true }, "spark_conf.spark.master": { "type": "fixed", "value": "local[*, 4]" }, "spark_conf.spark.databricks.cluster.profile": { "type": "fixed", "value": "singleNode", "hidden": true }, "custom_tags.ResourceClass": { "type": "fixed", "value": "SingleNode" } }

otrofimov avatar Aug 08 '24 01:08 otrofimov

@andrewnester Can I help in resolving this issue? We would like to use the latest CLI version for our pipelines.

otrofimov avatar Aug 23 '24 14:08 otrofimov

@otrofimov as @pietern pointed out above, the issue is coming from the TF provider we use underneath, which added additional validation: https://github.com/databricks/cli/issues/1546#issuecomment-2200214738

You can unblock yourself by explicitly defining spark_conf and custom_tags in your DABs configuration instead of policy.
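
Concretely, for the policy shared above, that means carrying the single-node settings into the bundle configuration, roughly (cluster key and policy ID are placeholders):

job_clusters:
  - job_cluster_key: single_node
    new_cluster:
      policy_id: <your single-node policy id>
      apply_policy_default_values: true
      spark_version: 14.3.x-scala2.12
      num_workers: 0
      spark_conf:
        "spark.databricks.cluster.profile": "singleNode"
        "spark.master": "local[*, 4]"
      custom_tags:
        "ResourceClass": "SingleNode"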

If you're willing to take a stab at fixing this in the TF provider, we're open to external contributions :) Here's where the change was initially made: https://github.com/databricks/terraform-provider-databricks/pull/3651

andrewnester avatar Aug 26 '24 10:08 andrewnester

I have a related error, not exactly the one discussed here.

We are using cluster policies for our job clusters created via Databricks Asset Bundles. In the DAB we set the cluster policy ID and apply_policy_default_values: true, but when the DAB is deployed, it fails because num_workers is not set, even though the policy fixes it to 4. Shouldn't the deployment use the value set in the policy instead of requiring it in the DAB, like it does for other values?

"data_security_mode": {
    "defaultValue": "USER_ISOLATION",
    "hidden": false,
    "type": "allowlist",
    "values": [
      "USER_ISOLATION",
      "SINGLE_USER"
    ]
},
"node_type_id": {
    "hidden": false,
    "type": "fixed",
    "value": "Standard_D8ads_v5"
}, 
"num_workers": {
    "hidden": true,
    "type": "fixed",
    "value": 4
}

In order to deploy the DAB successfully, I still need to set node_type_id, data_security_mode & num_workers in the DAB as follows, even though these settings are defined in the cluster policy.

job_clusters:
  - job_cluster_key: datamart_cluster
    new_cluster:
      policy_id: ${var.cluster_policy_id}
      apply_policy_default_values: true
      node_type_id: ""
      num_workers: 4
      data_security_mode: ""

This is a problem because we have approximately 100 DABs that use the same cluster policy, so to change any settings we need to update all of them. Our ideal experience for updating cluster settings is that we modify the cluster policy and the cluster picks up those settings automatically from the policy when it starts.

Please let me know if you would like me to start a new issue if this is very different from the issue discussed here. Thanks!

kaysonline avatar Oct 01 '24 17:10 kaysonline

@kaysonline just to confirm, which CLI version do you use?

andrewnester avatar Oct 01 '24 18:10 andrewnester

@andrewnester I am using Databricks CLI v0.228.1.

Following are some of the errors that we get:

  • When spark_version is set in the cluster policy and not set in job cluster definition: The argument "spark_version" is required, but no definition was found.
  • When num_workers is set in the cluster policy and not set in job cluster definition: Error: cannot update job: NumWorkers could be 0 only for SingleNode clusters. See https://docs.databricks.com/clusters/single-node.html for more details

kaysonline avatar Oct 02 '24 19:10 kaysonline

@kaysonline My understanding is that the checks for spark_version and single-node clusters run ONLY against your current job configuration. The combined job configuration that results from applying policies is not checked, and this is the main problem here. All these checks are implemented on the terraform-provider-databricks side: https://github.com/databricks/terraform-provider-databricks/pull/3651. I want to spend some time figuring out how it works and how we can approach this. For now we're just keeping some values hardcoded in our bundle configs, like in the example below:

single_node_cluster: &single_node_cluster policy_id: ${var.single_node_policy} apply_policy_default_values: true spark_version: ${var.lts_15_4_spark_version} custom_tags: "ResourceClass": "SingleNode" spark_conf: "spark.databricks.cluster.profile": "singleNode" "spark.master": "local[*, 4]"

otrofimov avatar Oct 02 '24 19:10 otrofimov

@kaysonline exactly as @otrofimov pointed out, the issue is on the TF provider side and we plan to prioritise it higher as people keep running into it.

andrewnester avatar Oct 03 '24 12:10 andrewnester

Thank you @otrofimov & @andrewnester! Looking forward to the TF provider updates!

kaysonline avatar Oct 14 '24 16:10 kaysonline

Hi all, https://github.com/databricks/terraform-provider-databricks/pull/4168 adds a more prescriptive error message that clearly outlines what the user needs to do to make single-node clusters work if they run into the original error in the issue:

Error: cannot create job: NumWorkers could be 0 only for SingleNode clusters

Roadmap for relevant clusters API changes

Meanwhile, the clusters team at Databricks is also actively working on improving the API to create single-node clusters. We plan to make it easier by adding an is_single_node field to the API so that you do not have to specify custom_tags and spark_conf to create a single-node cluster. This is likely to roll out in early 2025.
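
Once that field is available, declaring a single-node job cluster should reduce to something like the following sketch (field name taken from the roadmap above; the exact shape may change before release):

new_cluster:
  spark_version: 14.3.x-scala2.12
  node_type_id: i3.xlarge
  is_single_node: true    # planned field; removes the need for custom_tags and spark_conf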

Why can't we downgrade the error to a warning? Can we unblock cluster policies for single node clusters?

Patching the TF provider to support defining cluster configuration specific to a single-node cluster in a cluster policy is blocked by the clusters resource in the Databricks TF Provider still using SDK v2 of the Terraform Plugin SDK. Our current setup does not allow us to downgrade the instructions detailing how to correctly set up a single-node cluster from an error to a warning. Thus it's blocked until we migrate the clusters resource in the Databricks TF Provider to the Terraform Plugin Framework. It's not clear at this point when that migration will be prioritized on our roadmap.

Why are we not removing the validation all together that custom_tags and spark_conf is set when num_workers is 0?

Because setting num_workers to 0 without appropriate custom_tags and spark_conf values will just create a driver process and make the cluster non-functional for most non-trivial workloads.

Closing remarks

I'll be closing this issue since the original issue reported has been addressed in https://github.com/databricks/terraform-provider-databricks/pull/4168. Please feel free to open a separate issue for problems that have not been addressed here.

shreyas-goenka avatar Nov 07 '24 17:11 shreyas-goenka

Update: We have removed all client-side validation for single-node clusters and added a warning instead. All use cases described in this issue where users run into client-side limitations like policy_ids not working for single node clusters should be resolved now.

Please let us know if that's not the case! This should be live in the next release of the CLI (v0.236) which is yet to be released.

shreyas-goenka avatar Nov 25 '24 12:11 shreyas-goenka