dbt-databricks

Python model serverless workflow doesn't accept environment libraries

Open dustinvannoy-db opened this issue 9 months ago • 9 comments

Describe the bug

I am testing Python models submitted to serverless job compute. I tried two submission methods: workflow_job and serverless_cluster. I need to add a library dependency but cannot get it to work with workflow_job.

Steps To Reproduce

Model code that attempts to use a library that needs to be installed from PyPI while running on serverless job compute.

from faker import Faker

def model(dbt, session):
    dbt.config(
        submission_method='workflow_job',
        environment_key="my_env",
        environment_dependencies=["faker==37.0.2"])

    my_sql_model_df = dbt.ref("CustomerIncremental")

    fake = Faker()
    print(fake.name())

    final_df = my_sql_model_df.selectExpr("*").limit(100)

    return final_df

Response:

Runtime Error in model DimCustomer3 (Databricks/models/main/python/DimCustomer3.py)
  Python model failed with traceback as:
  (Note that the line number here does not match the line number in your code due to dbt templating)
  ModuleNotFoundError: No module named 'faker'

Expected behavior

If I use this code with the serverless_cluster method, it works. It should work the same way for workflow_job.

from faker import Faker

def model(dbt, session):
    dbt.config(
        submission_method='serverless_cluster',
        environment_key="my_env",
        environment_dependencies=["faker==37.0.2"])

    my_sql_model_df = dbt.ref("CustomerIncremental")

    fake = Faker()
    print(fake.name())

    final_df = my_sql_model_df.selectExpr("*").limit(100)

    return final_df

Response:

04:18:26  Finished running 1 table model in 0 hours 1 minutes and 55.52 seconds (115.52s).
04:18:26
04:18:26  Completed successfully
04:18:26
04:18:26  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

Screenshots and log output


serverless_cluster working environment: Image

workflow_job no environment shown: Image

System information

The output of dbt --version:

Core:
  - installed: 1.9.3
  - latest:    1.9.4 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - databricks: 1.9.7 - Update available!
  - spark:      1.8.0 - Update available!

The operating system you're using: MacOS Sequoia 15.4

The output of python --version: Python 3.10.17


dustinvannoy-db avatar May 02 '25 04:05 dustinvannoy-db

Hey @dustinvannoy-db - the workflow_job mostly just wraps the REST API. Does it work if you use something like:

dbt.config(
  environments=[
    {
      "environment_key": "my_env",
      "spec": {
        "client": "1",
        "dependencies": [
          "faker",
        ]
      }
    }
  ],
  additional_task_settings={"environment_key": "my_env"}
)

Although that wouldn't help with the two submission methods being consistent.

kdazzle avatar Sep 15 '25 21:09 kdazzle

Hello,

I am trying to run a dbt Python model on a serverless cluster, but the job run always uses environment version 1 and does not install the libraries I specify. I also tried @kdazzle's suggestion, but it still runs on environment version 1 and my libraries are not installed. Here is my latest attempt:

dbt.config(
        materialized='table',
        submission_method='serverless_cluster',
        environments=[
            {
                "environment_key": "my_env",
                "spec": {
                    "environment_version": "4",
                    "dependencies": [
                        "lightgbm==4.6.0",
                        "mlflow>=2.14.0",
                    ]
                }
            }
        ],
        additional_task_settings={ "environment_key": "my_env" } 
        
        # environment_key="my_env",
        # environment_dependencies=["lightgbm==4.6.0", "mlflow>=2.14.0"]
    )

I have also tried specifying the config in a YAML file for the Python dbt model, like this:

version: 2
models:
  - name: silver_connection_errand_categorization_inference
    config:
      materialized: table
      submission_method: serverless_cluster
      packages:
        - "lightgbm==4.6.0"
        - "mlflow>=2.14.0"

But this gives me the error

Error creating python run.
   b'{"error_code":"INVALID_PARAMETER_VALUE","message":"Libraries field is not supported for serverless task, please specify libraries in environment.","details":[{"@type":"type.googleapis.com/google.rpc.RequestInfo","request_id":"<some-id>","serving_data":""}]}'

How do you run a dbt Python model in Databricks on a serverless cluster, with libraries that are not installed by default in the serverless environment?

ikhudur avatar Sep 24 '25 14:09 ikhudur

Hi @ikhudur - yeah, unfortunately the APIs for the different submission methods aren't the same. Try changing the submission method to workflow_job for that first version you posted. And I haven't used python config in a bit, but the yaml would look something like the below (docs here).

Using the workflow job creates a workflow instead of a one-time run, and also mirrors the Databricks API. I'm pretty sure that leaving out the cluster information defaults to serverless, since that's how the API works as well.

models:
  - name: my_model
    config:
      submission_method: workflow_job
      python_job_config:
        additional_task_settings: { "environment_key": "my_env" }
        environments: [
            {
                "environment_key": "my_env",
                "spec": {
                    "environment_version": "4",
                    "dependencies": [
                        "lightgbm==4.6.0",
                        "mlflow>=2.14.0",
                    ]
                }
            }
        ]

kdazzle avatar Sep 24 '25 15:09 kdazzle

Hello @kdazzle,

Ah, thank you, I will try it out!

Regarding the environment_key, I am wondering if this is something that is created specifically for the workflow/job run (as defined under the environments field)? Or does this correspond to a base environment you can create in Databricks here:

Image

ikhudur avatar Sep 25 '25 10:09 ikhudur

Hi @ikhudur - the workflow_job submission method was created to get around some of the sandboxing that dbt forced us into with other submission methods, so we could take more advantage of Databricks features. So basically, most attributes, like the environments list, get passed straight through to the Databricks jobs API - https://docs.databricks.com/api/workspace/jobs/create
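
To make that concrete, the request body that ends up at jobs/create would look roughly like the sketch below. This is illustrative only, not the exact payload the adapter builds - the job name, task key, and notebook path are placeholders; the environments list and the task-level environment_key are the parts that matter here:

# Rough shape of the jobs/create request body (illustrative sketch;
# names and paths are placeholders, not what dbt-databricks actually generates)
payload = {
    "name": "dbt__my_model",  # placeholder job name
    "environments": [
        {
            "environment_key": "my_env",
            "spec": {
                "client": "1",
                "dependencies": ["faker==37.0.2"],
            },
        }
    ],
    "tasks": [
        {
            "task_key": "inner_notebook",  # placeholder task key
            # this is what additional_task_settings={"environment_key": "my_env"} sets
            "environment_key": "my_env",
            "notebook_task": {
                "notebook_path": "/Shared/dbt_python_models/my_model"  # placeholder path
            },
        }
    ],
}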

kdazzle avatar Sep 26 '25 15:09 kdazzle

Hello again @kdazzle,

I tried the following config:

version: 2
models:
  - name: my_model
    config:
      materialized: table
      submission_method: workflow_job
      python_job_config:
        additional_task_settings: { "environment_key": "my_env" }
        environments: [
            {
                "environment_key": "my_env",
                "spec": {
                    "environment_version": "4",
                    "dependencies": [
                        "xgboost==3.0.5",
                        "mlflow>=2.14.0",
                    ]
                }
            }
        ]

But it failed with the following error:

Error creating Workflow.
   b'{"error_code":"INVALID_PARAMETER_VALUE","message":"A task environment can not be provided for notebook task inner_notebook. Please use the %pip magic command to install notebook-scoped Python libraries and Python wheel packages","details":[{"@type":"type.googleapis.com/google.rpc.RequestInfo","request_id":"<request-id>","serving_data":""}]}'

ikhudur avatar Sep 29 '25 14:09 ikhudur

Ah, interesting @ikhudur. Sounds like the environment is getting through to the job like we were hoping, but there might be some issue installing the environment libraries onto a serverless cluster. I'm not super familiar with environments, but I would guess they might not be fully supported on serverless?

Have you tried pip installing them as suggested? You probably already know that dbt python jobs are just uploaded as notebooks with a couple of functions appended at the end, but some people don't. So you can just add a cell at the top of your python job like:

# COMMAND ----------

# MAGIC %pip install xgboost mlflow

# COMMAND ----------

<the rest of your python>
import mlflow


def model(dbt, session):

    my_sql_model_df = dbt.ref("my_sql_model")

    final_df = ...  # stuff you can't write in SQL!

    return final_df

I forget the exact syntax of the magic pip command, but it's something like that.

kdazzle avatar Sep 29 '25 21:09 kdazzle

Hello again @kdazzle ,

Thank you for the suggestion! I did see that the dbt python model was an inner notebook, but did not think of using the magic pip command.

I tried it now like this:

# COMMAND ----------

# MAGIC %pip install xgboost==3.0.5, mlflow>=2.14.0
# MAGIC %restart_python

# COMMAND ----------

I added %restart_python because Databricks prompts you to run it after pip install to make sure the packages are picked up.

However, this still does not work as I get the error:

ModuleNotFoundError: No module named 'mlflow'

I have tried both with and without %restart_python.

Also, the environment version of the serverless cluster is always 1 (I cannot change it to 4).

ikhudur avatar Sep 30 '25 08:09 ikhudur

I managed to get it working.

Installing dependencies (works)

As @kdazzle mentioned, the dbt python jobs are uploaded as notebooks, so installing the libraries can be done using magic commands in the .py file. If you look at the Databricks job run, the cell that runs the pip install commands ends with the following message:

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.

I thought %restart_python and dbutils.library.restartPython() are the same command, but they are not...

To make sure the packages are installed, you have to run dbutils.library.restartPython() after the pip install commands, not %restart_python. So the beginning of the dbt python model would look like this:

# COMMAND ----------

# MAGIC %pip install xgboost==3.0.5 mlflow==3.4.0

# COMMAND ----------

# MAGIC dbutils.library.restartPython()

# COMMAND ----------

from typing import Any

import mlflow
import numpy as np
import pandas as pd
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F
def model(dbt: Any, session: SparkSession) -> DataFrame:

    <rest_of_your_python_dbt_model_code_here>

Submission method

Now that I managed to install the packages, I wanted to change the environment version of the serverless cluster. I first tried doing it for submission_method: serverless_cluster, but that did not work.

I then did it successfully for submission_method: workflow_job

Here is the config for my dbt Python model, which runs on serverless compute in a Databricks workflow:

version: 2
models:
  - name: <my_model>
    config:
      materialized: table
      submission_method: workflow_job
      python_job_config:
        name: <my_workflow_name>
        environments:
          - environment_key: my_key # Not sure what this is used for, but I kept it as 'my_key'
            spec:
              environment_version: 4

An odd thing is that the serverless environment version is set to 3 in the workflow job run, and not 4 as I specify it in the config.

Any ideas @kdazzle why that happens?

Edit: Using submission_method: workflow_job creates a new workflow unless a user/service principal/group already has access to the workflow. And only the owner of the workflow can start a job run in it; otherwise you get the error

b'{"error_code":"PERMISSION_DENIED","message":"<user-or-service-principal-id> does not have Manage permissions on Jobs: only workspace admins can change the owner of a job. Please contact the owner or an administrator for access."

Am I missing something? Why does the user/service principal who runs the dbt Python model have to be the owner of it?

ikhudur avatar Oct 02 '25 11:10 ikhudur