dbx creates a random directory every time a deployment is done
Expected Behavior
When running the CLI command
dbx deploy --environment dev --deployment-file some_deployment.yml
I would expect dbx to give me the option to overwrite the previous wheel, and/or to use my tagging strategy (v0.0.0, v0.0.1) to define the directory for the artifacts. So, for example, the following command should produce the following output:
$ dbfs ls dbfs:/dbx/my_package/ --profile=my-company-profile
dbfs:/dbx/my_package/v0.0.0
dbfs:/dbx/my_package/v0.0.1
dbfs:/dbx/my_package/v0.0.2
Ideally, you could also target a specific version if you want to patch it, for example:
dbx deploy --environment dev --deployment-file some_deployment.yml --overwrite=v0.0.2
This is just an idea, and maybe we can come up with a better one, but the concept is to be able to target a specific directory, maybe with a specific tag. For the development flow I understand that this could generate conflicts (multiple branches doing multiple builds), but for production I really think this is needed.
Current Behavior
However, the actual behaviour is that a random temporary directory is appended to the target directory (in this case dbfs:/dbx/my_package), which generates a new random folder for each deployment I do. Deployments pile up, and I now have (after 2 months of using dbx) 75 deployment folders in dbfs. Example output from the terminal:
$ dbfs ls dbfs:/dbx/my_package/ --profile=my-company-profile
dbfs:/dbx/my_package/005d57e75e5b4482ab06f4432e0e88a3
dbfs:/dbx/my_package/008e46b1e20c41e1a71abef4a1e03eeb
dbfs:/dbx/my_package/00edb25776e449b993147091e9141083
dbfs:/dbx/my_package/01680e91a3d94d8984507101b7c47b54
dbfs:/dbx/my_package/01748315ff8d45aaa329023c5c940cbf
dbfs:/dbx/my_package/01b7a8f8c5a849ba8b11f27db5d6332b
dbfs:/dbx/my_package/0204220629074dea9e93bd4a371dcb2c
dbfs:/dbx/my_package/0205884a8a5d41398da369fe0f36a180
dbfs:/dbx/my_package/0229c6d4e9514fb9842277342e264003
dbfs:/dbx/my_package/024fcd1c0927448696874f1da2059d36
dbfs:/dbx/my_package/0290719a883649a5937e50e56e524684
dbfs:/dbx/my_package/02a19db5f0e4489eba821d37ac3e8f80
...
Steps to Reproduce (for bugs)
The some_deployment.yml file looks like:
build:
  python: "pip"
dev:
  workflows:
    - name: "some_workflow"
      job_clusters:
        - job_cluster_key: some_cluster
          new_cluster:
            <<: *common-cluster-definition
            autoscale:
              min_workers: 2
              max_workers: 4
      tasks:
        - task_key: some_task
          python_wheel_task:
            package_name: my_package
            entry_point: script
          job_cluster_key: some_cluster
          timeout_seconds: 0
The project.json looks like:
{
  "environments": {
    "dev": {
      "profile": "my-company-profile",
      "storage_type": "mlflow",
      "properties": {
        "workspace_directory": "/Shared/dbx/projects/my_package",
        "artifact_location": "dbfs:/dbx/my_package"
      }
    }
  },
  "inplace_jinja_support": false,
  "failsafe_cluster_reuse_with_assets": false,
  "context_based_upload_for_execute": false
}
And run the following command:
dbx deploy --environment dev --deployment-file some_deployment.yml
Context
It is really confusing to have a random folder being created with each deployment. It also means that three workflows pointing to the same wheel can be pointing to the same version of that wheel but to three different builds in three different folders. Moreover, you cannot really clean up the dbfs path without checking which paths your active workflows are pointing to.
Your Environment
- dbx version used: 0.8.7
- Databricks Runtime version: 11.3 LTS
I think the problem may be in the mlflow.log_artifact() function, which is used by MlflowFileUploader, which in turn is instantiated in the deploy() command:
https://github.com/databrickslabs/dbx/blob/dc0dd54be26de53b27c054400ec8c3c442bc8c2c/dbx/commands/deploy.py#L139
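For illustration, a minimal sketch (not dbx's actual code; the tracking setup, experiment path and wheel filename are placeholders) of how mlflow places artifacts under a per-run directory, which is where the random folder name comes from:

import mlflow

# assumes an authenticated Databricks profile; experiment path is a placeholder
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/dbx/projects/my_package")

with mlflow.start_run() as run:
    # artifact_uri looks like dbfs:/dbx/my_package/<random-run-id>/artifacts,
    # so everything logged below ends up under a run-specific directory
    print(run.info.artifact_uri)
    mlflow.log_artifact("dist/my_package-0.0.1-py3-none-any.whl")

Since a new run is started for every deploy, every deployment gets a fresh run id and therefore a fresh folder.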
It is really confusing to have a random folder being created with each deployment.
Agreed. It is probably the most annoying thing about dbx. Being able to override this behavior would be great.
Hi @guillesd,
this is expected behaviour and it's required for proper versioning.
If you don't like that these objects are stored, you can do two things:
- For dbx execute, use --upload-via-context
- For dbx deploy/launch (the normal one, not assets-only), store your packages in a proper package registry, e.g. Azure Artifacts or Nexus. This way you'll only keep the --assets-only deployments in the artifact storage.
Finally, there is a command for cleanup - dbx destroy. Please read its arguments and parameters before launching it, though.
Hi @renardeinside,
Could you explain why this is required for proper versioning? Versioning for us happens in our Git provider; for us, dbx deploy is a way to sync our artifacts to Databricks, and the versioning logic is for us to specify in the CI/CD. Ergo, if I want to tag a deployment with a specific version and control the path of the directory in dbfs where it is deployed (maybe using this version tag), why shouldn't I be able to?
I know that Nexus or Azure Artifacts could be a better option, but if I choose to do this in dbfs out of convenience, I expect to be able to control the behaviour so that in 3 months I don't end up with 100 folders containing wheels.
What I propose is that we bypass the MlflowFileUploader when the target is dbfs, and just use an extension of AbstractFileUploader (see here). This extension would have some logic to pick up versioning tags to include them in the path, and could also overwrite files if specified.
@renardeinside would you consider a PR for this?
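For illustration, a rough sketch of the behaviour being proposed, independent of dbx internals: upload_versioned_wheel is a hypothetical helper (not part of dbx, and not AbstractFileUploader's actual interface), and it assumes the databricks-cli dbfs command is installed and configured.

import subprocess
from pathlib import Path


def upload_versioned_wheel(wheel_path: str, version_tag: str,
                           base_path: str = "dbfs:/dbx/my_package",
                           profile: str = "my-company-profile") -> str:
    # copy the wheel to <base_path>/<version_tag>/<wheel name>, replacing any
    # previous upload for that tag instead of creating a new random folder
    target = f"{base_path}/{version_tag}/{Path(wheel_path).name}"
    subprocess.run(
        ["dbfs", "cp", "--overwrite", "--profile", profile, wheel_path, target],
        check=True,
    )
    return target


# e.g. upload_versioned_wheel("dist/my_package-0.0.2-py3-none-any.whl", "v0.0.2")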
I agree with @guillesd. It would be nice to customize the destination folder, e.g. by using a timestamp instead of a hash.
However, I see that it makes no sense to add all these configuration options to the MlflowFileUploader or to the AbstractFileUploader.
What I suggest is to take advantage of the storage_type field in the .dbx/project.json to allow the selection of a different AbstractFileUploader implementation.
It could be done in two ways:
- by allowing the registration of AbstractFileUploader implementations, so that users need only to set: "storage_type": "my-storage-type"
- by setting the storage_type to the full class name of the AbstractFileUploader implementation: "storage_type": "my_package.MyFileUploader"
The second approach should be simpler than the first to implement and document, I think.
After some searching in the MLflow repository, I found the source of the "random directory name". It is generated in https://github.com/mlflow/mlflow/blob/master/mlflow/store/tracking/file_store.py#L616 from the unique run identifier and can't be customized by external code: it is used for tracking the artifacts of each specific run.
So the MlflowFileUploader probably can't be customized to store the artifacts in a specific dbfs folder or to replace an older one: we need a completely new implementation of AbstractFileUploader.
I 100% agree with @allebacco, we need a new implementation of AbstractFileUploader that offers an alternative to the default MLflow uploader and that potentially exposes more control over where you are deploying to dbfs. For example, instead of a random hash or timestamp, I'd like to pass my own extension of the directory, i.e. dbfs:/project-path/<my_own_tag>/wheel, where <my_own_tag> could be something like the branch name or a release!
I am not sure I understand the whole issue here, but I feel this is more mlflow-related than dbx-related. For mlflow, each "run" generates a separate folder (with the random name, as you know), and it is the mlflow server that tracks the versioning of these runs under "experiments". For dbx, each experiment is associated with a project. That's why, each time you deploy, you get a new run with a new associated wheel.
I see your point of having just one "run" or environment when you do a deploy. However, during the development process you could face the situation of being on a version of your package more advanced than your last vX.X.X tag and making a deploy. mlflow and dbx will let you generate this particular (unversioned) debug wheel and test it on the platform. If you overwrite a run (which you could do using the mlflow Run class) you would lose the versioning @renardeinside is talking about.
I made a custom (much simpler) "dbx"-like package for deployment and faced this issue too. We wanted the package version to match the deployments (or runs, from the mlflow point of view). The only way I found using mlflow was to stop generating new runs and, before tracking/logging any artifact/metric into the experiment, check the list of runs, get their names and match them against my expected name. Then I continue registering metrics and artifacts to that same run without finishing it. However, this is a workaround that stretches the flexibility of the mlflow architecture.
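For illustration, a hedged sketch of that workaround with the mlflow API (experiment path, run name and wheel filename are placeholders; assumes a reasonably recent mlflow): look up an existing run by name and keep logging to it instead of creating a new one.

import mlflow

# assumes an authenticated Databricks profile; experiment path is a placeholder
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/dbx/projects/my_package")
exp = mlflow.get_experiment_by_name("/Shared/dbx/projects/my_package")

run_name = "v0.0.2"
existing = mlflow.search_runs(
    experiment_ids=[exp.experiment_id],
    filter_string=f"tags.\"mlflow.runName\" = '{run_name}'",
    max_results=1,
)
run_id = None if existing.empty else existing.iloc[0]["run_id"]

# reuse the run if it already exists, otherwise create it once with the desired name
with mlflow.start_run(run_id=run_id, run_name=run_name if run_id is None else None):
    mlflow.log_artifact("dist/my_package-0.0.2-py3-none-any.whl")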
One thing that could be useful, on the other hand, is allowing dbx to name the runs properly so you can retrieve the runs by name using the mlflow API. This would help you deploy to the same named environment.