azureml.core.Run.log_*() logs are not working in child jobs
Hi everyone,
I am trying to build an AML pipeline for object detection/instance segmentation, where the last component is used for training and model evaluation.
The pipeline is defined via the YAML format/schema (see below) and is run with az ml job create --file pipeline.yaml:
- The pipeline itself is defined as:
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
- The pipeline components (Get Data, Train) are defined as:
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
type: command
I want to highlight/visualize a lot of metrics in the Metrics tab of the component: time-series metrics (loss, F1, etc.), X/Y graphs, a confusion matrix, and so on. As the MLflow API only supports time-series-like metric logging (logging a single metric value per iteration/epoch; see the sketch after the list below), I try to use the azureml.core.Run.log_* interface for the more advanced metrics. The problem is that these logs only end up in the Outputs + logs tab as JSON files and not as metrics/graphs in the Metrics tab, if they are logged at all. Here are the problematic metric logs:
- azureml.core.Run.log_table(): This is not logged at all, neither into the Outputs + logs tab nor into the Metrics tab.
- azureml.core.Run.log_accuracy_table(): This is logged only into the Outputs + logs tab as a JSON file:
{"schema_type": "accuracy_table", "schema_version": "1.0.1", "data": {"probability_tables": [[[82, 118, 0, 0], [75, 31, 87, 7], [66, 9, 109, 16], [46, 2, 116, 36], [0, 0, 118, 82]], [[60, 140, 0, 0], [56, 20, 120, 4], [47, 4, 136, 13], [28, 0, 140, 32], [0, 0, 140, 60]], [[58, 142, 0, 0], [53, 29, 113, 5], [40, 10, 132, 18], [24, 1, 141, 34], [0, 0, 142, 58]]], "percentile_tables": [[[82, 118, 0, 0], [82, 67, 51, 0], [75, 26, 92, 7], [48, 3, 115, 34], [3, 0, 118, 79]], [[60, 140, 0, 0], [60, 89, 51, 0], [60, 41, 99, 0], [46, 5, 135, 14], [3, 0, 140, 57]], [[58, 142, 0, 0], [56, 93, 49, 2], [54, 47, 95, 4], [41, 10, 132, 17], [3, 0, 142, 55]]], "probability_thresholds": [0.0, 0.25, 0.5, 0.75, 1.0], "percentile_thresholds": [0.0, 0.01, 0.24, 0.98, 1.0], "class_labels": ["class1", "class2", "class3"]}}
- azureml.core.Run.log_confusion_matrix(): This is logged only into the Outputs + logs tab as a JSON file:
{"schema_type": "confusion_matrix", "schema_version": "1.0.0", "data": {"class_labels": ["class1", "class2", "class3", "class4"], "matrix": [[4, 0, 1, 9], [0, 0, 0, 1], [6, 0, 5, 0], [0, 0, 0, 1]]}}
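For reference, the time-series-style logging that MLflow does support looks like this; a minimal sketch (metric names and values are illustrative, not from my actual training code):
import mlflow

for epoch in range(3):
    # One scalar value per step; these show up as time-series charts in the Metrics tab
    mlflow.log_metric("loss", 1.0 / (epoch + 1), step=epoch)
    mlflow.log_metric("f1", 0.5 + 0.1 * epoch, step=epoch)
As far as I know, an AML job configures the MLflow tracking context automatically, so no explicit mlflow.start_run() is needed here.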
The code used for the problematic log_* calls is as follows:
from azureml.core import Run
...
# Get the current (child) run context; fail instead of falling back to an offline run
run = Run.get_context(allow_offline=False)
run.log_table("Y over X", {"x": [1, 2, 3], "y": [0.6, 0.7, 0.89]})
run.log_confusion_matrix(
    name="Confusion matrix",
    value={
        "schema_type": "confusion_matrix",
        "schema_version": "1.0.0",
        "data": {
            "class_labels": ["class1", "class2", "class3", "class4"],
            "matrix": [
                [4, 0, 1, 9],
                [0, 0, 0, 1],
                [6, 0, 5, 0],
                [0, 0, 0, 1]
            ]
        }
    }
)
run.log_accuracy_table(
    name="Accuracy Table",
    value={
        "schema_type": "accuracy_table",
        "schema_version": "1.0.1",
        "data": {
            "probability_tables": [
                [
                    [82, 118, 0, 0],
                    [75, 31, 87, 7],
                    [66, 9, 109, 16],
                    [46, 2, 116, 36],
                    [0, 0, 118, 82]
                ],
                [
                    [60, 140, 0, 0],
                    [56, 20, 120, 4],
                    [47, 4, 136, 13],
                    [28, 0, 140, 32],
                    [0, 0, 140, 60]
                ],
                [
                    [58, 142, 0, 0],
                    [53, 29, 113, 5],
                    [40, 10, 132, 18],
                    [24, 1, 141, 34],
                    [0, 0, 142, 58]
                ]
            ],
            "percentile_tables": [
                [
                    [82, 118, 0, 0],
                    [82, 67, 51, 0],
                    [75, 26, 92, 7],
                    [48, 3, 115, 34],
                    [3, 0, 118, 79]
                ],
                [
                    [60, 140, 0, 0],
                    [60, 89, 51, 0],
                    [60, 41, 99, 0],
                    [46, 5, 135, 14],
                    [3, 0, 140, 57]
                ],
                [
                    [58, 142, 0, 0],
                    [56, 93, 49, 2],
                    [54, 47, 95, 4],
                    [41, 10, 132, 17],
                    [3, 0, 142, 55]
                ]
            ],
            "probability_thresholds": [0.0, 0.25, 0.5, 0.75, 1.0],
            "percentile_thresholds": [0.0, 0.01, 0.24, 0.98, 1.0],
            "class_labels": ["class1", "class2", "class3"]
        }
    },
    description="Some description."
)
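For completeness, the metrics can also be inspected programmatically after the pipeline finishes; a minimal sketch, assuming a workspace config file is present (the run ID is a placeholder):
from azureml.core import Workspace

ws = Workspace.from_config()
# Placeholder: the run ID of the Train component's child run
child_run = ws.get_run("<child-run-id>")
# Should list the log_* values if they were actually registered as metrics
print(child_run.get_metrics())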
Here are some screenshots of the Azure ML dashboard.
- The first screenshot shows that run.log_accuracy_table() and run.log_confusion_matrix() are logged as JSON file artifacts but run.log_table() is not.
- The second screenshot shows that none of the run.log_*() metrics are visualized in the Metrics tab.
IMPORTANT
If I run a simple Python script as a standalone job (so no pipeline definition etc.), the run.log_accuracy_table(), run.log_confusion_matrix() and run.log_table() metrics are all logged properly.
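One workaround I am considering, though I have not verified that it is the intended pattern: a child run exposes the parent pipeline run via Run.parent, so the rich metrics could be attached to the parent run and viewed on its Metrics tab instead. A sketch:
from azureml.core import Run

# Untested sketch: log rich metrics to the parent pipeline run, since the
# child (component) run does not render them in its own Metrics tab
run = Run.get_context(allow_offline=False)
target = run.parent if run.parent is not None else run
target.log_table("Y over X", {"x": [1, 2, 3], "y": [0.6, 0.7, 0.89]})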
Is this behaviour just a bug related to child jobs?