azureml.core.Run.log_*() logs are not working in child jobs
Hi everyone,
I am trying to build an AML pipeline for object detection/instance segmentation, where the last component is used for training and model evaluation.
The pipeline is defined via the YAML format/schema (see below) and is run with az ml job create --file pipeline.yaml:
- The pipeline itself is defined as:
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
- The pipeline components (Get Data, Train) are defined as:
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
type: command
I want to highlight/visualize a lot of metrics in the Metrics tab of the component: time-series metrics (loss, F1, etc.), X/Y graphs, a confusion matrix, and so on. As the MLflow API only supports time-series-like metric logging (logging a single metric value per iteration/epoch; see the sketch after the list below), I try to use the azureml.core.Run.log_* interface for the more advanced metrics. The problem is that these logs only end up in the Outputs + logs tab as JSON files and not as metrics/graphs in the Metrics tab, if they are logged at all. Here are the problematic metric logs:
- azureml.core.Run.log_table(): This is not logged at all, neither into the Outputs + logs tab nor into the Metrics tab.
- azureml.core.Run.log_accuracy_table(): This is logged only into the Outputs + logs tab as a JSON file:
{"schema_type": "accuracy_table", "schema_version": "1.0.1", "data": {"probability_tables": [[[82, 118, 0, 0], [75, 31, 87, 7], [66, 9, 109, 16], [46, 2, 116, 36], [0, 0, 118, 82]], [[60, 140, 0, 0], [56, 20, 120, 4], [47, 4, 136, 13], [28, 0, 140, 32], [0, 0, 140, 60]], [[58, 142, 0, 0], [53, 29, 113, 5], [40, 10, 132, 18], [24, 1, 141, 34], [0, 0, 142, 58]]], "percentile_tables": [[[82, 118, 0, 0], [82, 67, 51, 0], [75, 26, 92, 7], [48, 3, 115, 34], [3, 0, 118, 79]], [[60, 140, 0, 0], [60, 89, 51, 0], [60, 41, 99, 0], [46, 5, 135, 14], [3, 0, 140, 57]], [[58, 142, 0, 0], [56, 93, 49, 2], [54, 47, 95, 4], [41, 10, 132, 17], [3, 0, 142, 55]]], "probability_thresholds": [0.0, 0.25, 0.5, 0.75, 1.0], "percentile_thresholds": [0.0, 0.01, 0.24, 0.98, 1.0], "class_labels": ["class1", "class2", "class3"]}}
- azureml.core.Run.log_confusion_matrix(): This is logged only into the Outputs + logs tab as a JSON file:
{"schema_type": "confusion_matrix", "schema_version": "1.0.0", "data": {"class_labels": ["class1", "class2", "class3", "class4"], "matrix": [[4, 0, 1, 9], [0, 0, 0, 1], [6, 0, 5, 0], [0, 0, 0, 1]]}}
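For reference, the time-series-style logging that MLflow does support looks like this; a minimal sketch (metric names and values are illustrative, not from my actual training code):
import mlflow

for epoch in range(3):
    # One scalar value per step; these show up as time-series charts in the Metrics tab
    mlflow.log_metric("loss", 1.0 / (epoch + 1), step=epoch)
    mlflow.log_metric("f1", 0.5 + 0.1 * epoch, step=epoch)
As far as I know, an AML job configures the MLflow tracking context automatically, so no explicit mlflow.start_run() is needed here.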
The code used for the problematic log_* calls is as follows:
from azureml.core import Run
...
# Get the current (child) run context; fail instead of falling back to an offline run
run = Run.get_context(allow_offline=False)
run.log_table("Y over X", {"x": [1, 2, 3], "y": [0.6, 0.7, 0.89]})
run.log_confusion_matrix(
    name="Confusion matrix",
    value={
        "schema_type": "confusion_matrix",
        "schema_version": "1.0.0",
        "data": {
            "class_labels": ["class1", "class2", "class3", "class4"],
            "matrix": [
                [4, 0, 1, 9],
                [0, 0, 0, 1],
                [6, 0, 5, 0],
                [0, 0, 0, 1]
            ]
        }
    }
)
run.log_accuracy_table(
    name="Accuracy Table",
    value={
        "schema_type": "accuracy_table",
        "schema_version": "1.0.1",
        "data": {
            "probability_tables": [
                [
                    [82, 118, 0, 0],
                    [75, 31, 87, 7],
                    [66, 9, 109, 16],
                    [46, 2, 116, 36],
                    [0, 0, 118, 82]
                ],
                [
                    [60, 140, 0, 0],
                    [56, 20, 120, 4],
                    [47, 4, 136, 13],
                    [28, 0, 140, 32],
                    [0, 0, 140, 60]
                ],
                [
                    [58, 142, 0, 0],
                    [53, 29, 113, 5],
                    [40, 10, 132, 18],
                    [24, 1, 141, 34],
                    [0, 0, 142, 58]
                ]
            ],
            "percentile_tables": [
                [
                    [82, 118, 0, 0],
                    [82, 67, 51, 0],
                    [75, 26, 92, 7],
                    [48, 3, 115, 34],
                    [3, 0, 118, 79]
                ],
                [
                    [60, 140, 0, 0],
                    [60, 89, 51, 0],
                    [60, 41, 99, 0],
                    [46, 5, 135, 14],
                    [3, 0, 140, 57]
                ],
                [
                    [58, 142, 0, 0],
                    [56, 93, 49, 2],
                    [54, 47, 95, 4],
                    [41, 10, 132, 17],
                    [3, 0, 142, 55]
                ]
            ],
            "probability_thresholds": [0.0, 0.25, 0.5, 0.75, 1.0],
            "percentile_thresholds": [0.0, 0.01, 0.24, 0.98, 1.0],
            "class_labels": ["class1", "class2", "class3"]
        }
    },
    description="Some description."
)
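For completeness, the metrics can also be inspected programmatically after the pipeline finishes; a minimal sketch, assuming a workspace config file is present (the run ID is a placeholder):
from azureml.core import Workspace

ws = Workspace.from_config()
# Placeholder: the run ID of the Train component's child run
child_run = ws.get_run("<child-run-id>")
# Should list the log_* values if they were actually registered as metrics
print(child_run.get_metrics())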
Here are some screenshots of the Azure ML dashboard.
- The first screenshot shows that run.log_accuracy_table() and run.log_confusion_matrix() are logged as JSON file artifacts but run.log_table() is not.
- The second screenshot shows that none of the run.log_*() metrics are visualized in the Metrics tab.
IMPORTANT
If I run a simple Python script as a standalone job (so no pipeline definition etc.), the run.log_accuracy_table(), run.log_confusion_matrix() and run.log_table() metrics are all logged properly.
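One workaround I am considering, though I have not verified that it is the intended pattern: a child run exposes the parent pipeline run via Run.parent, so the rich metrics could be attached to the parent run and viewed on its Metrics tab instead. A sketch:
from azureml.core import Run

# Untested sketch: log rich metrics to the parent pipeline run, since the
# child (component) run does not render them in its own Metrics tab
run = Run.get_context(allow_offline=False)
target = run.parent if run.parent is not None else run
target.log_table("Y over X", {"x": [1, 2, 3], "y": [0.6, 0.7, 0.89]})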
Is this behaviour just a bug related to child jobs?