
Training Pipeline fails after uploading model artifacts to Google Cloud Storage

Open sangstar opened this issue 4 years ago • 3 comments

Environment:

KFP version 1.8.9

Steps to Reproduce

Here's a snippet of my training code:

    # Imports assumed by this snippet (they are not part of the original excerpt);
    # build_model, upload_from_directory, and upload_blob are user-defined helpers defined elsewhere.
    import traceback

    import tensorflow
    from sklearn.model_selection import GridSearchCV
    # KerasClassifier may come from scikeras or the older tensorflow.keras.wrappers.scikit_learn
    from scikeras.wrappers import KerasClassifier

    param_grid = {
        "max_tokens": [100],
        "max_len": [10],
        "dropout": [0.1],
    }
    gs_model = GridSearchCV(KerasClassifier(build_model), param_grid, cv=3, scoring='accuracy')
    gs_model.fit(x_train, y_train, verbose=1)
    best_params = gs_model.best_params_
    optimized_model = build_model(
        max_tokens=best_params["max_tokens"],
        max_len=best_params["max_len"],
        dropout=best_params["dropout"],
    )
    optimized_model.fit(
        x_train, y_train, epochs=3, validation_split=0.2,
        callbacks=[tensorflow.keras.callbacks.EarlyStopping(monitor='val_loss', patience=2, verbose=1)],
    )
    model_name = "/tmp/custom-model-test"
    optimized_model.save(model_name)
    print('saved model to ', model_name)
    upload_from_directory(model_name, "[redacted Bucket name]", "custom-model-test")
    try:
        upload_blob("[redacted Bucket name]", "goback-custom-train/requirements.txt", "custom-model-test/requirements.txt")
    except Exception:
        print(traceback.format_exc())
        print('Upload failed')

This succeeds in uploading to Google Cloud Storage. It uses Keras's model.save and uploads the resulting directory to my bucket, along with a requirements.txt file inside it. To be clear, once the code block above is run, a directory custom-model-test/ is created in gs://[redacted Bucket name] containing requirements.txt and tmp/. Inside tmp/ are keras-metadata.pb, saved_model.pb, and variables/.
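
For context, here is a minimal sketch of a directory-upload helper of this kind, assuming the standard google-cloud-storage client (the actual upload_from_directory isn't shown in this issue, and it evidently preserves part of the local path, hence the tmp/ subdirectory described above):

    # Hypothetical reconstruction of the user-defined helper; the real implementation may differ.
    import glob
    import os

    from google.cloud import storage

    def upload_from_directory(local_dir, bucket_name, gcs_prefix):
        # Recursively upload every file under local_dir to gs://<bucket_name>/<gcs_prefix>/...
        client = storage.Client()
        bucket = client.bucket(bucket_name)
        for local_path in glob.glob(os.path.join(local_dir, "**"), recursive=True):
            if os.path.isfile(local_path):
                rel_path = os.path.relpath(local_path, local_dir)
                blob = bucket.blob(f"{gcs_prefix}/{rel_path}")
                blob.upload_from_filename(local_path)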

I run this container from the following code block in my Kubeflow pipeline:

    training_job_run_op = gcc_aip.CustomContainerTrainingJobRunOp(
        project=project,
        display_name=display_name,
        container_uri=training_container_uri,
        model_serving_container_image_uri=model_serving_container_image_uri,
        model_serving_container_predict_route=model_serving_container_predict_route,
        model_serving_container_health_route=model_serving_container_health_route,
        model_serving_container_ports=[8080],
        service_account="[redacted service account]",
        machine_type="n1-highmem-2",
        accelerator_type="NVIDIA_TESLA_V100",
        staging_bucket=BUCKET_NAME,
    )

For some reason, after training and saving the model artifacts (the model training logs say it completed successfully), the pipeline fails with the following error:

    " File "/opt/python3.7/lib/python3.7/site-packages/google/cloud/aiplatform/training_jobs.py", line 905, in _raise_failure "
    " raise RuntimeError("Training failed with:\n%s" % self._gca_resource.error) "
    "RuntimeError: Training failed with: "
    "code: 5
    "message: "There are no files under \"gs://[redacted Bucket name]/aiplatform-custom-training-2022-04-21-14:04:46.151/model\" to copy."
    "

What's going on here? What's the fix?
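
For context on where the op looks for artifacts: the path in the error message is the job's managed output directory. Vertex AI custom training passes that location to the container as the AIP_MODEL_DIR environment variable (gs://<staging bucket>/aiplatform-custom-training-<timestamp>/model), and the op's model-upload step fails if nothing has been written there. Below is a minimal sketch of saving to that location instead of a hand-picked bucket path, assuming the training snippet above runs inside the Vertex AI training container:

    import os

    # AIP_MODEL_DIR is set by Vertex AI for custom training jobs and points at the
    # managed gs://.../model directory the op copies artifacts from.
    # Falling back to /tmp keeps the script runnable outside Vertex AI.
    model_dir = os.environ.get("AIP_MODEL_DIR", "/tmp/custom-model-test")
    optimized_model.save(model_dir)  # tf.keras can write a SavedModel directly to a gs:// path
    print("saved model to", model_dir)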

sangstar avatar Apr 21 '22 15:04 sangstar

/assign @IronPan

jlyaoyuli avatar Apr 28 '22 22:04 jlyaoyuli

Hi there - were you able to resolve this issue? Any tips? I'm seeing the same thing on my end.

RE-Wolfe avatar Jun 11 '22 16:06 RE-Wolfe

Hi there, I ran into the same problem too. Any ideas, please?

LaoLiulaoliu avatar Sep 09 '22 13:09 LaoLiulaoliu

I'm running into a similar problem, as well. Any solutions?

vbucaj avatar Sep 28 '22 21:09 vbucaj

same problem

cutlass90 avatar Oct 29 '22 20:10 cutlass90

Same here, please keep me in the loop.

xnoar747 avatar Nov 09 '22 15:11 xnoar747

I have the same problem

charlieyang1557 avatar Mar 07 '23 09:03 charlieyang1557

CustomContainerTrainingJobRunOp was removed in GCPC 2.0.0 and replaced by CustomTrainingJobOp. Please give that a try and let us know if the issue still exists.

chensun avatar Oct 26 '23 18:10 chensun
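
For anyone migrating, here is a minimal sketch of the GCPC 2.x replacement, assuming google_cloud_pipeline_components.v1.custom_job and the same training container as above (parameter names can differ between releases and the region below is an assumption, so check the docs for your installed version):

    from google_cloud_pipeline_components.v1.custom_job import CustomTrainingJobOp

    training_job_run_op = CustomTrainingJobOp(
        project=project,
        location="us-central1",  # assumed region; not given in the original snippet
        display_name=display_name,
        worker_pool_specs=[{
            "machine_spec": {
                "machine_type": "n1-highmem-2",
                "accelerator_type": "NVIDIA_TESLA_V100",
                "accelerator_count": 1,
            },
            "replica_count": 1,
            "container_spec": {"image_uri": training_container_uri},
        }],
        base_output_directory=BUCKET_NAME,  # a gs:// URI; adjust if BUCKET_NAME is a bare bucket name
        service_account="[redacted service account]",
    )

Note that this op only runs the training job; uploading the resulting artifacts as a Vertex AI Model is handled by a separate component in GCPC 2.x.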