[WIP] Add tf swiftformer

Open · joaocmd opened this pull request 2 years ago • 91 comments

What does this PR do?

Adds the TensorFlow version of the "SwiftFormer".

Fixes #22771

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [X] Did you read the contributor guideline, Pull Request section?
  • [X] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case. https://github.com/huggingface/transformers/issues/22771
  • [x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [x] Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@amyeroberts @D-Roberts

joaocmd · May 12 '23 20:05

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

Hi @joaocmd, rapid work on opening the TF port! Let me or @Rocketknight1 know when the PR is ready for review or if you run into any issues while porting.

amyeroberts · May 15 '23 10:05

Hi @Rocketknight1, could I get some pointers as to why I'm getting errors like the following in most of the tests?

E               ValueError: Exception encountered when calling layer 'tf_swift_former_model_18' (type TFSwiftFormerModel).
E               
E               The following keyword arguments are not supported by this model: ['input_ids'].
E               
E               Call arguments received by layer 'tf_swift_former_model_18' (type TFSwiftFormerModel):
E                 • pixel_values={'pixel_values': 'tf.Tensor(shape=(13, 224, 224, 3), dtype=float32)'}
E                 • output_hidden_states=None
E                 • return_dict=None
E                 • training=False

src/transformers/modeling_tf_utils.py:500: ValueError

The PyTorch model has the following docstring, but I don't see where the input_ids part is being handled.

"""
    Here we also overwrite some of the tests of test_modeling_common.py, as SwiftFormer does not use input_ids, inputs_embeds,
    attention_mask and seq_length.
"""

Thanks!

joaocmd · May 18 '23 21:05

It seems like it is entering the else branch at line 581 of src/transformers/modeling_tf_utils.py:

if "args" in output:
    if output["args"] is not None and is_tf_symbolic_tensor(output["args"]):
        tensor_name = output["args"].name.split(":")[0]
        output[tensor_name] = output["args"]
    else:
        # `args` in this case is always the first parameter, then `input_ids`
        output["input_ids"] = output["args"]

    del output["args"]

Thus it is injecting the input_ids argument into the dictionary.

@amyeroberts @Rocketknight1 How should I get around this? It must be some misconfiguration in my tests or models.

joaocmd · May 28 '23 12:05

@joaocmd Just looking at the error and the CI runs, I think the issue might be a missing @unpack_inputs decorator on the call method of the MainLayer class.
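
In case it's useful, the expected pattern looks roughly like this (a minimal sketch — the class and argument names below are illustrative, not the PR's actual code):

import tensorflow as tf
from transformers.modeling_tf_utils import unpack_inputs

class TFSwiftFormerMainLayer(tf.keras.layers.Layer):
    @unpack_inputs  # normalizes/unpacks keyword arguments before `call` runs
    def call(
        self,
        pixel_values=None,
        output_hidden_states=None,
        return_dict=None,
        training=False,
    ):
        ...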

amyeroberts · May 30 '23 20:05

> @joaocmd Just looking at the error and the CI runs, I think the issue might be a missing @unpack_inputs decorator on the call method of the MainLayer class.

Thank you @amyeroberts! It seems like that wasn't causing any issue (yet), but thanks to your comment I found out that I had a duplicate @unpack_inputs in one of the models.

joaocmd · May 30 '23 21:05

Hi @amyeroberts and @Rocketknight1, can I get some help with the tests that are still failing? I'm getting ValueError: cannot reshape array of size 10368 into shape (3,3,3,24) for these two tests:

  • tests/models/swiftformer/test_modeling_tf_swiftformer.py::TFSwiftFormerModelTest::test_compile_tf_model
  • tests/models/swiftformer/test_modeling_tf_swiftformer.py::TFSwiftFormerModelTest::test_save_load

But I don't understand exactly what is being reshaped into the wrong shape. Could I get some insight into what these tests are doing and why they might be failing? Thanks!
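
For context, here is a paraphrased sketch of the save/load round trip these tests appear to exercise (based on skimming test_modeling_tf_common.py; `config` and `pixel_values` would come from the test's model_tester and are assumptions, not the actual test code):

import tempfile

# TFSwiftFormerModel is the class being added in this PR
model = TFSwiftFormerModel(config)
before = model(pixel_values)
with tempfile.TemporaryDirectory() as tmpdir:
    model.save_pretrained(tmpdir)  # serializes the weights by layer name
    reloaded = TFSwiftFormerModel.from_pretrained(tmpdir)
after = reloaded(pixel_values)
# the test then checks that `before` and `after` match elementwise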

joaocmd · Jun 17 '23 18:06

Hi @joaocmd, there have been some large updates to our TF models regarding how they're built - @Rocketknight1 can give you more details :)

Are these errors happening if you rebase on main?

amyeroberts · Jun 19 '23 16:06

> Hi @joaocmd, there have been some large updates to our TF models regarding how they're built - @Rocketknight1 can give you more details :)
>
> Are these errors happening if you rebase on main?

Hi @amyeroberts, I just rebased the branch. I think it's failing on the same tests, but the error on these two changed to:

NotImplementedError: Could not infer input image shape from config, please override input_signature to specify input shapes.

Looking at the stack trace it seems like the image size should have been specified:

if hasattr(vision_config, "image_size"):
    pixel_values_shape[2] = pixel_values_shape[3] = vision_config.image_size
elif hasattr(vision_config, "input_size"):
    pixel_values_shape[2] = pixel_values_shape[3] = vision_config.input_size
else:
    raise NotImplementedError( # <------ this error here
        "Could not infer input image shape from config, please override input_signature to specify input shapes."
    )

Shouldn't this also affect the original model?

joaocmd · Jun 22 '23 21:06

@joaocmd Regarding the error, no, it shouldn't affect the original model. image_size is a parameter we add to the configs even if it's not always used by the model, as it's often important for parameterizing other things or for understanding the model; we allow this. It should have been added, and we can add it in this PR, but the PT model can do without it.
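
A sketch of what adding it could look like (illustrative — the 224 default is an assumption based on the SwiftFormer checkpoints, not necessarily what this PR will use):

from transformers import PretrainedConfig

class SwiftFormerConfig(PretrainedConfig):
    model_type = "swiftformer"

    def __init__(self, image_size=224, num_channels=3, **kwargs):
        super().__init__(**kwargs)
        # stored on the config so TF dummy-input construction can infer the
        # input shape, even though the model itself may not read it
        self.image_size = image_size
        self.num_channels = num_channels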

You'll notice that the error is being raised in modeling_tf_utils.py. This is because when constructing a TF model, we have to pass in dummy inputs to build it. In PyTorch this isn't necessary, because we explicitly set the input and output dimensions when creating each layer, so the weight matrices can be created immediately. image_size is needed to know the shape of the dummy inputs to pass in.
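
Alternatively, as the error message says, input_signature can be overridden on the model class — a minimal sketch, assuming a 3-channel 224x224 input (not the final PR code):

import tensorflow as tf
from transformers import TFPreTrainedModel

class TFSwiftFormerPreTrainedModel(TFPreTrainedModel):
    @property
    def input_signature(self):
        # (batch, channels, height, width); concrete H/W let build() run
        return {
            "pixel_values": tf.TensorSpec(
                shape=(None, 3, 224, 224), dtype=tf.float32, name="pixel_values"
            )
        }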

As a side note, did you force push after rebasing? From the PR history, it looks like you might not have. As rebasing is a form of "rewriting history" it's necessary to force push.

amyeroberts · Jun 23 '23 11:06

Thanks @amyeroberts, understood. As for the rebase, I had not done one in quite some time and it seems like I did mess it up. I think that is now fixed.

Since I started this PR I have had a fundamental question about Hugging Face's approach to TensorFlow models. The default layout in TensorFlow is NHWC, while in PyTorch it is NCHW; how should I handle this difference in my PR? Based on modeling_tf_vit.py, I suppose the correct approach is to assume that images are given in the PyTorch format and transpose them in the first block - is that so? How does that affect the following blocks? Also, if we were implementing a model for semantic segmentation, which returns an image the same size as the original, would that be returned in the PyTorch format or the default TensorFlow format?

Thank you!

joaocmd · Jun 23 '23 21:06

@joaocmd The pattern we use for the TF vision models is to transpose from NCHW in the first MainLayer class (e.g. here) and then transpose back if pixel values are returned (e.g. here). For some of the older models, e.g. ViT, this pattern may not have been applied, as these were the first models to be added.

This pattern means the model is written in the TF-compatible NHWC format throughout, but all of our vision models accept and return images in NCHW.
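
In code, the pattern is roughly the following (an illustrative sketch of a MainLayer call method, not the actual resnet or SwiftFormer implementation):

def call(self, pixel_values: tf.Tensor, training: bool = False) -> tf.Tensor:
    # accept NCHW at the model boundary, compute in NHWC internally
    pixel_values = tf.transpose(pixel_values, perm=(0, 2, 3, 1))  # NCHW -> NHWC
    hidden_states = self.encoder(pixel_values, training=training)  # NHWC throughout
    return tf.transpose(hidden_states, perm=(0, 3, 1, 2))  # back to NCHW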

amyeroberts · Jun 26 '23 09:06

Thank you @amyeroberts, that makes sense. I've already updated it to match the pattern.

I'm still having some trouble with test_compile_tf_model. Initially it was failing because a shape (None, 56, 56, 48) was being passed to a reshape (https://github.com/huggingface/transformers/pull/23342/commits/204e216e6047d83775cfb5f0d928b378b73d2e84#diff-7f093399e807b53ca4b63460f610dcc550c2937cb18cd513d71dc49ce6e1b699R385). I changed the line to use [-1, width * height, channels] as the shape, which seems to have fixed that case. However, it is now failing because a shape (None, None, None, 48) is being passed to that reshape call. Is this expected of this test? According to the stack trace, it seems to be triggered by a tf.keras.Model.save() (https://github.com/joaocmd/transformers/blob/add_tf_swiftformer/tests/test_modeling_tf_common.py#L711).
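
One way around None static dimensions is to read them at runtime with tf.shape — a sketch (the function name here is hypothetical):

import tensorflow as tf

def flatten_spatial(hidden_states: tf.Tensor) -> tf.Tensor:
    # tf.shape returns runtime dimensions, so this works even when the
    # static shape is (None, None, None, C)
    shape = tf.shape(hidden_states)
    batch_size, height, width, channels = shape[0], shape[1], shape[2], shape[3]
    return tf.reshape(hidden_states, (batch_size, height * width, channels))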

I've also noticed that there was an overhaul of the serving and dummy_inputs interface (https://github.com/huggingface/transformers/commit/814de8fac7456bd2ce50d1847505da829761bfdc). But maybe @Rocketknight1 can better explain the consequences of this change for my (and other) PRs.

joaocmd · Jun 27 '23 21:06

@joaocmd Yes, there was a big refactor of the serving_output logic. For most models, there's no need to have serving_output, dummy_inputs or serving implemented. You should be able to remove these and have the test_prepare_serving_output test pass.

Looking at the CI run, I don't see test_compile_tf_model failing. Were you able to resolve it? Or perhaps are you referring to test_save_load?

amyeroberts · Jul 14 '23 19:07

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] · Aug 08 '23 15:08

Hi @amyeroberts! Sorry for the late response as I've been quite busy... It was failing more tests on my local machine than on the CI run, but after merging the main branch locally they now seem to match. I am currently struggling with test_save_load:

ValueError: cannot reshape array of size 10368 into shape (3,3,3,24)

I can't find the reason for this error. I set up a breakpoint and found that the symbolic_weight_name at that point is kernel:0, so I assume it belongs to some convolutional layer, but I didn't get any further than that. Do you have any suggestions? Thank you!

Edit:

I believe the weight belongs to the patch embeddings layer, which I initialized with a tf.keras.Sequential call:

self.patch_embedding = tf.keras.Sequential(
    [
        tf.keras.layers.ZeroPadding2D(padding=(1, 1)),
        tf.keras.layers.Conv2D(out_chs // 2, kernel_size=3, strides=2),
        tf.keras.layers.BatchNormalization(
            epsilon=config.batch_norm_eps, momentum=0.9
        ),  # FIXME: is this the equivalent momentum?
        tf.keras.layers.Activation("relu"),
        tf.keras.layers.ZeroPadding2D(padding=(1, 1)),
        tf.keras.layers.Conv2D(out_chs, kernel_size=3, strides=2),
        tf.keras.layers.BatchNormalization(
            epsilon=config.batch_norm_eps, momentum=0.9
        ),  # FIXME: is this the equivalent momentum?
        tf.keras.layers.Activation("relu"),
    ],
    name="patch_embeddings",
)

I think the problem is that both Conv2D layers are being given the same name. What is the correct approach here? Should I rewrite the PyTorch version to not use nn.Sequential?

joaocmd · Aug 13 '23 16:08

@joaocmd I would suggest rewriting the torch version to not use Sequential, but only for the purposes of debugging, i.e. we wouldn't commit these changes to main. This way you'll be able to go line by line comparing the TF and PT outputs and see where any shape differences are coming from.

amyeroberts · Aug 14 '23 09:08

Hi @amyeroberts, I might be misunderstanding something, but I think test_save_load does not test any PyTorch-to-TensorFlow equivalence. I think the problem is that when the two convolutional layers inside the Sequential module are saved, they are stored under the same name, so a shape mismatch happens. Do I understand this correctly?

joaocmd · Aug 15 '23 14:08

@joaocmd Ah, apologies, I misread your comment. Yes, I believe you're right about the naming issue. What I suggest is to follow the pattern in other ports where nn.Sequential has been used. For example in deit, for the sequential block in PT, a new layer is implemented for TF.
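
In other words, the call side would look something like this (a sketch — sublayer names like conv1/batch_norm1 are illustrative, not deit's exact code):

def call(self, x: tf.Tensor, training: bool = False) -> tf.Tensor:
    # ZeroPadding2D is stateless, so reusing a single instance is safe
    x = self.conv1(self.zero_padding(x))
    x = self.batch_norm1(x, training=training)
    x = tf.nn.relu(x)
    x = self.conv2(self.zero_padding(x))
    x = self.batch_norm2(x, training=training)
    return tf.nn.relu(x)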

amyeroberts · Aug 16 '23 15:08

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] · Sep 11 '23 08:09

@joaocmd From next week I'll be off for a few weeks. If you have any implementation questions, please ask @Rocketknight1 :)

amyeroberts · Sep 12 '23 17:09

Hi @Rocketknight1, could you give me some pointers on the three remaining tests? I haven't looked closely into test_modeling_tf_swiftformer.py::TFSwiftFormerModelIntegrationTest::test_inference_image_classification_head yet because I think it makes sense to leave that one for last, but correct me if I'm wrong.

However, I am trying to understand what is wrong with TFSwiftFormerModelTest::test_save_load (AssertionError: 0.42552373 not less than or equal to 1e-05), but I have come to no conclusion yet.

There is also this current error that might be due to some misnamed layers, but I am not sure: tests/models/swiftformer/test_modeling_tf_swiftformer.py::TFSwiftFormerModelTest::test_pt_tf_model_equivalence - AttributeError: patch_embed.patch_embeddings.0.weight not found in PyTorch model.

Thank you!

joaocmd · Sep 24 '23 16:09

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] · Oct 19 '23 08:10

@joaocmd Are you still working on this? If so, @Rocketknight1 could you help?

amyeroberts · Oct 27 '23 10:10

Hi @amyeroberts, I haven't made any changes since my last comment, as I was stuck and had some other responsibilities. I would like to finish the PR, especially because I believe it's very close to done.

joaocmd · Oct 28 '23 10:10

Hi, I'm sorry, I'm not sure how I missed your last comment - this is entirely my fault! Let me investigate the errors you were getting and I'll see if we can get this PR over the line.

Rocketknight1 · Oct 31 '23 14:10

Hi @joaocmd, I just took a look now. The cause of the errors is that weight names are not being matched up correctly between the saved checkpoint and the model. The reason for that is that, annoyingly, you need to pass name args to all the layers you create.

For example, in this block:

for i in range(len(layer_depths)):
    stage = TFSwiftFormerStage(config, index=i)
    self.network.append(stage)
    if i >= len(layer_depths) - 1:
        break
    if downsamples[i] or embed_dims[i] != embed_dims[i + 1]:
        # downsampling between two stages
        self.network.append(TFSwiftFormerEmbeddings(config, index=i))

Both TFSwiftFormerStage and TFSwiftFormerEmbeddings will need a name argument here. When setting the name args, we generally try to choose them so they match up with the PyTorch state dict, so that weights can be cross-loaded between frameworks. This means that in this case, the names would need to be network_._0, network_._1 and so on. The index should be the position of the layer in the self.network list, so that self.network[i].name == f"network_._{i}".
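
Concretely, the block above would become something like this (a sketch — indexing each name by the layer's position in self.network):

self.network = []
for i in range(len(layer_depths)):
    # the name reflects the slot this layer will occupy in self.network
    self.network.append(
        TFSwiftFormerStage(config, index=i, name=f"network_._{len(self.network)}")
    )
    if i >= len(layer_depths) - 1:
        break
    if downsamples[i] or embed_dims[i] != embed_dims[i + 1]:
        # downsampling between two stages
        self.network.append(
            TFSwiftFormerEmbeddings(config, index=i, name=f"network_._{len(self.network)}")
        )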

Hopefully after all the layers get name arguments correctly, weight loading and cross-loading should just start working! There'll probably be other issues after that, but we should be able to work through them, so feel free to ping me here if you can't figure them out - the core of the port looks quite good!

Rocketknight1 · Oct 31 '23 15:10

Hi @Rocketknight1, thanks a lot for your input! I printed out the transformed names of the pytorch and tensorflow weights and have a few questions.

First, I didn't find anything regarding the {name}_._{index} annotation. When looking at the state_dict I find names like encoder.network.0.blocks.0, but where did the underscores go?

As for the batch norm layers, like those in TFSwiftFormerPatchEmbeddingsSequential:

        self.zero_padding = tf.keras.layers.ZeroPadding2D(padding=(1, 1))
        self.conv1 = tf.keras.layers.Conv2D(out_chs // 2, kernel_size=3, strides=2, name="0")
        self.batch_norm1 = tf.keras.layers.BatchNormalization(
            epsilon=config.batch_norm_eps, momentum=0.9, name="1"
        ) # FIXME: is this the equivalent momentum?
        self.conv2 = tf.keras.layers.Conv2D(out_chs, kernel_size=3, strides=2, name="3")
        self.batch_norm2 = tf.keras.layers.BatchNormalization(
            epsilon=config.batch_norm_eps, momentum=0.9, name="4"
        ) # FIXME: is the correct momentum

Looking at the PyTorch state dict, it seems we get:

"patch_embed.patch_embedding.0.weight": "patch_embed.patch_embedding.0.weight",
"patch_embed.patch_embedding.0.bias": "patch_embed.patch_embedding.0.bias",
"patch_embed.patch_embedding.1.weight": "patch_embed.patch_embedding.1.weight",
"patch_embed.patch_embedding.1.bias": "patch_embed.patch_embedding.1.bias",
"patch_embed.patch_embedding.1.moving_mean": "patch_embed.patch_embedding.1.running_mean",
"patch_embed.patch_embedding.1.moving_variance": "patch_embed.patch_embedding.1.running_var",

Whereas in the TensorFlow version I only found:

  "patch_embed.patch_embedding.0.weight",
  "patch_embed.patch_embedding.0.bias",
  "patch_embed.patch_embedding.1.weight",
  "patch_embed.patch_embedding.1.bias",

Are those being matched up somewhere else? I'm referring especially to the moving means and variances.

Thank you!

joaocmd · Nov 04 '23 16:11

I'm now getting tests/models/swiftformer/test_modeling_tf_swiftformer.py::TFSwiftFormerModelTest::test_pt_tf_model_equivalence - AssertionError: Tuples differ: (13, 7, 7, 220) != (13, 220, 7, 7).

@amyeroberts, you mentioned this the last time I asked about model inputs/outputs:

> This pattern means the model is written in the TF-compatible NHWC format throughout, but all of our vision models accept and return images in NCHW.

My question is: where should I do the transposition? Should I loop over the hidden states after running the encoder and transpose each one? Thank you!

joaocmd · Nov 04 '23 18:11

@joaocmd Transposing the input (NCHW) to the TF-compatible format (NHWC) should be done in the ModelNameMainLayer class, e.g. like here for resnet, and the outputs should be permuted back to NCHW before returning.
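
For the hidden states specifically, that does mean transposing each one before returning — a sketch (names are illustrative):

# inside MainLayer.call, after running the encoder in NHWC
last_hidden_state = tf.transpose(encoder_outputs[0], perm=(0, 3, 1, 2))
hidden_states = None
if output_hidden_states:
    # each hidden state is transposed individually back to NCHW
    hidden_states = tuple(
        tf.transpose(h, perm=(0, 3, 1, 2)) for h in encoder_outputs[1]
    )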

amyeroberts · Nov 06 '23 14:11