[Feature Request] Raise exception for payloads that violate gRPC message max size
Is your feature request related to a problem? Please describe.
I ran into an issue where my code tried to send a payload larger than the 4MB supported by gRPC. The problem is that the cause wasn't at all obvious. I only figured it out after reviewing my logs and seeing the following line:
```
WARN temporal_client::retry: gRPC call respond_workflow_task_completed retried 5 times error=Status { code: ResourceExhausted, message: "grpc: received message larger than max (10248653 vs. 4194304)", metadata: MetadataMap { headers: {"content-type": "application/grpc"} }, source: None }
```
For some reason this error did not raise an exception (which I would have seen in Sentry), and in the Temporal UI all I saw were timeouts. This made it take longer to understand the root cause.
Describe the solution you'd like
I think the default data converter could check for this case and raise a clear exception to help developers quickly understand the issue. Alternatively, maybe some error from gRPC could be propagated better; I don't know the details there, but I'm sure @cretz will know best!
Here's what I added to my custom JSON encoder to add this behaviour for myself:
```python
from typing import Any, Optional

import orjson
import temporalio.api.common.v1
from pydantic.json import pydantic_encoder
from temporalio.converter import (
    JSONPlainPayloadConverter as BaseJSONPlainPayloadConverter,
)

# gRPC's default max message size (4 MiB)
MAX_PAYLOAD_SIZE = 4_194_304


class JSONPlainPayloadConverter(BaseJSONPlainPayloadConverter):
    def to_payload(self, value: Any) -> Optional[temporalio.api.common.v1.Payload]:
        serialized_value = orjson.dumps(value, default=pydantic_encoder)
        # Fail fast with a clear error instead of letting the server reject the call
        if len(serialized_value) > MAX_PAYLOAD_SIZE:
            raise RuntimeError(
                f"Payload size {len(serialized_value):,d} "
                f"exceeds max size {MAX_PAYLOAD_SIZE}"
            )
        return temporalio.api.common.v1.Payload(
            metadata={"encoding": self._encoding.encode()},
            data=serialized_value,
        )
```
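And in case it helps, here's roughly how I wire that converter into the client. This is only a sketch assuming the class above is in the same module; `SizeCheckingPayloadConverter` and `connect_client` are names I made up, and the composite-converter wiring is just how I understand custom encoding converters get registered:

```python
from temporalio.client import Client
from temporalio.converter import (
    CompositePayloadConverter,
    DataConverter,
    DefaultPayloadConverter,
)


class SizeCheckingPayloadConverter(CompositePayloadConverter):
    """Default converter set, with the size-checking JSON converter swapped in."""

    def __init__(self) -> None:
        super().__init__(
            *(
                JSONPlainPayloadConverter()  # the subclass defined above
                if isinstance(c, BaseJSONPlainPayloadConverter)
                else c
                for c in DefaultPayloadConverter.default_encoding_payload_converters
            )
        )


async def connect_client() -> Client:
    # Every workflow/activity payload then goes through the size check above.
    return await Client.connect(
        "localhost:7233",
        data_converter=DataConverter(
            payload_converter_class=SizeCheckingPayloadConverter
        ),
    )
```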
But it's not just a single payload that can cause this. What if you have two payloads that are each 2.5MB in the same gRPC call (very possible when just calling two activities)? Or other things like long string names or lots of little memo values that push it over 4MB? Also, I am guessing this is a server-side error, so the server may be able to configure the limit (though we may not expose such a knob by default). Also, your solution doesn't help people using proto payloads or byte array payloads.
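To make that concrete: the limit applies to the whole serialized gRPC request, so a check that covered these cases would have to look at the full proto message rather than an individual payload. Rough illustration only, not something the SDK does today (the constant and function name are just for the example):

```python
from google.protobuf.message import Message

# gRPC's default max receive size; servers can configure it differently.
GRPC_MAX_MESSAGE_SIZE = 4 * 1024 * 1024


def check_request_size(request: Message) -> None:
    # ByteSize() is the serialized size of the proto, i.e. roughly what goes on the wire.
    size = request.ByteSize()
    if size > GRPC_MAX_MESSAGE_SIZE:
        raise RuntimeError(
            f"{type(request).__name__} serializes to {size:,d} bytes, "
            f"which exceeds the {GRPC_MAX_MESSAGE_SIZE:,d} byte gRPC limit"
        )
```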
If the only purpose is to change the English words in your specific single-payload-too-large case from "grpc: received message larger than max" to "payload size exceeds max size", what you have is acceptable. But I am not sure we need to put this in the SDK if this is a server-side error that can come about in many ways the JSON-payload-converter check you have won't catch.
> Alternatively, maybe some error from gRPC could be propagated better,
How could the propagation improve? IIUC you're just changing the string message and stack trace; you're still gonna fail the workflow task and have to look at logs to see what happened.
@cretz, I definitely don't know the best way to do this and I agree that the solution I used is not a "final" solution that would work for the SDK. I do feel like something could be improved here, but maybe I am wrong. I'll just try to detail what I'm experiencing.
The log message comes from the worker that is executing the workflow, although I gather it is just logging an error returned by the server.
What's important to me is not that we change the message in some way, but that we clearly fail whatever task we were trying to perform rather than retrying and timing out. The timeout means we have to embark on an investigation and cross-reference with logs, while an exception would make things much clearer.
Maybe when a client making gRPC calls gets this specific error it could raise an exception rather than retrying and eventually timing out silently? If we receive an error response like this, I would think a retry will never fix the issue. I realize there are probably similar gRPC errors that should be retried, and it might be a little in-the-weeds to try to isolate this (and potentially other unfixable errors) as a case for special handling, but I also think I will not be the last one who runs across this, and others may experience the same confusion.
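Just to illustrate the kind of special-casing I mean (purely a sketch, not a proposal for the exact SDK code):

```python
import grpc


def is_unrecoverable_resource_exhausted(code: grpc.StatusCode, message: str) -> bool:
    # A "message larger than max" error will never succeed on retry,
    # unlike other ResourceExhausted causes such as rate limiting.
    return (
        code == grpc.StatusCode.RESOURCE_EXHAUSTED
        and "received message larger than max" in message
    )
```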
> but that we clearly fail whatever task we were trying to perform rather than retrying and timing out
Ah, I see now. So when we make gRPC calls, we treat some failure statuses as a temporary issue and retry the call. Other failure statuses we treat as a problem with the workflow and fail the task (which behaves just like throwing from the workflow, as you do here). ResourceExhausted is seen as a transient error even though it's not in this case. I will discuss with the team to see if there may be a better way here. I think this is an issue in all SDKs today.
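In very rough pseudocode (the actual classification lives in sdk-core and the exact status set differs, so treat this only as an illustration of the two buckets):

```python
import grpc

# Statuses treated as transient are retried with backoff; note that
# ResourceExhausted lands here today, even for "message larger than max".
RETRIED_STATUSES = {
    grpc.StatusCode.UNAVAILABLE,
    grpc.StatusCode.RESOURCE_EXHAUSTED,
}


def on_grpc_error(code: grpc.StatusCode) -> str:
    if code in RETRIED_STATUSES:
        return "retry the call"
    # Anything else is treated as a workflow problem and fails the workflow task.
    return "fail the workflow task"
```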
This is pending https://github.com/temporalio/sdk-core/issues/462