websockets feature - avoid utf-8 decoding for text frames

Just because the payload is supposed to be UTF-8 doesn't mean I want it decoded to str. Specifically, my use case is handing the data to orjson and passing it around as an orjson.Fragment().

Here are the docs for that use case:

https://github.com/ijl/orjson#deserialize https://github.com/ijl/orjson#fragment

Looking at the websockets code, if such a capability were to be implemented, it seems like we'd want to add a flag to WebSocketCommonProtocol() and then use it to force binary at the point where it decides whether to decode the payload, located here:

https://github.com/python-websockets/websockets/blob/main/src/websockets/legacy/protocol.py#L1053
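To make the idea concrete, here is a rough sketch of the kind of decision I mean. It is not the actual websockets code: frame_to_message and decode_text are hypothetical names used only for illustration, and the opcode constants are the Text and Binary values from RFC 6455.

```python
# Opcodes for data frames as defined in RFC 6455.
OP_TEXT = 0x1
OP_BINARY = 0x2


def frame_to_message(opcode: int, payload: bytes, decode_text: bool = True):
    """Hypothetical helper (not part of websockets): turn a frame into a message.

    With decode_text=True, Text frames are decoded to str, as happens today.
    With decode_text=False, the raw UTF-8 bytes are returned untouched.
    """
    if opcode == OP_TEXT and decode_text:
        return payload.decode("utf-8")
    # Binary frames, or Text frames when decoding is disabled, stay as bytes.
    return payload
```

With the flag turned off, a Text frame carrying JSON would come back as bytes, ready to be handed to a parser that accepts bytes directly.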

I'd be happy to whip up a patch in case you would consider this feature request.

Jun 26 '23 05:06 toppk

I understand your use case and, indeed, you cannot do this with the current API.

For receiving frames, it would mean an API like websocket.recv(decode_text_frames=False). (Naming TBC.) Can you confirm that it's what you want? (Then you get bytes in all cases, so you cannot tell whether it was a Text or Binary frame in the first place; but you don't really care anyway.)

This raises the question of providing a symmetrical API for sending bytes (assumed to be valid UTF-8) as a Text frame. You didn't ask for this but I'd like to keep consistency between both sides.

Jun 26 '23 10:06 aaugustin

That would work quite well. I guess I misunderstood the code, because it looked to me as if the recv() method is decoupled from where the actual processing of inbound data happens (read_message()). The solution you propose would certainly be more flexible.

Jun 26 '23 19:06 toppk

Just thinking about the send side, I think it really is less important. There aren't too many servers that are strict in what they accept, especially when they are expecting text. If we do implement it for send, the effect is the same (skip encode, skip decode), but the names of the options will be different, e.g. decode_text_frames=False for recv() and send_as_text=True for send().

Jun 28 '23 02:06 toppk

Yes, we need to pick the names for both sides carefully and, ideally, consistently.

raw_utf8 is a name that could work for both sides. I'm not sure it's the best name we can find, though.

If we have two names, I'd like some symmetry e.g. using the words decode and encode.

Jun 28 '23 07:06 aaugustin

I'm finding myself in the same position, trying to send data encoded with orjson as a text frame even when it is provided to websockets in binary form.

Any chance this gets added?

Sep 20 '23 17:09 carlos-sarmiento

This will be added as part of the new asyncio implementation (#1332).

Aug 07 '24 06:08 aaugustin

The new asyncio implementation supports recv(decode=False), which is the original request here.

(Also recv(decode=True) for the opposite behavior.)
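A minimal sketch of how this could be used for the JSON use case from the original request. It assumes a connection object whose recv() accepts decode, as described above; recv_json and FakeConnection are hypothetical names, and the standard json module stands in for orjson so the example runs without extra dependencies.

```python
import asyncio
import json


async def recv_json(connection):
    """Receive one message as raw bytes and parse it without a str round trip."""
    # decode=False asks for the payload as bytes, even for a Text frame.
    payload = await connection.recv(decode=False)
    # json.loads accepts bytes; orjson.loads(payload) would be the orjson equivalent.
    return json.loads(payload)


class FakeConnection:
    """Stand-in for a real connection (an assumption, for demonstration only)."""

    async def recv(self, decode=None):
        data = b'{"price": 42}'  # pretend this arrived in a Text frame
        return data if decode is False else data.decode("utf-8")


if __name__ == "__main__":
    print(asyncio.run(recv_json(FakeConnection())))
```

With a real connection, the same bytes could instead be wrapped in orjson.Fragment() and passed along without ever being parsed.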

I'm not planning to work on the other features discussed above, notably send(), until someone has a use case.

Aug 07 '24 16:08 aaugustin