struct module documentation should have more predictable examples/warnings

Open smontanaro opened this issue 3 years ago • 9 comments

Documentation

The documentation for the struct module isn't explicit about what's expected of the various examples. Working through that in a PR...

  • PR: gh-99141

smontanaro avatar Nov 05 '22 21:11 smontanaro

+1 for the changes. The original text here, and the choice to use big-endian for the examples, dates from over 28 years ago. These days I suspect that it's rather rare that the endianness of the machine you currently happen to be working on is relevant to the data manipulation task at hand. As such, I'd consider it a best practice to always use a struct "sigil" as the first character in your format string, and if we follow that best practice in the docs then it'll help it propagate to struct users.

mdickinson avatar Nov 06 '22 10:11 mdickinson
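A minimal sketch of the practice suggested above, where every format string starts with an explicit byte-order character. The field layout and values are illustrative only and are not taken from the PR.

```python
import struct

# Explicit byte order: these results are the same on every platform.
big = struct.pack('>hhl', 1, 2, 3)      # big-endian, standard sizes, no padding
little = struct.pack('<hhl', 1, 2, 3)   # little-endian, standard sizes, no padding

print(big)     # b'\x00\x01\x00\x02\x00\x00\x00\x03'
print(little)  # b'\x01\x00\x02\x00\x03\x00\x00\x00'
```

Because the prefix also selects standard sizes, both results are 8 bytes long regardless of how wide a C long is on the host.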

Thanks @mdickinson. That suggests that I should do a bit more tweaking of both the text and the examples in my PR.

smontanaro avatar Nov 06 '22 14:11 smontanaro

I think some consensus seems to have been reached in this thread. In particular, most examples should be explicit in their layout definitions.

@cameron-simpson's suggestion sums things up nicely:

... an example with native byte order (no < or >) cannot work “as is” on all platforms. And further, having it “just work” on the commonest platform is actively misleading. I am AGAINST that.

I think the “just works” examples should all use < or >.

I think there needs to be at least one “native” example, and it should be prefaced clearly that this may well not work identically on a user’s machine because it is machine type (and compiler type) dependent.

And then it should be presented, with commentary.

I’d even advocate presenting the existing hhl example, with contradicting example outputs from different platforms. So:

  • keep the existing output, and explain the source platform and its unpadded behaviour
  • add a current example (yours or any of mine) and explain its padding behaviour

smontanaro avatar Nov 06 '22 14:11 smontanaro
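One possible shape for such a "native" example, as a sketch only. It reuses the hhl format mentioned above and deliberately avoids claiming a single expected output, since the native result depends on the machine and compiler.

```python
import struct
import sys

# Native mode (no prefix, same as '@'): size, alignment and byte order
# come from the platform and C compiler, so results differ across machines.
native = struct.pack('hhl', 1, 2, 3)

print(sys.byteorder)           # 'little' or 'big', depending on the machine
print(struct.calcsize('hhl'))  # platform dependent
print(native)                  # platform dependent

# Standard mode: always 8 bytes, regardless of the host.
portable = struct.pack('<hhl', 1, 2, 3)
print(len(portable))           # 8
```

Printing sys.byteorder and the calcsize() result alongside the packed bytes makes it easier to say which platform a given output came from.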

As I'm working through some examples (and trying to update the documentation text), I find myself confused by some of the behavior. Consider these four similar struct.pack() examples (on an Apple M1 processor):

>>> sys.byteorder
'little'
>>> struct.pack('qqh', 1, 2, 3)
b'\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00'
>>> struct.pack('qqh0q', 1, 2, 3)
b'\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00'
>>> struct.pack('>qqh0q', 1, 2, 3)
b'\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x03'
>>> struct.pack('<qqh0q', 1, 2, 3)
b'\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00'

I would have expected the 0q suffix on the format string to force padding of the output byte string to always be a multiple of eight bytes, but that's not the case:

>>> len(struct.pack('qqh', 1, 2, 3))
18
>>> len(struct.pack('qqh0q', 1, 2, 3))
24
>>> len(struct.pack('<qqh0q', 1, 2, 3))
18
>>> len(struct.pack('>qqh0q', 1, 2, 3))
18

If I don't understand what's going on here, anything I write will be gibberish...

(Edit: update my expectation to be more forceful)

smontanaro avatar Nov 06 '22 18:11 smontanaro
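The lengths above can be checked without packing anything by using struct.calcsize(). This is a sketch; the variable names are made up for illustration, and the native numbers depend on the platform.

```python
import struct

# Alignment of 'q' in native mode: a leading byte forces padding up to it.
q_align = struct.calcsize('@bq') - struct.calcsize('@q')

native_plain = struct.calcsize('@qqh')     # no trailing padding is added
native_padded = struct.calcsize('@qqh0q')  # padded to the alignment of 'q'

# With '<', '>', '=' or '!' there is no alignment, so '0q' adds nothing.
standard_sizes = [struct.calcsize(p + 'qqh0q') for p in '<>=!']

print(q_align, native_plain, native_padded)  # platform dependent
print(standard_sizes)                        # [18, 18, 18, 18]
```

Only the native (unprefixed or '@') format pads its end to the alignment of 'q'. Every prefixed format reports 18 bytes, which matches the session above.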

I switched to my Raspberry Pi, which is a little endian 32-bit ARM machine. I get confusing (to me) results there as well.

>>> sys.byteorder
'little'
>>> sys.maxsize
2147483647
>>> struct.pack('llh0l', 1, 2, 3)
b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00'
>>> struct.pack('>llh0l', 1, 2, 3)
b'\x00\x00\x00\x01\x00\x00\x00\x02\x00\x03'
>>> struct.pack('<llh0l', 1, 2, 3)
b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00'

If my reading of the documentation is correct:

To align the end of a structure to the alignment requirement of a particular type, end the format with the code for that type with a repeat count of zero.

the '0l' at the end of the format strings should force padding at the end of the byte string to that necessary for long (four bytes on the Raspberry Pi). It seems not to be working. What about after the 'h' but before another 'l'?

>>> struct.pack('<llh0ll', 1, 2, 3, 4)
b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x04\x00\x00\x00'
>>> len(struct.pack('<llh0ll', 1, 2, 3, 4))
14

Again, I see nothing to suggest the growing byte string was padded out to a four-byte boundary before the trailing b'\x04\x00\x00\x00' was appended.

In general, the 0<char> format doesn't seem to do what the docs say it should. Looking at the code in Modules/struct.c, nothing jumped out at me that suggested it was handling that case, though I haven't convinced myself I've looked everywhere.

smontanaro avatar Nov 07 '22 15:11 smontanaro
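A small sketch of the mid-format case using struct.calcsize(). The variable names are illustrative. It only compares sizes, so it should behave the same way on any platform even though the native numbers differ.

```python
import struct

# Standard mode: a zero-count code is accepted but contributes no bytes.
std_with_zero = struct.calcsize('<llh0ll')
std_without = struct.calcsize('<llhl')
print(std_with_zero, std_without)  # 14 14

# Native mode: the final 'l' is aligned anyway, so '0l' before it is redundant.
nat_with_zero = struct.calcsize('@llh0ll')
nat_without = struct.calcsize('@llhl')
print(nat_with_zero == nat_without)  # True

# Where '0l' does matter in native mode is at the end of the format.
print(struct.calcsize('@llh'), struct.calcsize('@llh0l'))  # platform dependent
```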

So my understanding of what's going on is that for non-native packing and unpacking (e.g., your examples with < and >), alignment simply doesn't come into play at all (which is why it says "None" in the "Alignment" column in the table here).

E.g., we get a simple non-padded 9-byte output from the following (on any machine, in theory, but here on macOS / 64-bit Intel):

>>> import struct
>>> struct.pack('>lbl', 1, 2, 3)
b'\x00\x00\x00\x01\x02\x00\x00\x00\x03'

For the native examples, it looks to me as though you're getting the padding that you'd expect.

mdickinson avatar Nov 07 '22 17:11 mdickinson
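The same contrast applies to padding between fields, not only at the end. A sketch using struct.calcsize(), where the native figure is platform dependent:

```python
import struct

standard = struct.calcsize('>lbl')  # 4 + 1 + 4 = 9 on every platform
native = struct.calcsize('@lbl')    # includes padding before the second 'l'

print(standard)  # 9
print(native)    # platform dependent
```

In native mode the second 'l' is placed on its natural alignment boundary, so the total is normally larger than the sum of the three field sizes.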

Possibly we could reword note 3 to emphasize the contrast with the "non-native size and alignment" comment in note 2? Something like:

3. To pad the end of a structure to the alignment requirement of a particular type when using native size and alignment, end the format with the code for that type with a repeat count of zero. See [Examples](https://docs.python.org/3.11/library/struct.html#struct-examples).

mdickinson avatar Nov 07 '22 17:11 mdickinson

Thanks @mdickinson. If I understand correctly, struct is working as it should, and users don't have complete control over endianness and padding. That's unfortunate, as that would seem to eliminate one of the stated data sources, network connections, since there's no guarantee such data would match your C compiler's struct layout.

smontanaro avatar Nov 07 '22 22:11 smontanaro

no guarantee such data would match your C compiler's struct layout.

More correctly, the programmer couldn't precisely control the format definition to match the network data layout.

smontanaro avatar Nov 07 '22 23:11 smontanaro

There's some control over padding for the non-native modes, via the x format code (which adds a byte of padding). E.g.,

>>> import struct
>>> struct.pack('>lb3xl', 1, 2, 3)
b'\x00\x00\x00\x01\x02\x00\x00\x00\x00\x00\x00\x03'

But I agree it's not ideal.

network connections, since there's no guarantee such data would match your C compiler's struct layout.

Wouldn't data streamed over the network be expected to follow a machine-agnostic format, in general? I'm not sure when one would expect it to match the layout for whatever the C compiler happens to be on your local machine.

mdickinson avatar Nov 08 '22 18:11 mdickinson
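A short round-trip sketch of the x pad byte in a non-native format. Pad bytes are written as zero bytes on packing and skipped on unpacking, so they never appear in the unpacked tuple.

```python
import struct

fmt = '>lb3xl'  # big-endian: long, byte, three explicit pad bytes, long

data = struct.pack(fmt, 1, 2, 3)
print(len(data))                 # 12
print(struct.unpack(fmt, data))  # (1, 2, 3); pad bytes are skipped on unpack
```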

Thanks @mdickinson. Apologies for the late reply. Been away.

The longer this thread continues, the more I'm beginning to believe there are two distinct use cases.

  1. Map data into and out of the format dictated by the compiler used to build Python. This is typified by the @ prefix and zero-repeat format codes. Something like llh0l is meaningful in this context.
  2. Map data into and out of specific formats needed for other uses (like network communication). For this, @ and zero-repeat formats are useless and you need to resort to more explicit format characters (endianness, specific padding with x). Assuming my compiler wants longs aligned on four bytes and I need to send/receive big-endian data across the net, the above format string would be replaced by >llhxx.

Have I got that about right? If I'm starting to think about this in the right way, I think the documentation should be explicit about these distinct use cases.

smontanaro avatar Nov 11 '22 19:11 smontanaro
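The two use cases can be put side by side in a sketch. WIRE_FORMAT is a hypothetical name and the 12-byte record is only an illustration of the >llhxx format mentioned above, not a real protocol.

```python
import struct

# Use case 1: match the local C compiler's struct layout (native mode).
# The size depends on the platform; '0l' pads the end to the alignment of 'l'.
native_size = struct.calcsize('@llh0l')

# Use case 2: a fixed wire format. Byte order is explicit and padding is
# spelled out with 'x', so the layout is identical on every machine.
WIRE_FORMAT = '>llhxx'  # hypothetical 12-byte record
record = struct.pack(WIRE_FORMAT, 1, 2, 3)

print(native_size)  # platform dependent
print(record)       # b'\x00\x00\x00\x01\x00\x00\x00\x02\x00\x03\x00\x00'
```

The explicit format always yields 12 bytes. The native one happens to match it only on platforms where a C long is four bytes wide and aligned on four bytes.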

On 11Nov2022 11:11, Skip Montanaro wrote:

The longer this thread continues, the more I'm beginning to believe there are two distinct use cases.

  1. Map data into and out of the format dictated by the compiler used to build Python. This is typified by the @ prefix and zero-repeat format codes. Something like llh0l is meaningful in this context.
  2. Map data into and out of specific formats needed for other uses (like network communication). For this, @ and zero-repeat formats are useless and you need to resort to more explicit format characters (endianness, specific padding with x). Assuming my compiler wants longs aligned on four bytes and I need to send/receive big-endian data across the net, the above format string would be replaced by >llhxx.

Have I got that about right? If I'm starting to think about this in the right way, I think the documentation should be explicit about these distinct use cases.

I'm in full agreement here.

cameron-simpson avatar Nov 11 '22 21:11 cameron-simpson

I updated the PR and marked it ready for review, should anyone be interested.

smontanaro avatar Nov 12 '22 17:11 smontanaro

@smontanaro

the more I'm beginning to believe there are two distinct use cases [...]

I can't speak for other struct users, but that matches the way I think about it.

mdickinson avatar Nov 13 '22 09:11 mdickinson

Thanks, looks like the docs PRs have been merged and backported!

hauntsaninja avatar Nov 29 '22 06:11 hauntsaninja