pycstruct Encoding for C-struct syntax

Sorry to spam, but this makes me think a lot :-)

I have a few proposals about the way encoding is handled with the C syntax.

struct person 
{ 
    char name[50];
    char single_char;
};

A single char should be decoded like an array. I think that is not the case right now.
I think there is no real need to support signed char/unsigned char. As I saw in the "Parse source code files" documentation, it's probably not very good C programming style to specify the sign of a char. For that there is the byte type, which is an int8. I think char should only be used for characters.

I also saw that the library uses utf-8 as the default. That is probably not a good idea: the data could just as well be latin1, or many other encodings. It's also possible that the encoding stays unknown until the structure is read.
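To illustrate why the default matters, here is a minimal sketch in plain Python (standard library only, not pycstruct code; the helper name try_decode is made up for this example). The same bytes either fail to decode or decode to something different depending on the encoding that is assumed:

```python
def try_decode(raw, encoding):
    """Return the decoded text, or None if the bytes are invalid for this encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# "cafe" with an accented e, as a Latin-1 system would store it (one byte per character).
latin1_bytes = b"caf\xe9"
# The same text as a UTF-8 system would store it (the accented e takes two bytes).
utf8_bytes = b"caf\xc3\xa9"

print(try_decode(latin1_bytes, "latin-1"))  # decodes correctly
print(try_decode(latin1_bytes, "utf-8"))    # None: 0xE9 alone is not valid UTF-8
print(try_decode(utf8_bytes, "latin-1"))    # decodes without error, but to the wrong text
```

Latin-1 never fails, since every byte maps to a character, so a wrong guess in that direction goes unnoticed.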

Thinking about that, and following the way packing is handled, maybe this could be generalized to encoding, for example with a dedicated pragma.

That way it could be switched in the middle of the description if needed.

#pragma encoding("utf-8")
struct person 
{ 
    char name[50];
#pragma encoding("raw")
    char single_char;
};
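
A rough sketch of how such a pragma could be tracked while scanning the source, in plain Python. Everything here is hypothetical: the encoding pragma is only the proposal from this post, not something pycstruct supports, and the function and regular expressions are invented for illustration (they only handle simple one-line char declarations):

```python
import re

# Hypothetical pragma from the proposal above: #pragma encoding("name")
PRAGMA_RE = re.compile(r'#pragma\s+encoding\("([^"]+)"\)')
# Very simplified char member: "char name;" or "char name[50];"
FIELD_RE = re.compile(r'char\s+(\w+)(?:\[(\d+)\])?\s*;')


def collect_char_encodings(source, default="utf-8"):
    """Map each char member to the encoding active at its declaration."""
    current = default
    fields = {}
    for line in source.splitlines():
        line = line.strip()
        pragma = PRAGMA_RE.match(line)
        if pragma:
            # A pragma changes the encoding for all following members.
            current = pragma.group(1)
            continue
        field = FIELD_RE.match(line)
        if field:
            fields[field.group(1)] = current
    return fields


SOURCE = '''
#pragma encoding("utf-8")
struct person
{
    char name[50];
#pragma encoding("raw")
    char single_char;
};
'''

print(collect_char_encodings(SOURCE))
```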

That said, I don't have such problems for now, so these are just proposals.

Feb 15 '21 21:02 vallsv

Hi,

Interpreting unsigned/signed char as an array of numbers instead of utf-8 strings was introduced after a suggestion in the following issue:

https://github.com/midstar/pycstruct/issues/11

I think this is a nice "hack" to tell pycstruct what type of char array you would like to use. Thus I think that pragmas in the source code are unnecessary.

I'm not sure what you mean by "a single char should be decoded like an array". A single char is decoded as an int8 and an "unsigned char" is decoded as a uint8. Note that there is no type called 'byte' in the standard C language. The standard type for a byte is 'char'.
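
For comparison, the same three interpretations of one byte can be shown with Python's standard struct module. This is only an analogy using the standard library, not pycstruct's internal code:

```python
import struct

raw = b"E"  # a single byte, as stored for "char single_char;"

as_int8 = struct.unpack("b", raw)[0]   # signed 8-bit integer
as_uint8 = struct.unpack("B", raw)[0]  # unsigned 8-bit integer
as_char = struct.unpack("c", raw)[0]   # bytes object of length 1

print(as_int8, as_uint8, as_char)

# Signed and unsigned only differ for byte values above 127.
high = b"\xe9"
print(struct.unpack("b", high)[0], struct.unpack("B", high)[0])
```

For ASCII characters the signed and unsigned readings give the same number; they diverge only for bytes with the top bit set.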

I do agree that it would be good to also support older encoding schemes for legacy systems (note that utf-8 is more or less standard nowadays). To support this I suggest:

Add more "string types" to the StructDef add method, for example 'latin1' or whatever.
Add an argument to parse_file and parse_str where you specify the default encoding (char_array_encoding). My guess is that a given system uses the same encoding everywhere. It should also be possible to turn char_array_encoding off (=None) to tell the parser not to generate any strings at all, only arrays of signed or unsigned bytes.
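
A minimal sketch of the decoding behaviour these two suggestions describe, in plain Python. The helper decode_char_array and its signed flag are hypothetical and only illustrate the idea; char_array_encoding is the argument name proposed above, not an existing pycstruct parameter. Stopping at the first NUL byte is also an assumption made for this sketch:

```python
def decode_char_array(raw, char_array_encoding="utf-8", signed=True):
    """Decode a C char array either as text or as a list of byte values.

    char_array_encoding=None means "no strings at all": return integers.
    """
    if char_array_encoding is None:
        if signed:
            # Reinterpret each byte as a signed 8-bit value.
            return [b - 256 if b > 127 else b for b in raw]
        return list(raw)
    # Assumption: the string ends at the first NUL byte, as in C.
    text = raw.split(b"\x00", 1)[0]
    return text.decode(char_array_encoding)


# An 8-byte buffer holding a NUL-padded Latin-1 string.
buf = b"Ren\xe9".ljust(8, b"\x00")

print(decode_char_array(buf, "latin-1"))
print(decode_char_array(buf, None))
print(decode_char_array(buf, None, signed=False))
```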

Feb 20 '21 15:02 midstar

Oops. I have probably mixed a few things together in my mind.

In fact, I saw that a

char foo;

was deserialized as an int, while my C header documentation expected a character such as 'E'.

That is why I think it makes sense to decode it, but I understand it's not so easy if there is no dedicated type to distinguish strings from values.
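
In the meantime, the integer can be turned back into a character on the user side. A small sketch in plain Python; the helper int8_to_char is made up for this example, and latin-1 is only a convenient default because it maps every byte to one character:

```python
def int8_to_char(value, encoding="latin-1"):
    """Convert a value read as int8 (or uint8) back to a one-character string."""
    # Masking with 0xFF maps negative int8 values back to their byte value.
    return bytes([value & 0xFF]).decode(encoding)


print(int8_to_char(69))   # the int that a "char foo;" holding 'E' is read as
print(int8_to_char(-23))  # a negative int8, i.e. byte 0xE9
```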

A set of extra configuration options, as you suggest, sounds like a good way to specify how to handle a few cases, such as which encoding to use or how to handle a single char.

Feb 20 '21 18:02 vallsv