python-xdis Detect and fix invalid variable names

This is my attempt at implementing the fix to the issue discussed in https://github.com/rocky/python-xdis/issues/58

It does seem to work but there are four things I would like to discuss:

I'm only considering the code.co_names list here. I'm not extremely familiar with Python code objects. Can or should I look at any of the other lists of names as well?
Is there a better way to modify the code object than the reconstruction I'm doing in code.py:683-700?
What should the default for warning and fixing variable names be? Here I went with True/True
Can/should I add tests for this. I have added three files 02_invalid_variable_name{1,2,3}.pyc containing variables named y = "Hello" # (code), for (identifier) and 0x (leading digit) respectively but should I add them to the test suite somehow?

Apr 11 '20 23:04 ZetaTwo

Definitely should have co_cellvars, co_freevars, co_names, co_varnames. But to be extra careful and safe I'd add co_name which uncompyle6 and others might use that in reconstructing source code.
Was answered elsewhere
For defaults, I don't like autofix, so warn by default, but don't fix by default. The disassemblers are free to change their defaults, but shouldn't do that here because this works with bytecode.
Yes, absolutely tests are great! Thanks! pytest testing is in the pytest directory.

Apr 11 '20 23:04 rocky

I want to go over this carefully, so it may take a day or so. Looks interesting so far. Thanks for undertaking.

Apr 11 '20 23:04 rocky

Great. Are all of them also suitable for repairing? Or are could they somehow be referencing external things?
How should I proceed with this point? Make something that works for now and then make a separate PR and go back and fix this or expand the scope of this PR to include that?
Makes sense, I'll set default for warning=True and fixing=False. I left it to True/True in the command line tool. Should I change it to True/False there as well?
I added tests which seem to work but it also seems they exposed another issue with point 2 as now the following test fails due to the co_flags value being changed for some reason. I don't really understand what's gonig on though.

$ python3 -m pytest
pytest/test_disasm.py::test_funcoutput[test_tuple2-disassemble_file] FAILED
...
E           # Stack size:        1
E         - # Flags:             0x00000040 (NOFREE)
E         ?                              -   ^^^^^^
E         + # Flags:             0x00000000 (0x0)
E         ?                               +  ^^^
E           # First Line:        4
E           # Constants:

Yeah, I'm done for today so feel free to look through everything in your own pace and get back to me with comments.

Apr 12 '20 00:04 ZetaTwo

I just had a chance to look try this purely from a user level. Here are some observations..

The thing that was the biggest surprise is that without any options when run on test/bytecode_3.7/02_invalid_variable_name1.pyc there was no warning but the invalid name in slot 2 of names was changed anyway. I would have expected that it would not change, or better, by default warn but not change the name. I see that you refer to this as item 3 above.

When the --warn-invalid-vars option is given, then there is a warning, but we don't capture which variable(s) were detected and what they were changed from.

Having now tried this as a user, it feels to me that

  --warn-invalid-vars / --nowarn-invalid-vars
                                  warn about invalid variable names
  --fix-invalid-vars / --nofix-invalid-vars
                                  fix the names for variables with invalid
                                  names

should be choice options, not boolean flags. Something like

--check = {none|warn|fix...}

And although this might not be of concern for obfuscated code, there are a number of other checks that might be done on the sanity of the bytecode in the future. In a really secure environment, it might be nice to know whether bytecode is well formed. For example, if you get the stacksize wrong and it is too small, Python will SEGV with no information whatsoever.

Although it might be hard to do, it is within the realm of possibility that malware might be able to craft bytecode in such a was so as to take advantage of this. I would be interested if someone has tried to do something like this.

As for your specific questions...

Are all of them also suitable for repairing? Or are could they somehow be referencing external things?

All of them are suitable for repairing. I'll try to create some bytecode with all of the kinds of flaws mentioned, to make it easier to understand what I'm talking about.

How should I proceed with this point? Make something that works for now and then make a separate PR and go back and fix this or expand the scope of this PR to include that?

Here's a suggestion. I've added this as a branch of xdis currently. Let's make branches off of this or fix directly on the branch as the need demands. I can make PR's against this branch in your repository. For example, expect one which contains additional invalid bytecode for testing.

When everything is settled I'll merge this branch into master. While I do appreciate the great work on this so far, based on having tried this, I don't think it's ready to go into master just yet. I think things are a little more complicated and involved than either of us had imagined initially. That said, I do think we'll get there.

How does this sound? If this doesn't work for you, then make a suggestion.

Makes sense, I'll set default for warning=True and fixing=False. I left it to True/True in the command line tool. Should I change it to True/False there as well?

Should be answered above. If this is not clear or you think otherwise, let me know.

... but it also seems they exposed another issue with point 2 as now the following test fails due to the co_flags value being changed for some reason. I don't really understand what's gonig on though.

~~I am not able to reproduce this problem when I run pytest. But i know that Flags: 0x00000040 (NOFREE) is the right thing and the other while not wrong is not optimal.~~

I see the CI failure with 3.8. I'll address this later.

Apr 12 '20 23:04 rocky

I think the test failure I am seeing is related to the fact that code types change between 3.7 and 3.8 yet again.

~~The remedy I'm going to work with is to do add the replace() method and the convert to local-type method (the latter being the more relevant to this particular failure) in order to fix.~~

The better remedy is to use a more portable code type, and then "#replace" isn't needed. (However I'll add that so as to be more compatible with 3.8+ conventions).

Apr 13 '20 15:04 rocky

This is to give you a head's as to what's going on in xdis and how this fits in.

A long overdue overhaul of xdis's portable code types is underway in branch feature/code-replace.

Independent of this, I realize that what should have been done in the PR is not use Python's type.CodeType but one of the portable variants that I am currently cleaning up.

The disassemblers and deparsers all can use the portable version just as easily as the native version.

When the refactoring of the code type is done, then this PR will have to be redone, although the changes caused by refactoring will be slight.

Apr 14 '20 17:04 rocky

When the refactoring of the code type is done, then this PR will have to be redone, although the changes caused by refactoring will be slight.

Great. Do you have any rough estimate on when you will do this? I think that if it is in any reasonable near future I might as well just hold off on this PR completely and then just update it using the updated portable type instead.

Apr 15 '20 12:04 ZetaTwo

I'll probably be done by the end of the week. As part of this I will be adding some small kinds of validity checks, such as for the types of various parameters.

After that there is some opcode stack manipulation tracking information that I want to add.

And then a new release which will be a major version bump release.

So my thought is that if this can hold off, I do the major release with some version checking. Wait for that to shake out a little, and then add this.

Your thoughts?

Apr 15 '20 12:04 rocky

Yeah that sounds good. As soon as the interface is set I can update this and include the other changes you have mentioned above and then you can just hold off release to whenver it's suitable.

Apr 15 '20 12:04 ZetaTwo

Hi again @rocky ! I thought about giving this another look. What is the status of the code object interface at the moment? Do you think the base is there for me to finish this work?

Oct 31 '21 15:10 ZetaTwo

Hi again @rocky ! I thought about giving this another look. What is the status of the code object interface at the moment? Do you think the base is there for me to finish this work?

Sure, I think so.

Nov 01 '21 06:11 rocky