CMD compatibility improvements

Open maxpat78 opened this issue 1 year ago • 2 comments

  • mslex version: 1.2.0
  • Python version: 3.13
  • Operating System: Windows 11 23H2

I'd like to bring to your attention the following test cases, which I came across while developing my parser w32_lex.

With the first 4, mslex differs completely from what CMD does.

The next 4 differ only because mslex does not perform %variables% substitution, but it could.

The last 7 point to an approach that perhaps should be modified in mslex: a&&b etc. should isolate special operators like &&, as in a && b, to give the user better control.

note: mslex splits ":dir" differently: [':dir'] instead of []
note: mslex splits ";;;a,, b, c===" differently: [';;;a,,', 'b,', 'c==='] instead of ['a,,', 'b,', 'c===']
note: mslex splits "^;;a" differently: [';;a'] instead of [';', ';a']
note: mslex splits "a "<>||&&^" differently: ['a', '<>||&&'] instead of ['a', '<>||&&^']

note: mslex splits "dir %a%" differently: ['dir', '%a%'] instead of ['dir', '!subst!']
note: mslex splits "dir ^%a%" differently: ['dir', '%a%'] instead of ['dir', '!subst!']
note: mslex splits "dir %%a%%" differently: ['dir', '%%a%%'] instead of ['dir', '%!subst!%']
note: mslex splits "a "<>||&&%A%"" differently: ['a', '<>||&&%A%'] instead of ['a', '<>||&&!subst!']

note: mslex splits "@a	==b c" differently: ['@a', '==b', 'c'] instead of ['@', 'a', '==b', 'c']
note: mslex splits "a|b" differently: ['a|b'] instead of ['a', '|', 'b']
note: mslex splits "a||b" differently: ['a||b'] instead of ['a', '||', 'b']
note: mslex splits "a&b" differently: ['a&b'] instead of ['a', '&', 'b']
note: mslex splits "a&&b" differently: ['a&&b'] instead of ['a', '&&', 'b']
note: mslex splits "a>b" differently: ['a>b'] instead of ['a', '>', 'b']
note: mslex splits "a>>b||c" differently: ['a>>b||c'] instead of ['a', '>>', 'b', '||', 'c']

Thanks for your attention.

maxpat78 avatar Oct 10 '24 13:10 maxpat78

Thanks for the report. It sounds like you are working on a much more complete parser of the Windows command language than mslex. The goal of mslex is only to parse string literals correctly, and to accurately detect anything that is not a string literal. So performing %variable% substitution would be out of scope, but mslex.split should raise MSLexError("Unquoted CMD metacharacters in string") if substitution would have been required.

For your examples, how did you determine what the correct answer should be?

For the first three examples, mslex appears to give the correct answer:

C:\Users\larry>python -c "import sys; print(repr(sys.argv[1:]))" :dir
[':dir']

C:\Users\larry>python -c "import sys; print(repr(sys.argv[1:]))" ;;;a,, b, c===
[';;;a,,', 'b,', 'c===']

C:\Users\larry>python -c "import sys; print(repr(sys.argv[1:]))" ^;;a
[';;a']

The fourth one does appear to be incorrect!

C:\Users\larry>python -c "import sys; print(repr(sys.argv[1:]))" a "<>||&&^
['a', '<>||&&^']
>>> mslex.split('a "<>||&&^')
['a', '<>||&&']

smoofra avatar Oct 11 '24 12:10 smoofra

I've tested various CMD versions, issuing command lines directly by hand and with the help of a simple Windows app that shows lpCmdLine untouched (details and C source code are on my project's page). This is what I found:

  • : (the label operator in BAT files) makes the parser ignore a line entered at the prompt;
  • one or more leading ; , = characters (that is, before the first command) are ignored;
  • ^;;a is interpreted (strangely, I admit) as "invoke external program ; with argument ;a";
  • the 4th reflects a clear rule: the first non-escaped quote " starts a quoted block, which ends at the next quote (which cannot be escaped) or at the end of the line, so the trailing caret ^ has to be in the result: a "<>||&&^ -> [CMD] -> a "<>||&&^ -> [external CommandLineToArgvW] -> [a] + [<>||&&^]
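
The quoting rule in the last bullet can be sketched in a few lines of Python. This is only an illustration of the rule as stated above, not code from mslex or w32_lex: the function name strip_cmd_carets is hypothetical, only caret and quote handling is modeled (no leading-separator, label, or %variable% processing), and a trailing caret outside quotes is simply dropped, which is an assumption made to keep the sketch short.

```python
def strip_cmd_carets(line: str) -> str:
    """Remove CMD caret escapes that sit outside quoted blocks.

    Hypothetical helper: models only the caret/quote rule, nothing else.
    """
    out = []
    in_quotes = False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            # Inside a quoted block everything is literal, carets included.
            out.append(ch)
            if ch == '"':
                in_quotes = False  # the next quote always closes the block
        elif ch == "^":
            # Outside quotes the caret is consumed and the next character
            # is taken literally (so ^" does not open a quoted block).
            i += 1
            if i < len(line):
                out.append(line[i])
        elif ch == '"':
            in_quotes = True
            out.append(ch)
        else:
            out.append(ch)
        i += 1
    return "".join(out)


print(strip_cmd_carets('a "<>||&&^'))  # the caret survives inside the quotes
print(strip_cmd_carets("dir ^%a%"))    # the caret is consumed outside quotes
```

Under this reading, the unterminated quoted block in the 4th case runs to the end of the line, so the caret reaches the external program unchanged.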

maxpat78 avatar Oct 11 '24 13:10 maxpat78

Also, don't forget shlex's punctuation_chars option, and cases like:

note: mslex splits "copy c:\a c:\b&&b" differently: ['copy', 'c:\\a', 'c:\\b&&b'] instead of ['copy', 'c:\\a', 'c:\\b', '&&', 'b']

Here the mslex result can seem counter-intuitive and wrong from the user's perspective (the CMD user does not mean to copy to a file named b&&b)!
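
For comparison, this is roughly how the standard library's shlex isolates operators when punctuation_chars is enabled. It shows stdlib behavior only, not anything mslex does today. The helper name split_with_operators is hypothetical, and non-POSIX mode with whitespace_split is chosen here on the assumption that backslashes in Windows paths should be left alone.

```python
import shlex


def split_with_operators(line: str) -> list:
    """Split a line, returning runs of ();<>|& as separate tokens.

    Hypothetical helper built on shlex.shlex; shlex follows its own
    rules, not CMD's, so this only illustrates operator isolation.
    """
    lexer = shlex.shlex(line, posix=False, punctuation_chars=True)
    # Split words on whitespace only, so ':' and '\\' stay inside a token.
    lexer.whitespace_split = True
    return list(lexer)


print(split_with_operators("a&&b"))
print(split_with_operators(r"copy c:\a c:\b&&b"))
```

In non-POSIX mode shlex keeps quote characters in the tokens it returns and treats # as a comment starter, so this is not a drop-in CMD tokenizer.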

maxpat78 avatar Oct 14 '24 06:10 maxpat78

OK, I understand what you're saying about those examples now. CMD has rules that apply to the first word, the executable name, that do not apply to arguments. Heretofore, mslex has only concerned itself with arguments. It may make sense to add some flag to handle the first-word case. I'm not sure. Mslex is not meant to be a full implementation of the CMD language, but if there are ways of expressing a literal path to an executable that CMD recognizes, then mslex should too.

On to UCRT.

UCRT does indeed have a different behavior than msvcrt.

It does not exhibit the crazy modulo 3 periodic behavior of msvcrt.

This document seems to actually explain it clearly enough to write a splitter for it:

https://learn.microsoft.com/en-us/cpp/c-language/parsing-c-command-line-arguments?view=msvc-170
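
Going by the rules as that page describes them, a splitter for the arguments after the program name could be sketched roughly as follows. This is an illustration under stated assumptions, not mslex's implementation: the function name split_ucrt_args is hypothetical, the special handling of argv[0] is left out, and the rules encoded are this sketch's reading of the document rather than something verified against the runtime itself.

```python
def split_ucrt_args(s: str) -> list:
    """Split the argument part of a command line (hypothetical helper).

    Rules modeled: space/tab delimiters, "..." groups, "" inside quotes
    as a literal quote, and backslashes special only before a quote.
    """
    args = []
    i, n = 0, len(s)
    while True:
        while i < n and s[i] in " \t":
            i += 1  # skip delimiters between arguments
        if i >= n:
            return args
        buf = []
        in_quotes = False
        while i < n:
            c = s[i]
            if c in " \t" and not in_quotes:
                break  # unquoted whitespace ends the argument
            if c == "\\":
                j = i
                while j < n and s[j] == "\\":
                    j += 1
                count = j - i
                if j < n and s[j] == '"':
                    # One backslash per pair; an odd one escapes the quote.
                    buf.append("\\" * (count // 2))
                    if count % 2:
                        buf.append('"')
                        j += 1
                else:
                    buf.append("\\" * count)  # literal backslashes
                i = j
            elif c == '"':
                if in_quotes and i + 1 < n and s[i + 1] == '"':
                    buf.append('"')  # "" inside quotes is a literal quote
                    i += 2
                else:
                    in_quotes = not in_quotes
                    i += 1
            else:
                buf.append(c)
                i += 1
        args.append("".join(buf))


print(split_ucrt_args('a "<>||&&^'))
```

An unterminated quoted string runs to the end of the line and the caret is not treated as an escape, which is consistent with the sys.argv output shown earlier for a "<>||&&^.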

There's enough of an intersection between the two that mslex.quote should be able to create quoted literals that will be parsed correctly by either C runtime. I'll have to add another flag to split to reflect the difference.

I also would really like to find a way to call UCRT's splitter from python. Right now I'm relying on collecting a giant table of examples in order to test. If you know how, let me know please.

smoofra avatar Oct 14 '24 14:10 smoofra

Look at https://github.com/smoofra/mslex/pull/14#issuecomment-2408409434

It seems the modern UCRT C++ code does the same things the old 2005 C code did.

maxpat78 avatar Oct 14 '24 15:10 maxpat78

This group is explicitly out of scope for mslex. It raises "Unquoted CMD metacharacters in string", which is all it promises to do on such strings.

note: mslex splits "dir %a%" differently: ['dir', '%a%'] instead of ['dir', '!subst!']
note: mslex splits "dir ^%a%" differently: ['dir', '%a%'] instead of ['dir', '!subst!']
note: mslex splits "dir %%a%%" differently: ['dir', '%%a%%'] instead of ['dir', '%!subst!%']
note: mslex splits "a "<>||&&%A%"" differently: ['a', '<>||&&%A%'] instead of ['a', '<>||&&!subst!']

smoofra avatar Oct 15 '24 13:10 smoofra

mslex appears to parse this correctly:

note: mslex splits "@a	==b c" differently: ['@a', '==b', 'c'] instead of ['@', 'a', '==b', 'c']
z:\src\mslex>.\tests\cmdline.py @a ==b c
{
 "GetCommandLineW": "\"C:\\Program Files\\Python312\\python.exe\" \"Z:\\src\\mslex\\tests\\cmdline.py\"  @a ==b c",
 "CommandLineToArgvW": [
  "C:\\Program Files\\Python312\\python.exe",
  "Z:\\src\\mslex\\tests\\cmdline.py",
  "@a",
  "==b",
  "c"
 ],
 "sys.argv": [
  "Z:\\src\\mslex\\tests\\cmdline.py",
  "@a",
  "==b",
  "c"
 ]
}

smoofra avatar Oct 15 '24 13:10 smoofra

And these are all out of scope too, because they contain unquoted metacharacters:

note: mslex splits "a|b" differently: ['a|b'] instead of ['a', '|', 'b']
note: mslex splits "a||b" differently: ['a||b'] instead of ['a', '||', 'b']
note: mslex splits "a&b" differently: ['a&b'] instead of ['a', '&', 'b']
note: mslex splits "a&&b" differently: ['a&&b'] instead of ['a', '&&', 'b']
note: mslex splits "a>b" differently: ['a>b'] instead of ['a', '>', 'b']
note: mslex splits "a>>b||c" differently: ['a>>b||c'] instead of ['a', '>>', 'b', '||', 'c']

smoofra avatar Oct 15 '24 13:10 smoofra

I'm not convinced it makes sense to add support for CMD's special rules about program names to mslex. It's not clear to me where I would draw the line between that and turning mslex into a full batch file parser. I also don't know where I can find any documentation of what CMD's rules actually are.

If you'd like to propose a way of doing it in a merge request, I'd consider it, but I don't think I'll be writing that feature.

Thanks again for the report

smoofra avatar Oct 15 '24 13:10 smoofra