Clarify SYS_GET_CMDLINE return string format
The documentation for the SYS_GET_CMDLINE semihosting operation mentions that the operation "Returns the command line that is used for the call to the executable, that is, argc and argv". The return fields are then defined as:
field 1: A pointer to a null-terminated string of the command line.
field 2: The length of the string in bytes.
It seems to me there are three interpretations of this definition:
- field 1 is supposed to contain the command string before argument splitting. When using a POSIX command string, the string is parsed in steps, where field splitting (converting the single command string into a command-name string and a list of command argument strings) is somewhere in the middle. It is unclear whether the string in field 1 should be the raw, unprocessed command string or should already have gone through the processing steps before field splitting (or anything in between). For example: for the command ./app.elf "hello $(echo world)" the unprocessed command string would be ./app.elf "hello $(echo world)"\0, whereas the command string processed up to field splitting would be ./app.elf "hello world"\0. Regardless of the level of processing, this is different from argv. Field splitting and quote removal need to happen on the returned string before it can be used as argv.
- field 1 is supposed to contain a list of null-terminated strings, concatenated together. Although technically a null-terminated string is not forbidden to contain null characters, this feels like stretching the definition of field 1. For example: for the command ./app.elf "hello $(echo world)" field 1 would contain ./app.elf\0hello world\0
- field 1 is supposed to contain a list of strings, separated by spaces. This seems to be qemu's interpretation. For example: for the command ./app.elf "hello $(echo world)" field 1 would contain ./app.elf hello world\0. This form yields a null-terminated string without embedded null characters. However, splitting it back up into the original arguments is ambiguous. This can be seen from the example, where argv = {"./app.elf", "hello", "world"} or argv = {"./app.elf", "hello world"} or even argv = {"./app.elf hello world"} could be correct argument vectors that would all yield the given argument string.
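The ambiguity in the third interpretation can be shown with a minimal C sketch. The helper join_with_spaces is hypothetical and only models "glue the words together with single spaces, no quoting"; it is not taken from qemu or any other tool.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from qemu or any libc): join argc words with
 * single spaces into out, which has room for cap bytes including the
 * terminator. No quoting or escaping is applied.
 * Returns 0 on success, -1 if the buffer is too small. */
int join_with_spaces(int argc, const char *const *argv, char *out, size_t cap)
{
    size_t used = 0;

    if (cap == 0)
        return -1;
    out[0] = '\0';

    for (int i = 0; i < argc; i++) {
        size_t len = strlen(argv[i]);
        size_t sep = (i > 0) ? 1 : 0;   /* one space before every word but the first */

        if (used + sep + len + 1 > cap)
            return -1;
        if (sep)
            out[used++] = ' ';
        memcpy(out + used, argv[i], len);
        used += len;
        out[used] = '\0';
    }
    return 0;
}
```

Calling this with {"./app.elf", "hello", "world"}, {"./app.elf", "hello world"} and {"./app.elf hello world"} produces the same bytes each time, so the receiver cannot tell which argument vector was meant.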
The examples assume POSIX commands, but I think it's trivial to see how Windows cmd, PowerShell, or any other command line spec yields similar situations.
I think more clarity is needed about the format of the string returned by SYS_GET_CMDLINE. The uncertainty about the format, in my view, defeats the purpose of standardizing the command in the first place, since the string can only be parsed by making assumptions about its provider (the host machine).
Personally, I think interpretation 2 (a list of strings separated by null characters) is the simplest and most useful one. Since command names and argument strings cannot contain null characters, parsing such a string back into a list of strings is trivial.
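As a rough sketch of how simple that parsing would be, the C function below splits a buffer of concatenated null-terminated strings. The name split_nul_separated is hypothetical, and the sketch assumes that field 2 counts every byte including the final terminator, which the documentation does not state.

```c
#include <stddef.h>

/* Hypothetical parser for interpretation 2. buf and len model field 1 and
 * field 2, under the ASSUMPTION that len counts every byte, including the
 * final '\0'. Stores a pointer to each string in argv (at most max_args).
 * Returns the number of strings found, or -1 if the buffer is not
 * terminated or holds more than max_args strings. */
int split_nul_separated(char *buf, size_t len, char **argv, int max_args)
{
    int argc = 0;
    size_t start = 0;

    if (len == 0)
        return 0;
    if (buf[len - 1] != '\0')
        return -1;              /* last string is not terminated */

    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\0') {
            if (argc == max_args)
                return -1;      /* caller's array is too small */
            argv[argc++] = buf + start;
            start = i + 1;      /* next string begins after the terminator */
        }
    }
    return argc;
}
```

For the 22-byte buffer ./app.elf\0hello world\0 this yields two strings, ./app.elf and hello world, with no quoting rules involved.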
Good catch! I agree that the text "that is, argc and argv" is confusing and misleading.
The intention, and every implementation I've seen, is that SYS_GET_CMDLINE returns a single long string containing the whole command line. If an embedded program wants to contain the standard C main(int argc, char **argv) then it's responsible for splitting that single command line into argv words according to whatever rule seems sensible.
SYS_GET_CMDLINE does not return an integer usable as argc, or a list of strings ready for use as the elements of argv. That text should be changed to avoid making it look as if it does.
Thanks for the quick response!
> The intention, and every implementation I've seen, is that SYS_GET_CMDLINE returns a single long string containing the whole command line.
This is actually what my question is about. The documentation does not define what "the whole command line" or "the command line" means. My original comment lists some interpretations I could come up with for what that term could mean. Each interpretation has different consequences for what an embedded program might be able to do with the received string.
For example, qemu currently passes just the command name and arguments separated by spaces as "the command line" string, which is different from what POSIX would define as the command line string.
POSIX doesn't have any definition of a single-string command line at all. In POSIX, the command line is communicated across each exec system call as a list of separate NUL-terminated strings, so that the declaration of main() as taking an argv array reflects what's truly going on at the OS level. In a POSIX context, the only time you see a program's command line in the form of a single string is if it's input to a shell, which has to split it into argv words before it can set up the exec that runs the actual subprogram. (But the rules it uses for doing that vary between shells, and also, are interleaved with lots of other processing.)
On the other hand, on Windows, the command line is communicated across a CreateProcess Win32 API call as a single string, which the subprocess can retrieve still in its single-string form via GetCommandLine. So a single string is the native form of the command line. In a console-subsystem executable, the typical crt0 code will retrieve that command line and split it into argv words, so as to comply with the C standard which requires the arguments to main() to be in that form. But it's not necessary to split the arguments at all: an application is also welcome to keep the command line unsplit, or to split it according to conventions unlike the default crt0 ones. And some do, because the splitting done by crt0 loses information, and not every application is happy to lose that information.
The semihosting API follows Windows's convention in this respect. The command line passed across semihosting is a single string, with a single NUL terminator at the end. Questions of its semantics are left to each application to define.
If you have a tool like qemu that accepts a POSIX argument list for the semihosted program and needs to translate it into a single string, then the semihosting specification takes no position on how that should be done. I think a command-line interface for a tool like that ought to provide some way to specify the whole command line as a single string, because that's the most precise form you can specify it in. But if it also chooses to accept multiple POSIX argv words and glue them together in some way, how it does it is outside this specification.
A tool like that on Windows would surely do better to get the semihosting command line directly from the whole-string command line passed to the Windows tool – trying to recombine words from the argv generated by its crt0 would lose a lot of precision that it could have avoided losing.
Thinking about it a bit more, it sounds as if what you're really after is a specification of the convention used for breaking up the SYS_GET_CMDLINE string into argv?
If every libc's startup code did that in the same way, then tools like qemu would be able to take account of it when constructing a command string out of their argv words, and quote the string in such a way that the argv received by the semihosted program's main() contained the same words as the ones qemu had received on its command line, without any corruption in between, even if the argument words contained difficult characters like spaces or quotes.
Unfortunately, libc implementations don't agree on a standard convention for this. For example, picolibc simply breaks up the command line at spaces, with no quoting system at all, so that you just can't get a space to appear in the middle of an argv word. On the other hand, Arm Compiler 6's C library implements a quoting system using single and double quotes and backslashes, similar to POSIX in general but differing in details, and also optionally processes I/O redirection specifications by deleting them from the command line and reinitializing the stdio streams.
So there's no convention qemu could follow that makes the same effect happen in applications using both of those startup routines.
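To make the space-only splitting concrete, here is a minimal C sketch of that kind of startup-code behaviour. It is illustrative only: split_at_spaces is a hypothetical name and this is not picolibc's actual source.

```c
#include <stddef.h>

/* Illustrative only, NOT picolibc's actual code: split a command line
 * in place at space characters, with no quoting system at all.
 * Stores at most max_args word pointers in argv and returns the count.
 * Words beyond max_args are silently ignored. */
int split_at_spaces(char *cmdline, char **argv, int max_args)
{
    int argc = 0;
    char *p = cmdline;

    while (*p != '\0') {
        while (*p == ' ')               /* skip runs of separators */
            p++;
        if (*p == '\0' || argc == max_args)
            break;

        argv[argc++] = p;               /* start of the next word */
        while (*p != '\0' && *p != ' ')
            p++;
        if (*p == ' ')
            *p++ = '\0';                /* terminate the word in place */
    }
    return argc;
}
```

Given the string ./app.elf "hello world" this returns three words, with the quote characters left attached to the second and third, so a sender cannot use quotes to protect the space. A startup routine with a quoting system would split the same string into two words, which is why no single quoting choice on the sender's side works for both.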