Some of the generated intrinsic input/output types for x86 instructions are wrong

Open xusheng6 opened this issue 1 year ago • 0 comments

I noticed this while working on the recent xed update. For example, for the instruction lgdt [eax+0x10] (0f015010), xed reports that the input is a 48-bit struct:

```
% obj/wkit/examples/obj/xed-ex1 0f015010
Attempting to decode: 0f 01 50 10 
iclass LGDT	category SYSTEM	ISA-extension BASE	ISA-set I286REAL
instruction-length 4
operand-width 32
effective-operand-width 32
effective-address-width 32
stack-address-width 32
iform-enum-name LGDT_MEMs
iform-enum-name-dispatch (zero based) 0
iclass-max-iform-dispatch 2
Nominal opcode position 1
Nominal opcode 0x01
Operands
#   TYPE               DETAILS        VIS  RW       OC2 BITS BYTES NELEM ELEMSZ   ELEMTYPE   REGCLASS
#   ====               =======        ===  ==       === ==== ===== ===== ======   ========   ========
0   MEM0           (see below)   EXPLICIT   R         S   48     6     1     48     STRUCT    INVALID
1   REG0             REG0=GDTR SUPPRESSED   W    PSEUDO    0     0     1      0        INT     PSEUDO
Memory Operands
  0    read SEG= DS BASE= EAX/GPR DISPLACEMENT_BYTES= 1 0x0000000000000010 base10=16 ASZ0=32
  MemopBytes = 6
ATTRIBUTES: NOTSX SCALABLE 
ISA SET: [I286REAL]
```

While it might not be easy to represent this operand as a struct, an acceptable compromise is to represent it as a 6-byte integer. However, when we check the generated input/output types of the intrinsic, we see something different (a width of 10 rather than 6 for LGDT_MEMs):

```cpp
case INTRINSIC_XED_IFORM_LGDT_MEMs64:
	return X86CommonArchitecture::cached_input_types[43];
case INTRINSIC_XED_IFORM_LGDT_MEMs:
	return X86CommonArchitecture::cached_input_types[18];

X86CommonArchitecture::cached_input_types[18] = vector<NameAndType> { NameAndType(Type::IntegerType(10, false)) };
X86CommonArchitecture::cached_input_types[43] = vector<NameAndType> { NameAndType(Type::IntegerType(8, false)) };
```
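
To make the mismatch concrete, here is a minimal sketch of the kind of check that would flag it. It assumes the first argument of Type::IntegerType is a width in bytes. `IformWidthCheck`, `bits_to_bytes`, and `find_mismatches` are hypothetical names used only for illustration; they are not part of xed or the Binary Ninja API.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record: what xed reports for an iform's memory operand,
// next to the integer width the code generator emitted for it.
struct IformWidthCheck {
    std::string iform;            // iform name as printed by xed
    unsigned xed_bits;            // BITS column of the xed operand table
    std::size_t generated_bytes;  // width passed to Type::IntegerType (assumed bytes)
};

// Round a bit width up to whole bytes, e.g. 48 bits -> 6 bytes.
constexpr std::size_t bits_to_bytes(unsigned bits) {
    return (bits + 7) / 8;
}

// Return every entry whose generated width disagrees with xed's width.
std::vector<IformWidthCheck> find_mismatches(const std::vector<IformWidthCheck>& checks) {
    std::vector<IformWidthCheck> mismatches;
    for (const auto& c : checks) {
        if (bits_to_bytes(c.xed_bits) != c.generated_bytes) {
            mismatches.push_back(c);
        }
    }
    return mismatches;
}
```

Feeding it the single entry taken from the output above, `{"LGDT_MEMs", 48, 10}`, returns that entry as a mismatch, since xed's 48 bits correspond to 6 bytes while the generated type is 10 bytes wide.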

This is not easy to fix. Currently we hook into xed's code generation process to dump the relevant info (see https://github.com/Vector35/binaryninja-api/blob/dev/arch/x86/code_generator/README.md for details). This works for most instructions. However, for certain instructions, the required info does not seem to be available during generation (although it is available when we run the compiled binary, as shown above). I am not sure whether I missed something or whether my implementation is flawed.

Ideally, we would use xed's assembling capability to assemble one instruction for each iform, then decode it and harvest the input/output type info. However, this route is not viable either: xed's assembler is centered around the iclass (one iclass covers several iforms), so there is no easy way to get one instruction for each iform.

Apr 10 '24 12:04 xusheng6