Short answer: because not all SSE and SSE2 instructions generate the #UD exception, which emulation depends on.
Keywords: SSE emulation, SSE2 emulation
$Id: sse-emulate.html,v 1.4 2016-03-28 17:39:30+09 kabe Exp $ (2016/03)
Note to hackers: the assembly on this page is written in AT&T order. The destination is written on the right side, conforming to the standard UNIX assembler.
An IA-32 CPU generates an exception, namely the #UD (undefined opcode) exception, when it encounters an unknown instruction while running a binary program. This exception boils down to the SIGILL signal in UNIX-like operating systems.
The idea of emulating an instruction is to hook this exception, do the right thing in the exception handler, and return to the main program. So it is essential that the CPU generates an exception on every instruction we want to emulate.
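The hook itself can be sketched in C as follows. This is only a minimal illustration under assumptions, not an emulator: the names `on_sigill` and `run_with_sigill_hook` are made up for this page, the trapping instruction is `ud2` (guaranteed undefined on x86) with `raise(SIGILL)` as a stand-in elsewhere, and the handler resumes via `siglongjmp` for portability. A real emulator would instead decode the faulting instruction, emulate it, and advance the saved instruction pointer in the signal context.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf resume_point;
static volatile sig_atomic_t trap_count = 0;

/* SIGILL handler: this is where an emulator would decode and emulate
 * the faulting instruction. Here we only count the trap and resume. */
static void on_sigill(int sig)
{
    (void)sig;
    trap_count++;
    siglongjmp(resume_point, 1);
}

/* Install the hook, execute one undefined instruction, and return how
 * many times the handler ran. (Hypothetical helper for illustration.) */
int run_with_sigill_hook(void)
{
    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigill;
    sigemptyset(&sa.sa_mask);
    int rc = sigaction(SIGILL, &sa, &old);
    assert(rc == 0);

    trap_count = 0;
    if (sigsetjmp(resume_point, 1) == 0) {
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
        __asm__ volatile ("ud2");   /* always raises #UD on x86 */
#else
        raise(SIGILL);              /* stand-in on other platforms */
#endif
    }
    sigaction(SIGILL, &old, NULL);  /* restore the previous handler */
    return trap_count;
}
```

The whole scheme only works when the handler is actually entered; an instruction that executes silently with the wrong meaning never reaches `on_sigill`.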
Most of the SSE and SSE2 instructions do generate the #UD exception, as noted in Intel's manual:
Protected Mode Exceptions
- #UD
- If an unmasked SIMD floating-point exception and CR4.OSXMMEXCPT[bit 10] = 0.
If CR0.EM[bit 2] = 1.
If CR4.OSFXSR[bit 9] = 0.
If CPUID.01H:EDX.SSE2[bit 26] = 0.
This means that, if the CPU does not have the sse2 flag (in Linux, the flags: line of /proc/cpuinfo), the instruction will generate a #UD exception.
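As a small illustration of the flag check, the following C sketch tests whether a flag appears as a whole word in a `flags:` line copied from /proc/cpuinfo. The function name `cpuinfo_line_has_flag` is hypothetical, and reading the file itself is left out; the point is only that the match must be on whole words, so that `ssse3` is not mistaken for `sse`.

```c
#include <assert.h>
#include <string.h>

/* Return 1 if `flag` appears as a whole whitespace-separated word in
 * `line` (one "flags: ..." line from /proc/cpuinfo), else 0. */
int cpuinfo_line_has_flag(const char *line, const char *flag)
{
    size_t n = strlen(flag);
    assert(n > 0);
    const char *p = line;
    while ((p = strstr(p, flag)) != NULL) {
        int starts = (p == line) || p[-1] == ' ' || p[-1] == '\t';
        int ends = p[n] == '\0' || p[n] == ' ' || p[n] == '\n';
        if (starts && ends)
            return 1;
        p += 1;   /* partial match, keep searching */
    }
    return 0;
}
```

For example, `cpuinfo_line_has_flag(line, "sse2")` on a Pentium III style line that lists `mmx fxsr sse` returns 0.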
Note that the mention of generating the exception on CPUs without SSE capability was written after SSE was invented; thus pre-SSE CPUs may not follow it, and how they respond to the unknown instruction is completely "implementation dependent".
For example, the SSE2 instruction
66 0f ef c0 pxor %xmm0,%xmm0 ;in C, xmm0^=xmm0 (=0)
is described as
- #UD
- If CR0.EM[bit 2] = 1.
(128-bit operations only) If CR4.OSFXSR[bit 9] = 0.
(128-bit operations only) If CPUID.01H:EDX.SSE2[bit 26] = 0.
so it should generate an exception.
But most pre-SSE, MMX-capable CPUs do not know this spec, and treat it as
66 (long prefix) 0f ef c0 pxor %mm0,%mm0 ;mm0^=mm0 (=0)
and do not raise an exception. What the CPU did was dead wrong, and we had no way to catch it.
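The difference between the two readings can be made concrete with a toy decoder in C. This is a simplified model written for this page, not a description of any real CPU's decoder: `decode_pxor`, `struct decoded`, and the `cpu_knows_sse2` switch are all hypothetical, and only the register-direct form `[66] 0f ef /r` is handled.

```c
#include <assert.h>
#include <stddef.h>

enum reg_file { RF_UNKNOWN, RF_MM, RF_XMM };

struct decoded {
    enum reg_file file;  /* register file the CPU ends up operating on */
    int dst;             /* ModRM.reg field: destination register */
    int src;             /* ModRM.rm field: source register (mod = 11) */
};

/* Decode "[66] 0f ef /r" the way the page describes: an SSE2-aware CPU
 * lets the 66 prefix select XMM registers, while an MMX-only CPU
 * ignores the prefix and executes the MMX form. */
struct decoded decode_pxor(const unsigned char *code, size_t len,
                           int cpu_knows_sse2)
{
    struct decoded d = { RF_UNKNOWN, -1, -1 };
    size_t i = 0;
    int has_66 = 0;

    assert(code != NULL);
    if (i < len && code[i] == 0x66) { has_66 = 1; i++; }
    if (len - i < 3 || code[i] != 0x0f || code[i + 1] != 0xef)
        return d;                      /* not a pxor opcode */

    unsigned char modrm = code[i + 2];
    if ((modrm >> 6) != 3)
        return d;                      /* memory operands not modelled */

    d.dst = (modrm >> 3) & 7;
    d.src = modrm & 7;
    d.file = (has_66 && cpu_knows_sse2) ? RF_XMM : RF_MM;
    return d;
}
```

With the bytes `66 0f ef c0`, the model yields XMM registers when `cpu_knows_sse2` is 1 and MMX registers when it is 0, and in neither case is there a trap for an emulator to hook.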
Since SSE/SSE2 instructions appear in batches, there is a chance to emulate them correctly: if the first SSE instruction generates an exception, the handler can emulate the following SSE instructions as well, including the problematic ones.
But the pxor mentioned in the example is, in most cases, the very first instruction in an SSE code region, used to zero out a register.
And we cannot catch it.
For some instructions, the manual says that the instruction operates on MMX registers instead of SSE registers. For example,
66 0f fd c1 paddw %xmm1,%xmm0 ;add 8 pairs of 16-bit words
is described in the manual as
- #UD
- 128-bit operations will generate #UD only if CR4.OSFXSR[bit 9] = 0. Execution of 128-bit instructions on a non-SSE2 capable processor (one that is MMX technology capable) will result in the instruction operating on the mm registers, not #UD.
So, by specification, non-SSE2 CPUs should treat this instruction as
66 (long prefix) 0f fd c1 paddw %mm1,%mm0 ;add 4 pairs of 16-bit words
which is also not what we want.
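To show what "8 pairs" versus "4 pairs" means in practice, here is a small C model of the lane-wise add. The helper `paddw_lanes` is hypothetical and only models the arithmetic (wrapping 16-bit adds). On real hardware the MMX form does not even touch the xmm registers; it works on mm0/mm1, which are separate registers aliased onto the x87 state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Lane-wise wrapping add of 16-bit words: dst[i] += src[i].
 * lanes = 8 models the 128-bit SSE2 form, lanes = 4 the 64-bit MMX form. */
void paddw_lanes(uint16_t *dst, const uint16_t *src, size_t lanes)
{
    assert(lanes == 4 || lanes == 8);
    for (size_t i = 0; i < lanes; i++)
        dst[i] = (uint16_t)(dst[i] + src[i]);
}
```

Called with 4 lanes on an 8-word buffer, the upper four words stay unchanged, so a program expecting the 128-bit result would continue with half of its data not added.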
CPUs with SSE but without SSE2 (Pentium III, Athlon XP) have a similar problem. This does not seem to be widely known, so questions asking for an SSE2 emulator for SSE-only processors get no answers (other than "upgrade your computer").
Wrongly operating on MMX registers has a severe side effect: the x87 FPU state is destroyed.
It is safe to mix SSE/SSE2 instructions and x87 FPU instructions, since they use independent registers and state. Some programs are known to do this.
But MMX registers are aliased to x87 FPU registers, so you cannot mix MMX operations and FPU operations. The Intel manual says:
9.6.2 Transitions Between x87 FPU and MMX Code
When an MMX instruction (other than the EMMS instruction) is executed, the processor changes the x87 FPU state as follows:
- The TOS (top of stack) value of the x87 FPU status word is set to 0.
- The entire x87 FPU tag word is set to the valid state (00B in all tag fields).
- When an MMX instruction writes to an MMX register, it writes ones (11B) to the exponent part of the corresponding floating-point register (bits 64 through 79).
The net result of these actions is that any x87 FPU state prior to the execution of the MMX instruction is essentially lost.
The result is that a non-SSE CPU wrongly operating on MMX registers destroys not only the MMX registers but also the state of the FPU stack machine.
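The three state changes quoted above can be written down as a small C model. The struct layout and the names `x87_model` and `mmx_write` are invented for this illustration and do not correspond to any real save-area format; the code only mirrors the three bullet points from the manual.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified picture of the x87 state that MMX aliases onto. */
struct x87_model {
    unsigned top;          /* TOS field of the status word (0..7) */
    uint16_t tag_word;     /* 2 bits per register; 00b = valid, 11b = empty */
    uint64_t mantissa[8];  /* bits 0..63 of each register, aliased by mm0..mm7 */
    uint16_t exponent[8];  /* bits 64..79 of each register */
};

/* Effect of an MMX instruction that writes `value` to register `mm`. */
void mmx_write(struct x87_model *s, int mm, uint64_t value)
{
    assert(mm >= 0 && mm < 8);
    s->top = 0;               /* TOS is set to 0 */
    s->tag_word = 0x0000;     /* every tag becomes 00b (valid) */
    s->mantissa[mm] = value;  /* the 64-bit MMX result */
    s->exponent[mm] = 0xffff; /* bits 64..79 are filled with ones */
}
```

If a program had, say, `top = 5` and several empty stack slots before the stray MMX write, all of that bookkeeping is gone afterwards, which is why later x87 instructions misbehave.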
movq
There is one instruction that causes a severe problem: the movq xmm/m64,%xmmn SSE2 instruction.
f3 0f 7e 02 movq (%edx),%xmm0 ; xmm0 = *(u64*)edx
This instruction frequently appears as the first instruction when copying a block of data in 8-byte chunks.
In the manual, this instruction is described as operating on MMX registers if the CPU is not SSE2-capable.
But in reality, most non-SSE2 CPUs handle this as
f3 (prefix) 0f 7e 02 movd %mm0,(%edx) ; *(u32*)edx = mm0
Note the direction of assignment. The actual handling clobbers the memory pointed to by edx, instead of reading from it.
No exception is generated.
At least the AMD K6-2, Cyrix M II, and Intel Pentium III (SSE-capable but no SSE2) act this way.
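The reversed direction is easy to see in a toy C model of the two interpretations. Everything here (`struct toy_cpu`, `exec_f3_0f_7e_02`, the `cpu_knows_sse2` switch) is hypothetical and models only the data movement described above, with `mem` standing for the memory that edx points to.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct toy_cpu {
    uint64_t mm0;       /* MMX register mm0 */
    uint64_t xmm0_low;  /* low 64 bits of xmm0 */
};

/* Model the byte sequence f3 0f 7e 02 with %edx pointing at `mem`
 * (which must be at least 8 bytes long). */
void exec_f3_0f_7e_02(struct toy_cpu *cpu, unsigned char *mem,
                      int cpu_knows_sse2)
{
    assert(cpu != NULL && mem != NULL);
    if (cpu_knows_sse2) {
        /* movq (%edx),%xmm0: read 8 bytes, memory is untouched */
        memcpy(&cpu->xmm0_low, mem, 8);
    } else {
        /* f3 ignored; 0f 7e /r is movd %mm0,(%edx): WRITE 4 bytes */
        uint32_t low = (uint32_t)cpu->mm0;
        memcpy(mem, &low, 4);
    }
}
```

In the second branch the source buffer of the copy loop is overwritten with whatever happened to be in mm0, so the damage is done before any emulator could get involved.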
The fallback operation of old CPUs does not fall on the safe side.
The spec says to do
0f 6f 02 movq (%edx),%mm0
which has a different binary opcode. I guess the Intel guys discussed at length where to cram this instruction into the already-crowded x86 opcode space.
If the CPU does not even know MMX, it should generate an exception on all MMX/SSE/SSE2 instructions. And it does, so emulation is possible. Unfortunately this means only the plain Pentium and the i486DX could run such emulation.
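The conclusion of this page can be summarized as a small decision function in C. It takes the EDX value of CPUID leaf 1, where bit 23 is MMX and bit 26 is SSE2; how that value is obtained is left to the caller (pass 0 for a CPU with no CPUID instruction). The enum and function names are made up for this sketch, and the verdicts simply restate the argument above rather than any tested behaviour of specific CPUs.

```c
#include <stdint.h>

enum emu_verdict {
    EMU_NOT_NEEDED,   /* CPU executes SSE2 natively */
    EMU_FEASIBLE,     /* no MMX: every MMX/SSE/SSE2 opcode should trap */
    EMU_UNRELIABLE    /* MMX without SSE2: some opcodes run silently wrong */
};

/* Classify a CPU from CPUID.01H:EDX (bit 23 = MMX, bit 26 = SSE2). */
enum emu_verdict sse2_emulation_verdict(uint32_t cpuid1_edx)
{
    int has_mmx = (cpuid1_edx >> 23) & 1;
    int has_sse2 = (cpuid1_edx >> 26) & 1;

    if (has_sse2)
        return EMU_NOT_NEEDED;
    if (has_mmx)
        return EMU_UNRELIABLE;
    return EMU_FEASIBLE;
}
```

A Pentium III (MMX and SSE set, SSE2 clear) lands in `EMU_UNRELIABLE`, the same bucket as an MMX-only K6-2.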