Why MMX CPUs cannot emulate SSE

Short answer: because not every SSE and SSE2 instruction raises the #UD exception on such CPUs, and emulation depends on that exception.

Keywords: SSE emulation, SSE2 emulation

$Id: sse-emulate.html,v 1.4 2016-03-28 17:39:30+09 kabe Exp $ (2016/03)


Note to hackers: the assembly on this page is written in AT&T order. The destination is written on the right side, conforming to the standard UNIX assembler.

Long Answer

#UD Exception

An IA-32 CPU generates an exception, namely the #UD (undefined opcode) exception, when it encounters an unknown instruction while running a binary program. On UNIX-like operating systems this exception is delivered to the program as the SIGILL signal.

The idea of emulating an instruction is to hook this exception, do the right thing in the exception handler, and return to the main program. So it is essential that the CPU raises the exception on every instruction we want to emulate.
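The hook-and-return flow can be sketched in portable C. This is only an illustration, not a working emulator: raise(SIGILL) stands in for a real unknown instruction, and the names run_with_sigill_hook, fake_unknown_instruction and plain_code are made up for this page. A real emulator would instead decode the bytes at the faulting address and advance the saved instruction pointer, which is platform-specific and not shown here.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf resume_point;               /* where to continue after the trap */
static volatile sig_atomic_t sigill_seen = 0; /* set by the handler */

/* SIGILL handler: record the trap and jump back to the caller. */
static void on_sigill(int signo)
{
    (void)signo;
    sigill_seen = 1;
    siglongjmp(resume_point, 1);
}

/* Run body() with SIGILL hooked (hypothetical helper).
 * Returns 1 if a SIGILL was caught, 0 if not, -1 on setup error. */
int run_with_sigill_hook(void (*body)(void))
{
    struct sigaction sa, old;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigill;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGILL, &sa, &old) != 0)
        return -1;

    sigill_seen = 0;
    if (sigsetjmp(resume_point, 1) == 0)
        body();                       /* may trap */

    sigaction(SIGILL, &old, NULL);    /* restore the previous handler */
    return sigill_seen;
}

/* Stand-in for code containing an instruction the CPU does not know. */
void fake_unknown_instruction(void) { raise(SIGILL); }

/* Stand-in for code the CPU executes without complaint. */
void plain_code(void) { }
```

The second stand-in is the interesting one for this page: if the CPU silently executes something instead of trapping, the handler never runs and there is nothing to emulate.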

Most of the SSE and SSE2 instructions do generate the #UD exception, as noted in Intel's manual:

Protected Mode Exceptions
#UD
If an unmasked SIMD floating-point exception and CR4.OSXMMEXCPT[bit 10] = 0.
If CR0.EM[bit 2] = 1.
If CR4.OSFXSR[bit 9] = 0.
If CPUID.01H:EDX.SSE2[bit 26] = 0.
This means that if the CPU does not have the sse2 flag (in Linux, shown in the flags: line of /proc/cpuinfo), the instruction will generate a #UD exception.
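As a small aid, here is a sketch of checking one flags: line for a given flag. The function name cpuinfo_line_has_flag is hypothetical, and the sketch assumes the line has already been read from /proc/cpuinfo; it only does the whole-word match, so that "sse" is not mistaken for "sse2" or "ssse3".

```c
#include <string.h>

/* Return 1 if `flag` appears as a whole word in `line`, else 0.
 * `line` is expected to look like "flags : fpu vme ... mmx sse sse2". */
int cpuinfo_line_has_flag(const char *line, const char *flag)
{
    size_t n = strlen(flag);
    const char *p = line;

    if (n == 0)
        return 0;
    while ((p = strstr(p, flag)) != NULL) {
        /* the match must be delimited on both sides */
        int starts = (p == line) || p[-1] == ' ' || p[-1] == '\t' || p[-1] == ':';
        int ends = p[n] == '\0' || p[n] == ' ' || p[n] == '\t' || p[n] == '\n';
        if (starts && ends)
            return 1;
        p++;    /* keep searching after a partial match */
    }
    return 0;
}
```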

Forward compatibility problem

Note that the rule about raising the exception when SSE capability is absent was written after SSE was invented. Pre-SSE CPUs therefore may not follow it, and how they respond to an unknown instruction is completely "implementation dependent".

For example, an SSE2 instruction

	66 0f ef c0		pxor   %xmm0,%xmm0	;in C, xmm0^=xmm0 (=0)

is described as
#UD
If CR0.EM[bit 2] = 1.
(128-bit operations only) If CR4.OSFXSR[bit 9] = 0.
(128-bit operations only) If CPUID.01H:EDX.SSE2[bit 26] = 0.
so it should raise an exception.

But most pre-SSE, MMX-capable CPUs do not know this spec, and treat it as

	66			(operand-size prefix)
	0f ef c0		pxor   %mm0,%mm0	;mm0^=mm0 (=0)
and do not raise an exception. What the CPU did is dead wrong, and we have no way to catch it.
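The two readings of the same four bytes can be shown with a toy decoder. This is a simplified model of the behavior described above, not how any real CPU decodes instructions; decode_pxor and the enum names are invented for illustration, and only the optional 66 prefix followed by 0f ef is recognized.

```c
#include <stddef.h>

enum decoded {
    DEC_OTHER,      /* not a pxor this toy model knows */
    DEC_MMX_PXOR,   /* pxor on the 64-bit mm registers */
    DEC_SSE2_PXOR   /* pxor on the 128-bit xmm registers */
};

/* Classify `code` the way an SSE2-aware CPU (cpu_knows_sse2 != 0)
 * or an MMX-only CPU (cpu_knows_sse2 == 0) is described to see it. */
enum decoded decode_pxor(const unsigned char *code, size_t len,
                         int cpu_knows_sse2)
{
    /* at most one leading 66 prefix is considered */
    size_t p = (len > 0 && code[0] == 0x66) ? 1 : 0;

    if (len < p + 3 || code[p] != 0x0F || code[p + 1] != 0xEF)
        return DEC_OTHER;
    if (p == 1 && cpu_knows_sse2)
        return DEC_SSE2_PXOR;   /* 66 selects the xmm form */
    return DEC_MMX_PXOR;        /* the prefix changes nothing for this CPU */
}
```

Note that the MMX-only branch has no "raise #UD" outcome at all for these bytes, which is exactly the problem.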

Since SSE/SSE2 instructions tend to appear in batches, there would be a chance to emulate them correctly if the first SSE instruction of a batch raised an exception: the handler could then emulate the following SSE instructions as well, including the problematic ones.

But the pxor in the example above is very often the first instruction in a stretch of SSE code, used to zero out a register. And we can't capture it.

Explicit forward compatibility problem

Some instructions are explicitly described by the manual as operating on MMX registers instead of SSE registers on older CPUs. For example,

	66 0f fd c1		paddw  %xmm1,%xmm0	;add 8 pairs of 16-bit words
is described in the manual as
#UD
128-bit operations will generate #UD only if CR4.OSFXSR[bit 9] = 0. Execution of 128-bit instructions on a non-SSE2 capable processor (one that is MMX technology capable) will result in the instruction operating on the mm registers, not #UD.
So, by specification, non-SSE2 CPUs should treat this instruction as
	66			(operand-size prefix)
	0f fd c1		paddw  %mm1,%mm0	;add 4 pairs of 16-bit words
which is also not what we want.
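To make the difference concrete, here is a small C model of both interpretations. The struct regs layout and the function names are assumptions made for this sketch; on a real MMX-only CPU the xmm registers do not exist at all, so the xmm fields stand for the values the program expects to be updated.

```c
#include <stdint.h>

/* Toy register file: registers are held as arrays of 16-bit words. */
struct regs {
    uint16_t xmm0[8], xmm1[8];  /* 128-bit SSE registers, 8 words each */
    uint16_t mm0[4], mm1[4];    /* 64-bit MMX registers, 4 words each */
};

/* 66 0f fd c1 on an SSE2 CPU: xmm0 += xmm1, 8 words, wrapping. */
void paddw_as_sse2(struct regs *r)
{
    for (int i = 0; i < 8; i++)
        r->xmm0[i] = (uint16_t)(r->xmm0[i] + r->xmm1[i]);
}

/* The same bytes on an MMX-only CPU: mm0 += mm1, 4 words, wrapping.
 * xmm0 is not touched, so the program's expected result never appears. */
void paddw_as_mmx(struct regs *r)
{
    for (int i = 0; i < 4; i++)
        r->mm0[i] = (uint16_t)(r->mm0[i] + r->mm1[i]);
}
```

The MMX reading does not even compute half of the intended sum: it works on a different set of registers altogether.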

CPUs with SSE but without SSE2 (Pentium III, Athlon XP) have a similar problem. This does not seem to be widely known, so questions asking for an SSE2 emulator for SSE-only processors get no useful replies (other than "upgrade your computer").

Interference with x87 Floating Point Unit

Wrongly operating on MMX registers has a severe side effect: the x87 FPU state is destroyed.

It is safe to mix SSE/SSE2 instructions and x87 FPU instructions, since they use independent registers and state. Some programs are known to do this.

But MMX registers are aliased to the x87 FPU registers, so you cannot freely mix MMX operations and FPU operations. The Intel manual says:

9.6.2 Transitions Between x87 FPU and MMX Code

When an MMX instruction (other than the EMMS instruction) is executed, the processor changes the x87 FPU state as follows: The net result of these actions is that any x87 FPU state prior to the execution of the MMX instruction is essentially lost.

The result is that a non-SSE CPU wrongly operating on MMX registers destroys not only the MMX registers, but also the state of the FPU stack machine.
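The aliasing can be modeled roughly as follows. The field layout and the name mmx_write are assumptions for illustration; the three effects in the function (top-of-stack reset to 0, whole tag word marked valid, exponent bits of the written register set to all ones) follow my reading of the manual section quoted above, so check the manual before relying on the details.

```c
#include <stdint.h>

/* Toy model of the x87 register file that MMX shares. */
struct x87_model {
    uint64_t mantissa[8];  /* bits 0..63 of R0..R7, aliased by MM0..MM7 */
    uint16_t exponent[8];  /* bits 64..79 of R0..R7 (sign and exponent) */
    uint16_t tag_word;     /* 2 bits per register: 11b empty, 00b valid */
    unsigned top;          /* top-of-stack field of the status word */
};

/* Model of an MMX instruction writing `value` to register MMn. */
void mmx_write(struct x87_model *s, int n, uint64_t value)
{
    s->mantissa[n] = value;   /* the data lands in the FPU register */
    s->exponent[n] = 0xFFFF;  /* exponent part becomes all ones */
    s->tag_word = 0x0000;     /* every register is now tagged valid */
    s->top = 0;               /* the stack pointer is reset */
}
```

So even a harmless-looking pxor %mm0,%mm0 rewrites the tag word and stack top that any surrounding x87 code depends on.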

The dreaded movq

One instruction causes a particularly severe problem: the SSE2 instruction movq xmm/m64,%xmmn.

	f3 0f 7e 02		movq   (%edx),%xmm0	; xmm0 = *(u64*)edx

This instruction frequently appears as the first instruction when copying a block of data in 8-byte chunks.

In the manual, this instruction is described as operating on MMX registers if the CPU is not SSE2-capable.

But in reality, most non-SSE2 CPUs handle this as

	f3			(REP prefix)
	0f 7e 02		movd   %mm0,(%edx)	; *(u32*)edx = mm0
Note the direction of the assignment. What actually happens is that the memory pointed to by edx is clobbered, instead of being read. No exception is generated. At least the AMD K6-2, Cyrix M II, and Intel Pentium III (SSE-capable but no SSE2) act this way.
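The reversed direction is easy to see in a small C model. Both function names are invented for this sketch, and plain memory buffers stand in for the address held in edx.

```c
#include <stdint.h>
#include <string.h>

/* f3 0f 7e 02 on an SSE2 CPU: movq (%edx),%xmm0.
 * Loads 8 bytes into the low half of xmm0 and zeroes the high half.
 * Memory is only read. */
void movq_load_sse2(uint64_t xmm0[2], const unsigned char *mem)
{
    memcpy(&xmm0[0], mem, 8);
    xmm0[1] = 0;
}

/* The same bytes as observed on non-SSE2 CPUs: movd %mm0,(%edx).
 * Stores the low 32 bits of mm0 to memory. Memory is written. */
void movd_store_mmx(uint64_t mm0, unsigned char *mem)
{
    uint32_t low = (uint32_t)mm0;
    memcpy(mem, &low, 4);
}
```

In a block copy, the source buffer itself gets its first 4 bytes overwritten with whatever happened to be in mm0, so the damage is done before any later instruction could trap.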

The fallback behavior of old CPUs does not err on the safe side.

The spec says to do

	0f 6f 02		movq   (%edx),%mm0
which has a different binary opcode. I guess the Intel engineers discussed at length where to cram this instruction into the already crowded x86 opcode space.

Non-SSE, non-MMX CPU

If the CPU does not even know MMX, it should raise an exception on all MMX/SSE/SSE2 instructions. And it does, so emulation is possible. Unfortunately, this means that only the plain Pentium and the i486DX could run such an emulation.


Conclusion


References

Intel 64 and IA-32 Architectures Software Developer Manuals
The bible for every assembly programmer.
MMX Emulation Library Project
An attempt at emulating MMX instructions in userland.

kabe.sra-tohoku.co.jp