The Power of SSE

In the beginning, there was the CPU.  It supported only integer operations.  Then came the FPU, which supported floating-point operations.  For a long time, that was all we had.  Then came MMX (which is commonly said to stand for MultiMedia eXtensions, but actually Intel won't say), which went back to integer operations (and interfered with the FPU), but these integer operations were useful for 3D applications.  Then came SSE (Streaming SIMD Extensions).  SSE allowed both integer and floating-point operations, independently of the FPU.  SSE has undergone multiple revisions: SSE, SSE2, SSE3, SSSE3 (Supplemental SSE3), SSE4.1, and now SSE4.2.

SSE4.2 has been available for a while, and even Bochs has supported it for more than seven months.  I knew from talks with Intel what was going to be included in SSE4.2, and now I've finally had a look at the finished product.
Two instructions are particularly interesting: CRC32 and PCMPxSTRx. 
CRC32 in one instruction!  Compare that to six instructions with a loop.  The performance increase could be significant.  The instruction uses the CRC32/4 (also known as CRC32C) polynomial (0x11EDC6F41).  As a result, anyone who was using the CRC-32-IEEE 802.3 polynomial (0x04C11DB7) will need to recalculate their values, but that's easy.  Maybe we will finally see people using true 32-bit CRC routines, instead of 16-bit routines that were copied and pasted by people who didn't understand what was going on.
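For comparison, here is a minimal sketch of the loop that the instruction replaces, written in portable C.  It assumes the usual bit-reflected form of the polynomial (0x82F63B78) and the conventional CRC-32C framing (initial value of all ones, final inversion).  The names crc32c_step_u8 and crc32c_sw are hypothetical, chosen only for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-reflected form of the CRC32C polynomial 0x11EDC6F41
   (the top bit is implicit). */
#define CRC32C_POLY_REFLECTED 0x82F63B78u

/* Hypothetical helper: folds one byte into the accumulator.  This models
   the work that the byte form of the CRC32 instruction does in one step. */
static uint32_t crc32c_step_u8(uint32_t crc, uint8_t byte)
{
    crc ^= byte;
    for (int bit = 0; bit < 8; bit++) {
        /* mask is all ones when the low bit is set, otherwise zero */
        uint32_t mask = 0u - (crc & 1u);
        crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED & mask);
    }
    return crc;
}

/* Hypothetical wrapper: conventional CRC-32C over a buffer, with the
   usual all-ones initial value and final inversion. */
static uint32_t crc32c_sw(const void *data, size_t len)
{
    const uint8_t *p = (const uint8_t *)data;
    uint32_t crc = 0xFFFFFFFFu;
    while (len--)
        crc = crc32c_step_u8(crc, *p++);
    return crc ^ 0xFFFFFFFFu;
}
```

With this framing, the nine ASCII bytes "123456789" should produce 0xE3069283, which is the commonly published check value for CRC-32C.  On a CPU with SSE4.2, the inner loop of crc32c_step_u8 is what collapses into a single instruction.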
Of course, the downside is that the VX guys will probably be able to take advantage of it before the AV guys can.  That's not as bad as it sounds - a virus that relies solely on that instruction will run only on that CPU and nothing earlier.  Great, limited audience!  On the other hand, a virus that carries dual-case code (like Win32/Legacy did) will defeat the purpose of having the modern code, since it would require the use of CPUID (which we can fake, and thus force whichever code path we want).  So a virus doesn't gain an advantage that way, except as a possible and short-lived anti-emulation tactic, if it contains a bug in the CPUID code (like Win32/Legacy did) and always chooses the modern path.
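To make the CPUID point concrete, here is a rough sketch of what dual-case selection boils down to.  CPUID leaf 1 reports SSE4.2 support in bit 20 of ECX; everything else here (the select_path function and the code_path type) is hypothetical and exists only for illustration.  The selector takes the ECX value as a parameter, which is the whole point: whoever supplies that value, whether real hardware or an emulator, decides which path runs.

```c
#include <stdint.h>

/* CPUID leaf 1 reports SSE4.2 support in bit 20 of ECX. */
#define CPUID_1_ECX_SSE42 (1u << 20)

/* Hypothetical labels for the two code paths of dual-case code. */
typedef enum { PATH_LEGACY, PATH_SSE42 } code_path;

/* Hypothetical selector: takes the ECX value that CPUID leaf 1 returned
   (or that an emulator chose to return) and picks a code path. */
static code_path select_path(uint32_t cpuid_1_ecx)
{
    return (cpuid_1_ecx & CPUID_1_ECX_SSE42) ? PATH_SSE42 : PATH_LEGACY;
}
```

An emulator that clears bit 20 forces the legacy path, and one that sets it forces the modern path, regardless of what the host CPU actually supports.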
AV engines, on the other hand, have to run on all kinds of CPUs, so it will be a long time before the lowest commonly-supported platform is one that carries that instruction.  What we have right now works just fine.  We don't need to use this instruction, though we might choose to take advantage of it if it is available.
How about PCMPxSTRx?  Imagine a single instruction that can perform strchr, strcmp, strstr, or strtok, up to 16 bytes at a time, without repeats or loops.  That certainly beats REP CMPSB or the 32-bit masking method.  In fact, PCMPxSTRx is so powerful and complex, one might think that Intel engineers had to work overtime to find ways to return all of the results.  The registers have status bits returned in them, and the EFLAGS bits are overloaded, too.
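As a rough illustration of just one of those modes, here is a scalar model in portable C of the "equal any" comparison with implicit (NUL-terminated) lengths, which is the building block for strchr- and strtok-style searches.  The function name equal_any_index16 is hypothetical.  The real instruction performs all of the byte comparisons at once and returns the index in ECX; the nested loops below only imitate the result.

```c
#include <stddef.h>

/* Hypothetical scalar model of the "equal any" mode with implicit
   (NUL-terminated) lengths: returns the index of the first byte in the
   16-byte chunk that matches any byte of the set, or 16 if none does.
   Both inputs are treated as ending at the first NUL or at 16 bytes,
   whichever comes first. */
static int equal_any_index16(const char set[16], const char chunk[16])
{
    for (int i = 0; i < 16 && chunk[i] != '\0'; i++) {
        for (int j = 0; j < 16 && set[j] != '\0'; j++) {
            if (chunk[i] == set[j])
                return i;
        }
    }
    return 16;
}
```

For example, searching the chunk "rhythm and blue" for any byte of the set "aeiou" gives index 7, and a chunk with no match gives 16.  Compilers that support SSE4.2 expose the real thing through intrinsics such as _mm_cmpistri in nmmintrin.h, where the comparison mode is selected by an immediate operand.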
What's next?  How about a PIGFARMER instruction for Leander fans?  Or, given enough inputs, perhaps an instruction that can solve Sudoku, or something important like predicting the weather?  Some people might say that the importance of those should be reversed. 😉
- Peter Ferrie
