
Boost Performance with Data Parallelism: Using SIMD Instructions from C#

05.24.2012

This is a short excerpt (with slight modifications) from Chapter 10 of Pro .NET Performance, scheduled to appear in August 2012. I might be publishing a few more of these before and after the book is out.

Theoretically, .NET developers should never be concerned with optimizations tailored to a specific processor or instruction set. After all, the purpose of IL and JIT compilation is to allow managed applications to run on any hardware that has the .NET Framework installed, and to remain indifferent to operating system bitness, processor features, and instruction sets. However, squeezing the last bits of performance out of managed applications may require reasoning at the assembly language level. At other times, understanding processor-specific features is a first step toward even more significant performance gains.

Data-level parallelism, also known as Single Instruction Multiple Data (SIMD), is a feature of modern processors that enables the execution of a single instruction on a large set of data (larger than the machine word). The de-facto standard for SIMD instruction sets is SSE (Streaming SIMD Extensions), supported by Intel processors since the Pentium III. This instruction set adds new 128-bit registers (with the XMM prefix) as well as instructions that can operate on them. Recent Intel processors introduced Advanced Vector Extensions (AVX), an extension of SSE that offers 256-bit registers and even more SIMD instructions. Examples of the operations covered by SSE instructions include:

  • Integer and floating-point arithmetic
  • Comparisons, shuffling, data type conversion (integer to floating-point)
  • Bitwise operations
  • Minimum, maximum, conditional copies, CRC32, population count (introduced in SSE4 and later)

You might be wondering whether instructions operating on these “new” registers are slower than their standard counterparts; if they were, any performance gains would be illusory. Fortunately, that is not the case. On Intel i7 processors, a scalar floating-point addition (FADD) instruction has a throughput of one instruction per cycle and a latency of 3 cycles. The equivalent ADDPS instruction, which operates on 128-bit registers and adds four pairs of floats at once, also has a throughput of one instruction per cycle and a latency of 3 cycles.

Using these instructions in high-performance loops can provide up to 8-fold performance gains compared to naïve sequential programs that operate on a single floating-point or integer value at a time: a 128-bit XMM register holds four 32-bit floats and a 256-bit YMM register holds eight, which puts the theoretical speedup at 4x and 8x, respectively. For example, consider the following (admittedly trivial) code:

//Assume that A, B, C are equal-size float arrays 
for (int i = 0; i < A.Length; ++i) { 
  C[i] = A[i] + B[i]; 
}

The standard code emitted by the JIT in this scenario is the following:

; ESI has A, EDI has B, ECX has C, EDX has i 
xor edx,edx 
cmp dword ptr [esi+4],0 
jle END_LOOP 
NEXT_ITERATION: 
fld dword ptr [esi+edx*4+8] ; load A[i], no range check 
cmp edx,dword ptr [edi+4]   ; range check accessing B[i] 
jae OUT_OF_RANGE 
fadd dword ptr [edi+edx*4+8]; add B[i] 
cmp edx,dword ptr [ecx+4]   ; range check accessing C[i] 
jae OUT_OF_RANGE 
fstp dword ptr [ecx+edx*4+8]; store into C[i] 
inc edx 
cmp dword ptr [esi+4],edx   ; are we done yet? 
jg NEXT_ITERATION 
END_LOOP:

Each loop iteration performs a single FADD instruction that adds two 32-bit floating-point numbers. However, by using 128-bit SSE instructions, four iterations of the loop can be issued at a time, as follows (the code below performs no range checks and assumes that the number of iterations is evenly divisible by 4):

xor edx, edx 
NEXT_ITERATION: 
; copy 16 bytes from B to xmm1 
movups xmm1, xmmword ptr [edi+edx*4+8] 
; copy 16 bytes from A to xmm0  
movups xmm0, xmmword ptr [esi+edx*4+8] 
; add xmm0 to xmm1 and store the result in xmm1 
addps xmm1, xmm0 
; copy 16 bytes from xmm1 to C 
movups xmmword ptr [ecx+edx*4+8], xmm1 
add edx, 4 ; increase loop index by 4 
cmp dword ptr [esi+4], edx ; are we done yet? 
jg NEXT_ITERATION

On an AVX processor, we could move even more data around in each iteration (with the 256-bit YMM* registers), for an even bigger performance improvement:

xor edx, edx 
NEXT_ITERATION: 
; copy 32 bytes from B to ymm1 
vmovups ymm1, ymmword ptr [edi+edx*4+8] 
; copy 32 bytes from A to ymm0 
vmovups ymm0, ymmword ptr [esi+edx*4+8] 
; add ymm0 to ymm1 and store the result in ymm1 
vaddps ymm1, ymm1, ymm0 
; copy 32 bytes from ymm1 to C 
vmovups ymmword ptr [ecx+edx*4+8], ymm1 
add edx, 8 ; increase loop index by 8 
cmp dword ptr [esi+4], edx ; are we done yet? 
jg NEXT_ITERATION

The SIMD instructions used in these examples are only the tip of the iceberg. Modern applications and games use SIMD instructions to perform complex operations, including scalar (dot) products, shuffling data around in registers and memory, checksum calculation, and many others. Intel’s AVX portal is a good place to learn more about what AVX has to offer.

The JIT compiler uses only a small number of SSE instructions, even though they are available on practically every processor manufactured in the last 10 years. Specifically, the JIT compiler uses the SSE MOVQ instruction to copy medium-sized structures through the XMM* registers (for large structures, REP MOVS is used instead), uses SSE2 instructions for floating-point to integer conversions, and relies on them in a few other corner cases. The JIT compiler does not auto-vectorize loops by combining iterations, as we did manually in the preceding code.

Unfortunately, C# doesn’t offer any keywords for embedding inline assembly code in your managed programs. Although you could factor out the performance-sensitive parts into a C++ module and use .NET interoperability to access it, this is often clumsy and introduces a small performance penalty every time the interoperability boundary is crossed. There are two other approaches for embedding SIMD code that do not require writing a separate native module.
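
Before turning to those, here is a rough sketch of what the C++/interop route might look like. The module name, export name, and signature below are hypothetical; the assumption is that the native side implements the vector addition with SSE intrinsics or hand-written assembly and is compiled into a DLL alongside the managed code:

using System.Runtime.InteropServices;

//Hypothetical native export (C++ side): 
//  extern "C" __declspec(dllexport) void __stdcall AddVectorsSse( 
//    float* c, const float* b, const float* a, int length); 

//The DLL and export names are made up for illustration 
[DllImport("NativeVectorOps.dll", CallingConvention = CallingConvention.StdCall)] 
static extern void AddVectorsSse( 
  float[] C, float[] B, float[] A, int length); 

//Each call crosses the managed/native boundary, so this pays off 
//only when the arrays are large enough to amortize that cost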

A brute-force way to run arbitrary machine code from a managed application (albeit with a light interoperability layer) is to dynamically emit the machine code and then call it. The Marshal.GetDelegateForFunctionPointer method is key, as it returns a managed delegate pointing to an unmanaged memory location, which may contain arbitrary code. The following code allocates virtual memory with PAGE_EXECUTE_READWRITE protection, which enables us to copy code bytes into the region and then execute them. The result, on my Intel i7-860 CPU, is a more than 2-fold improvement in execution time!

[UnmanagedFunctionPointer(CallingConvention.StdCall)] 
delegate void VectorAddDelegate( 
  float[] C, float[] B, float[] A, int length); 

[DllImport("kernel32.dll", SetLastError = true)] 
static extern IntPtr VirtualAlloc( 
  IntPtr lpAddress, UIntPtr dwSize, 
  IntPtr flAllocationType, IntPtr flProtect);

//This array of bytes has been produced from 
//the SSE assembly version – it is a complete 
//function that accepts four parameters (three 
//vectors and length) and adds the vectors 

byte[] sseAssemblyBytes = { 
  0x8b, 0x5c, 0x24, 0x10, 0x8b, 0x74, 0x24, 0x0c, 0x8b, 
  0x7c, 0x24, 0x08, 0x8b, 0x4c, 0x24, 0x04, 0x31, 0xd2, 
  0x0f, 0x10, 0x0c, 0x97, 0x0f, 0x10, 0x04, 0x96, 0x0f, 
  0x58, 0xc8, 0x0f, 0x11, 0x0c, 0x91, 0x83, 0xc2, 0x04, 
  0x39, 0xda, 0x7f, 0xea, 0xc2, 0x10, 0x00 }; 

IntPtr codeBuffer = VirtualAlloc( 
  IntPtr.Zero, new UIntPtr((uint)sseAssemblyBytes.Length), 
  0x1000 | 0x2000, //MEM_COMMIT | MEM_RESERVE 
  0x40 //PAGE_EXECUTE_READWRITE 
); 

Marshal.Copy(sseAssemblyBytes, 0, 
  codeBuffer, sseAssemblyBytes.Length); 

VectorAddDelegate addVectors = (VectorAddDelegate) 
  Marshal.GetDelegateForFunctionPointer( 
    codeBuffer, typeof(VectorAddDelegate)); 
//We can now use 'addVectors' to add vectors!
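
For example, a call might look like this (assuming, as the emitted routine requires, that the array length is divisible by 4, because it performs no range checks):

float[] A = new float[1024], B = new float[1024], C = new float[1024]; 
//...fill A and B with data... 
addVectors(C, B, A, A.Length); //C[i] = A[i] + B[i] for every i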

A completely different approach, which unfortunately isn’t available on the Microsoft CLR, is extending the JIT compiler to emit SIMD instructions. This is the approach taken by Mono.Simd. Managed code developers who use the Mono .NET runtime can reference the Mono.Simd assembly and use JIT compiler support that converts operations on types such as Vector16b or Vector4f to the appropriate SSE instructions.
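
As a rough sketch (assuming the code runs on the Mono runtime, where these operations are JIT-compiled to SSE; on other runtimes Mono.Simd falls back to plain managed code), the vector addition from earlier might be written like this:

using Mono.Simd;

//Each Vector4f packs four 32-bit floats; Mono's JIT can compile 
//the + operator below to a single ADDPS instruction 
static void AddVectors(Vector4f[] A, Vector4f[] B, Vector4f[] C) 
{ 
  for (int i = 0; i < A.Length; ++i) 
  { 
    C[i] = A[i] + B[i]; 
  } 
}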

Published at DZone with permission of Sasha Goldshtein, author and DZone MVB.