I have some code where I have a value x=2^n, and I wanted to see what that n was, knowing that there’s only one 1 in the binary representation, and it’s not a huge number. I know it’s not hard, but I was wondering if there was a better way, so I looked around. I ran into a post from someone (student?) who had to compute the same thing without branching, and that got me thinking.
I’ve now written this two times and found a few more ways, unrolled one of them by hand to see how much that helped (some, but not a whole lot), but here are the main four contenders: the naive way, the log-trick way, inline assembly language, and using DeBruijn sequence:
Code: (all were set inside an int func(int) wrapper)
- while((x>>=1)>1) y++;
- y=log(x)/log(2);
- asm(“fld1; fld x; fyl2x; fstp y;”) (wrapped with type conversions from integer to float and back)
- y=myDeBruijnArray[(unsigned int)(x*0x077CB531U) >> 27]; (where “myDeBruijnArray” is a lookup table 32 32-bit numbers)
Times: (for 10,000,000 iterations)
- 0.602, our base case for 16 bits. Counting 8 bits ran almost twice as fast, as one would expect, and I’d imagine that counting out 32 bits would take even longer.
- 2.098, which makes sense because we’re doing two function call to math libraries, but it met the condition of “no branching” that the one student had.
- 0.521 or 0.715 (see below), which gets us out of our conditional, but at the cost of using the FPU which has a suprising amount of overhead, and again, the number of bits makes no appreciable difference.
- 0.283, which has an “expensive” imul opcode, but that’s it.
One interesting thing about the assembly language:
I expected the function that was all inline assembly language and some space for variable to result in the smallest code, and it wasn’t. There were 14 more extra assembly language operations to convert between the unsigned int value I had to start with and the floating point values I needed for the assembly language and back again. The DeBruijn method used the standard headers and footers for the command that all of the versions used, and then four lines:
mov eax,dword ptr [x] imul eax, eax,77CB531h shr eax, 1Bh mov eax, dword ptr myDeBruijnArray [eax*4]
Because of conversion time, I wanted to try doing the inline assembly code, but instead of calling the routine with the integers I was using elsewhere, I changed from an integer to a floating point before making the call to the routine, leaving the assembly language version strictly in floating point numbers, and it turns out that using assembly language with all floating-point values saved enough time for assembly language to be a hair faster than the naive approach, and of course if I had to use floating points anyways, it’s a *lot* faster than calling the math libraries. It turns out that I could even drop the last “store my FPU register in Y” and leave it on the top of the stack for the return value, which is what “return y” does anyways, so now that function looks like this in assembly language,
fld1 fld dword ptr [x] fyl2x
They take longer to run than the op codes we needed for the DeBruijn code, but that code *only* works for integers, and knowing that there’s a log2 function in the processor that’s not in the standard library cuts my time by 75%, which if I had to do a lot of it (data mining? Video processing?) could be very significant. Also, use integer variables instead of floating-point variables when you do not need the latter.