
Faster than bit shifting macros?


RbgoWeb

Programmer
Apr 10, 2002
I'm experimenting with macros that can read and write parts of basic types. Normally bit shifts and AND masks are used a lot for this, but I thought that with the help of a little structure the operands are only cast and assigned, so things may run a little faster when loads of data need processing.
Can someone tell me if this approach would have higher performance than the usual approach? Do type casts cost performance?

typedef struct _B2 { BYTE _0, _1; } B2;

//packs two bytes into a word
#define B2PK(w, b1, b0) { (((B2 *)&w)->_1 = (b1)); (((B2 *)&w)->_0 = (b0)); }

//Read the bytes of a word
#define B2R0(w) (((B2 *)&w)->_0)
#define B2R1(w) (((B2 *)&w)->_1)
//Write the bytes of a word
#define B2W0(w, b0) (((B2 *)&w)->_0 = (b0))
#define B2W1(w, b1) (((B2 *)&w)->_1 = (b1))
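
For comparison, the usual shift-and-mask equivalents would look something like this (just a sketch, assuming a 16-bit WORD; the S2* names are only illustrative and not necessarily the exact macros used in the tests further down):

//Packs two bytes into a word, the conventional way
#define S2PK(w, b1, b0) ((w) = (WORD)((((b1) & 0xFF) << 8) | ((b0) & 0xFF)))

//Read the bytes of a word
#define S2R0(w) ((BYTE)((w) & 0xFF))
#define S2R1(w) ((BYTE)(((w) >> 8) & 0xFF))
//Write the bytes of a word
#define S2W0(w, b0) ((w) = (WORD)(((w) & 0xFF00) | ((b0) & 0xFF)))
#define S2W1(w, b1) ((w) = (WORD)(((w) & 0x00FF) | (((b1) & 0xFF) << 8)))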


 
Bitwise operations on integers are extremely fast as they have a direct corresponding operation in the CPU, so I don't think you'd gain much by trying to optimize them by restructuring (but I might be wrong, it's just a feeling I have).

Either just have your macros do the shifts and &'s, or
perhaps consider using inline assembler.

Not sure if it will be a real improvement though, so the best way is to try them all and measure the performance ;-)


Some assembler code snippets:

The value is returned in the accumulator (AL/AX), so just ignore (or disable) warning C4035: no return value.

Code:
inline _int16 Byte2Word(_int8 b1, _int8 b2)
{
  __asm
  {
    mov ah, b1  ; put b1 in the high part of ax
    mov al, b2  ; put b2 in the low part of ax
  } 
}

inline _int8 Word2ByteHi(_int16 w)
{
  __asm
  {
    mov ax, w  ; ax = w
    shr ax, 8  ; ax = ax >> 8
  }
}

inline _int8 Word2ByteLo(_int16 w)
{
  __asm
  {
    mov ax, w  ; ax = w. We just return the low part anyway.
  }
}

/Per
[sub]
if (typos) cout << "My fingers are faster than my brain. Sorry for the typos.";
[/sub]
 
As usual, byte picking is slower on fast Intel processors. Byte-sized operations use the wide (word) memory access path and more processor cycles. For example, on my Pentium IV, assigning the high byte of an unsigned short via a byte-by-byte struct pointer is slower than shifting and packing with shorts. If your fields are not aligned on byte boundaries, special struct-pointer tricks are useless, but shifts still work fine. C-style struct bit fields were deprecated right from the start. Indirect access via pointers can suppress compiler optimization, too.
Moral #1: it's a very delicate area, especially with fast pipelined processors. As usual, data overlay tricks are useless, but sometimes... Verify simple code fragments with a profiler, or directly print clock() before and after 10e8 loops (in debug mode, no optimization, of course); a minimal sketch of that follows below.
Moral #2: Programming is hard enough work without playing with byte/bit picking...
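
A minimal sketch of that clock() approach (assuming a shift/mask write of the high byte of an unsigned short as the fragment under test; swap in whatever you want to compare):

Code:
#include <stdio.h>
#include <time.h>

int main(void)
{
  volatile unsigned short w = 0;  /* volatile so the compiler cannot drop the loop */
  long i;
  clock_t t0 = clock();
  for (i = 0; i < 100000000L; ++i)                            /* ~1e8 iterations */
    w = (unsigned short)((w & 0x00FF) | ((i & 0xFF) << 8));   /* shift/mask write of the high byte */
  printf("elapsed: %.3f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);
  return 0;
}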
 
Hm, not sure I agree totally.
(1) Use of a profiler is excellent advice.
(2) Use of disassembler is essential. You must know what your code is actually compiling to if you want to do this sort of stuff.
(3) Profile with care. Running the same instruction umpteen million times in a row is about all you can do in timing, but it's not representative of real code. Different instructions run at different speeds according to their context. For instance, (a) if a short instruction gets paired with a long one, the overall time to process the two, paired, instructions, is that of the long one. (b) a quick instruction may have to wait for the result of the previous instruction. In assembly coding of time-sensitive stuff it is normal to interlace instructions so that adjacent ones do not use the same registers.
(4) This is where I am not sure I agree with ArkM: something which works entirely with bytes should be quick, even on a 32 bit system, because instructions to read and write single bytes do not need any special op code prefix. Something that uses a word, at any stage in the process, will take longer, because of the size override prefix, which requires extra work of the processor. Naturally 32 bit operations don't need the prefix in 32 bit code, and are fine. So the things to avoid are words.
(5) Proper shifts do take time. But if your compiler notices you are shifting by multiples of 8 bits, it may not be encoding this with a shift instruction. But even so, shifts aren't that slow.

But having said all that, ArkM is utterly right about the profiler, and this is a fascinating area....
 
All these sample macros take a WORD (w) as the basis to work with. The structure exists only for use by the macros and works as a map inside the WORD datatype. The macros essentially overlay this map on a WORD so they can access the two smaller BYTE datatypes inside directly, that is, without shifting and masking.
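
For reference, the same overlay can also be written with a union instead of a pointer cast (just a sketch with plain unsigned types; the B2U name is only illustrative):

Code:
#include <stdio.h>

typedef union { unsigned short w; struct { unsigned char _0, _1; } b; } B2U;

int main(void)
{
  B2U v;
  v.w = 0x1234;
  v.b._1 = 0xEF;            /* write the high byte directly, no shift or mask */
  printf("0x%04X\n", v.w);  /* prints 0xEF34 on a little-endian x86 */
  return 0;
}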

My technical question is simply... Will the macro statement:

B2W1(wValue, 0xEF); //Set wValue's hi byte

translate to one single MOV instruction?
Or is there a lot more that needs to be done in the background that I don't know about?

I'd like to use an array of longs to store a hi-word and a lo-word in each element, and use macros to access them. But besides being reliable, I'd of course like the access to be as fast as it possibly can (even if the difference is only just measurable =).
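
A sketch of what that long-array idea might look like with the same overlay trick (assuming a 32-bit long as on Win32 and a little-endian x86; the W2* names are only illustrative):

Code:
#include <stdio.h>

typedef struct _W2 { unsigned short _0, _1; } W2;  /* _0 = lo word, _1 = hi word on little-endian x86 */

#define W2LO(l)  (((W2 *)&(l))->_0)
#define W2HI(l)  (((W2 *)&(l))->_1)

int main(void)
{
  unsigned long table[4] = { 0 };
  W2HI(table[2]) = 0xABCD;        /* write the hi word of element 2 directly */
  W2LO(table[2]) = 0x1234;        /* write the lo word */
  printf("0x%08lX\n", table[2]);  /* prints 0xABCD1234 */
  return 0;
}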

In the tiny sample at the beginning, the performance gain is next to nothing, but still, trading a SHL and an AND for one MOV is probably an improvement of around 50%, which could make just that little difference with large batches.

Guess I will go test/benchmark the macros soon. I will post the results in this thread.
 
>translate to one single MOV instruction?

1. Set a breakpoint.
2. Open the disassembly window.
3. Look for yourself
4. ;-)




/Per
[sub]
if (typos) cout << "My fingers are faster than my brain. Sorry for the typos.";
[/sub]
 
Here are the macro test results.

I have turned the compiler optimizations off so that useless iterations will actually be performed.
I have run the test program many times; the results are about the same.

UQUAD: unsigned __int64
CAST: the macro uses a structure typecast and accesses the arg through its "members".
BITS: the macro uses bit shifts and masking
[tt]
operation                        | method | iterations | elapsed
---------------------------------|--------|------------|------------
Packing 2 DWORDs in a UQUAD      | CAST   |  250000000 | 4.06789337
Packing 2 DWORDs in a UQUAD      | BITS   |  250000000 | 8.10378668

Packing 4 WORDs in a UQUAD       | CAST   |  125000000 | 4.81222320
Packing 4 WORDs in a UQUAD       | BITS   |  125000000 | 9.36383734

Packing 8 BYTEs in a UQUAD       | CAST   |  125000000 | 4.03345564
Packing 8 BYTEs in a UQUAD       | BITS   |  125000000 | 19.5069047

Packing 2 WORDs in a DWORD       | CAST   |  125000000 | 2.18505244
Packing 2 WORDs in a DWORD       | BITS   |  125000000 | 2.02495027

Packing 4 BYTEs in a DWORD       | CAST   |  125000000 | 2.65914014
Packing 4 BYTEs in a DWORD       | BITS   |  125000000 | 2.78501665

Packing 2 BYTEs in a WORD        | CAST   |  125000000 | 2.27806661
Packing 2 BYTEs in a WORD        | BITS   |  125000000 | 2.02658050

Reading highest DWORD in a UQUAD | CAST   |  125000000 | 1.77246778
Reading highest DWORD in a UQUAD | BITS   |  125000000 | 3.54488102

Reading highest WORD in a UQUAD  | CAST   |  125000000 | 2.27775519
Reading highest WORD in a UQUAD  | BITS   |  125000000 | 3.30999119

Reading highest BYTE in a UQUAD  | CAST   |  125000000 | 1.77351922
Reading highest BYTE in a UQUAD  | BITS   |  125000000 | 3.54328096

Reading highest WORD in a DWORD  | CAST   |  125000000 | 1.51771045
Reading highest WORD in a DWORD  | BITS   |  125000000 | 1.74939842

Reading highest BYTE in a DWORD  | CAST   |  125000000 | 1.51951114
Reading highest BYTE in a DWORD  | BITS   |  125000000 | 1.77350254

Reading highest BYTE in a WORD   | CAST   |  125000000 | 1.51797419
Reading highest BYTE in a WORD   | BITS   |  125000000 | 1.77070386
[/tt]

I have also tested that the macros return correct data with positive results.


 
I'm afraid the __int64 shifts are not fully representative. 64-bit integers are not a native data type on 32-bit Intels; the C++ compiler 'interprets' these 'shifts'. Nonetheless, it's a very interesting result.
Structure mapping was a very effective technique in old 16-bit 286/386 low-level C coding. I think it is of no such importance now. The additional address registers involved with the map pointer can overload Intel loop code (there is no large register pool), break pipelines and so on...
Well, good luck!
 
Thank you, ArkM. About the shifts that are only interpreted: does this also mean that if the PC and compiler were a 64-bit machine, there might not be such a big speed difference?

If you'd like to look at the testing code and/or test the *.exe on your machine, here it is:

In your first message:
"Moral #1: it's a very delicate area, especially with fast pipelined processors. As usual, data overlay tricks are useless, ..."
does this mean that these macros might not work on some processors? What is the best advice you can give me?
 
These values mostly make a lot of sense.
Casting should not cost anything, shifting and masking always will.
The one that makes little sense to me is why putting 8 bytes into a uquad is as quick as 2 dwords.
I think these data would be easier to interpret if you knew what they assemble into.

Note again, although I can't think of a better way to test the instructions, they may not run the same in context.

You'll notice that in general word accesses are slower than both byte and dword accesses. Obviously the type from which you read a byte doesn't really matter; so far as the code is concerned, you are reading a byte from somewhere in memory, and it is not relevant how long the data structure in memory happens to be. The only relevance would be alignment. Alignment is particularly relevant for word reads: if the word is not word-aligned, I think all Intel processors need to make two memory accesses to get both halves of the word. But someone correct me if that isn't true.
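
To illustrate the alignment point (just a sketch, not from the posts above): packing a struct to 1-byte boundaries puts a 16-bit member at an odd offset, which is exactly the misaligned-word case described.

Code:
#include <stddef.h>
#include <stdio.h>

#pragma pack(push, 1)             /* remove padding so 'value' lands at offset 1 */
typedef struct {
  unsigned char  tag;             /* offset 0 */
  unsigned short value;           /* offset 1: a misaligned 16-bit field */
} Packed;
#pragma pack(pop)

int main(void)
{
  printf("offsetof(Packed, value) = %u\n", (unsigned)offsetof(Packed, value));  /* prints 1 */
  return 0;
}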

Thanks for the interesting figures.
 
I think one of the better ways to figure out which is faster is to use a high-precision timer which will calculate the speed of the operations involved. Michael Abrash wrote about this years ago; he called it the Zen timer, I think.

At any rate, the idea is that if you perform the same calculation a few thousand or so times and write an algorithm to measure the difference between the two methods, that would seemingly be more practical than trying to count bits.
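
A minimal sketch of that idea on Windows, using the high-resolution performance counter (the time_op / op_shift names are placeholders, not from any post here):

Code:
#include <windows.h>
#include <stdio.h>

static volatile unsigned short g_w;

/* one candidate operation: a shift/mask style write of the high byte */
static void op_shift(void) { g_w = (unsigned short)((g_w & 0x00FF) | (0xEF << 8)); }

/* time a function pointer over many repetitions */
static double time_op(void (*op)(void), long reps)
{
  LARGE_INTEGER freq, t0, t1;
  long i;
  QueryPerformanceFrequency(&freq);
  QueryPerformanceCounter(&t0);
  for (i = 0; i < reps; ++i)
    op();
  QueryPerformanceCounter(&t1);
  return (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
}

int main(void)
{
  printf("shift/mask: %.6f s\n", time_op(op_shift, 10000000L));
  return 0;
}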

I've never had good luck with profilers, but it seems that, as stated, it would be one of the better ways to go if you can get a meaningful difference in speed from it.

-TJ

 
Ah, now that raises an interesting question! I really hate to say anything against someone as good as M. Abrash, but when I tried his timer on my old Pentium 75 it didn't actually work. It's possible I mistyped something, but I did go through it quite a few times checking. The problem is that I don't understand the timer as well as I'd like to, and I didn't delve far into the technical details of the PC's timing systems, so maybe I got something wrong.

Anyone else had problems with the Zen timer? I'm painfully aware, too, that PC compatibility isn't quite as good as it might be, and while I'm sure Abrash's timer worked spot on on every machine up to when he wrote it, maybe my PC was a bit weird??
 
I find most of Michael Abrash's early stuff unnecessarily complicated, too concerned with bit counts, Mode X and stuff like that. If you can take his ideas and find a way to use them, that's great; we all stand on the shoulders of those of his caliber.

There are several versions of the Zen timer relating to the "size" of the program. I never had a problem with it, but it's been several years since I used it, and never in Windows or MSVC. I can say it works correctly in Borland 3.1-4.5 in DOS on W95, for sure.

A lower-precision timer over a longer period of time might also work.

 