Speed of different settings/modes (best in machine cycle terms)

Status
Not open for further replies.

rasgar

Programmer
Aug 17, 2005
17
US
For quite some time now I have wondered about the speed of different modes and which is the most efficient. First of all, is 32-bit assembly any faster than 16-bit (assuming the same system)? Also, is there any more speed to be gained from using the real mode segmented model, or do real mode and protected mode flat work the same, but with fewer complications? I wondered if the extra versatility of the segment registers would allow more efficient code. Please specify the difference in efficiency in machine cycles if possible.
 
You'll find tables of machine cycles in the back of any good assembly handbook, and they differ between protected mode and old-fashioned 16-bit. If in protected mode, many instructions differ according to the privilege level of the code.
But machine cycles are only a part of the story. Anything dealing with any hardware (which includes system memory!) can be held up by the hardware, and in extreme cases processor speed becomes completely irrelevant.
Further, the speed of an instruction in machine cycles is not a constant; it depends on surrounding code. Some instructions can pair up properly using the multiple pipes of a modern processor, others cannot. Some fast instructions, in some processors, will run quicker than the processor can retrieve data, so eventually data entry into the processor becomes limiting; but in real code there's likely to be a mix of fast and slow things, so this situation won't arise. It's a typical artificial problem that happens when you write special code to time a particular instruction by running it a million times in a loop.

There are other things that confuse the issue. Most people, when they originally moved from 16-bit to 32-bit, did so with new, faster hardware. Others recompiled old 16-bit applications with a new 32-bit compiler and were upset that everything got twice as large but not faster. This is grossly unfair; 16-bit data chunks are actually slower to handle in a 32-bit world, because they are the wrong size, and every instruction that handles them requires an operand-size override prefix, which costs handling time. If you're writing 32-bit code, it won't run fast unless it takes advantage of the 32 bits!

This is a large subject, and you're going to have to do quite a bit of reading around if you're genuinely interested. If you're looking for a quick-fix for a homework assignment, there isn't one.

I'd recommend Intel's information for programmers and processor handbooks. Last time I looked (quite a while ago now) it was all available online.
 
I suppose I was rather open-ended in my question, so I'll see if I can elaborate a bit more. I'm mostly a self-taught assembly user, and I learned it with the ambition of creating and hacking chemistry software (quantum etc.). What I'm concerned about is the small amount of time that I have to build on my knowledge, so I want to program in, and only in, the fastest assembly.

So the condition would be that I need to know what can do the same task at the fastest speed (assuming the code is optimized for each). Another condition would be that the task would be large and complex with many threads, such as rotating objects in 3-D while viewing geometrical/trigonometric properties of the object as it is altered and so on.

I assumed that real mode segmented would have the advantage of using the segment registers, and that 16-bit software would run faster. What I need to know (disregarding machine cycles this time) is whether the segment registers being inaccessible would cripple the programmer, and what runs faster: optimized 16-bit or optimized 32-bit code (if the code is dealing with high or low amounts of data).

I don't have much knowledge of the internals of the Pentium chip, so I made the educated guess that 32-bit is more efficient with large data and 16-bit is faster with small amounts of data. If that is insufficient for a more straightforward answer, then just assume the code rotates graphics in 3-D or that it graphs an equation.
 
There are some ways to tackle your problem, but here are some points to think about first:

- using real 32-bit code needs an underlying system, like
Windows, Linux, etc.
- the timings for 16-bit code and 32-bit code depend on
the processor you are using.
- things like caching are also good to think about.

If you use a 32-bit operating system, even Linux, it is
always taking time for background handling.
Systems like DOS do that too, but they have no task switching,
only interrupts to handle.

There are a lot of processor types that have optimized
their timing either for 16-bit code or for 32-bit code.
There is also a great deal of speed difference between one
processor and another concerning instructions like rep
and other string handling.
But string handling can be done using simple code, so
you may not have to bother about that.

If your code is badly written, caching and pipelining will not function properly and will make the lot much slower than
without.

In the field that you are interested in, I think there is
a type of high/low-level language that has been used for years
and is pretty fast, at least on recurring calculations.
Its name is FORTH and you can look it up on the
internet.

One last thing: 16-bit code and 32-bit code can be used
together in one program under real mode, so you can use
the faster 32-bit arithmetic while still running under a
simple operating system.
There is a drawback if memory above 1MB is needed, because you have to call the memory handling driver to use it (the driver runs in protected mode when called).

Good luck, Tessa

 
If you have little time available, and many of the tasks you're doing are standard scientific programming and things that are widely used (3D rotations etc.), you should be able to find libraries whose code will probably be more efficient than yours (no offence! I just mean that you are writing in a hurry with a definite aim in mind).
Good high-level code beats bad assembler every time. And most C++ compilers write code better than most humans.

A really good scheme is to write in high-level language, and if you find a particular part of your code is taking a very long time, have a look (1) whether there is a better algorithm, (2) are there other tricks that might help (you're handling more data than you need, or you could cache results), and finally, if all that fails, (3) it might be worth writing an embedded assembler thing to handle just this particular piece of code.
 
I forgot to add: the segment registers haven't disappeared in 32-bit code, it's just that you can't put any value in them that isn't a valid reference to a chunk of memory. Frankly I've never found that a (new) limitation, because in my 16-bit code I'm invariably using ds and es to reference memory anyway, and fs/gs are so unversatile.
 
Writing code for scientific calculations is, and will be,
done faster in a high-level language than in assembly.
Fiddling around with the segment registers is up to
the programmer, and the use of fs and gs will add some
extra bytes and execution time.

If you just want to make a "working" program in as little
time as possible, use any high-level programming tool.

But since this is programming at a level that deals with
repeating a lot of simple calculations (mostly adds and divides),
it is good practice to look around for some compact
assembly code.

But OK, if you are short of time, use what you know, not
what you have to learn.

Tessa
 
I need to point out that my time isn't so limited that I can't fiddle with code, just that assembly is a massive language and I prefer to simply learn most of the fast portions and the details of how it works. I heard that not even Michael Abrash, one of the best assembly writers out there, can stand up and say "I know assembly". So my time limitations are mostly in comparison to the massive amount of knowledge.

As for the high-level language, I prefer to hack a lot of code to unite good aspects from many programs into my libraries of macros, as well as compile it with other libraries. Libraries, to me, seem just as good a solution as a high-level language, especially since the only other languages I know are VB, Java, and pretty rusty C++ (which can sometimes go slower than death). Code optimization is a concern, which is why I say there are great quantities of knowledge to be obtained, and I'm stubborn about always having to find out how things work (hence my passion for science).

Put simply, I want to know what does the most math (of different types) the fastest. I suppose you can assume that the program is for an Intel Pentium 4 processor or stronger.

And I suppose that this is a foolish question to someone who knows about memory addressing, but what is the great difference between 16-bit and 32-bit that makes their speed different? Books I've read on assembly simply say they process different amounts of data and don't explain further. It seems like everyone is concerned with the instructions of assembly rather than how assembly functions.
 
I think that people are trying to tell you to write your code in a high level language, then run a profiler to determine where the bottlenecks are, then rewrite the bottlenecks using in-line assembler.

That is definitely a non trivial task.

The complexities of modern processors make cycle counting in assembler more or less irrelevant.

I used to do a lot of that to speed stuff up...
 
Cycle counting is not useful if you forget that instructions go through two or more pipelines at once.
The main speed-up comes from pairing instructions that
do not need a result from one another.

So:

- mov ax,bx
- cmp ax,10

takes 1 cycle extra because the cmp has to wait for the result of the mov.

- mov ax,bx
- inc cx
- cmp ax,10

takes the same amount of time, but now you have done
an extra instruction too.

The reason the difference between 16-bit and 32-bit code is
never explained clearly has to do with the fact that a segment
register in real mode just holds the value you put in it.

In protected mode (that is the 32-bit mode people are
talking about) the segment register holds a pointer
to a segment descriptor, and in the background, normally not
visible to the program, the processor also keeps the size, the
starting address and some info bits about what kind of segment it is.
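
The difference described above can be sketched in C. This is only a model under stated assumptions: `struct descriptor` is a hypothetical, heavily simplified stand-in for a real segment descriptor (which also carries type and privilege bits), and the lookup is done on every call here, whereas the processor does it when the segment register is loaded and keeps the result in the hidden part of the register.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Real mode: the segment register is just a 16-bit number.
   Linear address = segment * 16 + offset, with no checks. */
uint32_t real_mode_linear(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}

/* Hypothetical, simplified descriptor: only base and limit. */
struct descriptor {
    uint32_t base;   /* starting address of the segment */
    uint32_t limit;  /* highest valid offset inside the segment */
};

/* Protected mode: the selector picks a descriptor from a table,
   the offset is checked against the limit, then added to the base.
   Returns false where real hardware would raise a fault. */
bool protected_mode_linear(const struct descriptor *table, size_t table_len,
                           uint16_t selector, uint32_t offset,
                           uint32_t *linear)
{
    size_t index = selector >> 3;  /* bits 3..15 hold the table index */
    if (index >= table_len || offset > table[index].limit)
        return false;
    *linear = table[index].base + offset;
    return true;
}
```

For example, the real-mode pair B800:0000 maps to linear address B8000 hex, while in the protected-mode model the same kind of access first has to pass the table lookup and the limit check.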

So beginning with the 80386 there was a whole lot more
to do before an instruction referring to these segment
registers is executed.

That's why designers implement more and more cache blocks
in the later processors.

If you want to know the exact cycle count, you now have to know what the processor was doing before this instruction
and whether it has cached the current instruction or
has to fetch the instruction from the slower memory first.

That's why the books, mostly written by people who don't
know how things work in protected mode, never speak
about the timing difference between real and protected mode.

But there is a simple way to find it out for
a routine: use the internal 64-bit counter (the time-stamp
counter, read with the rdtsc instruction) that increments
at the same speed as the processor is ticking.

Read it before you enter your code and just before leaving
it. Subtract and you know the clock counts it takes to
execute.
Try it a few times (the first time your code isn't cached)
and you know more or less how long your code takes to
execute.
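
A minimal sketch of that measuring routine in C, under some assumptions: it swaps in the standard `clock()` function for the processor's counter so that it stays portable (on x86 compilers the counter itself can be read with the `__rdtsc()` intrinsic instead), and `sum_u32` is a made-up routine standing in for whatever code you want to time.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical routine under test: sums an array of 32-bit values. */
uint32_t sum_u32(const uint32_t *data, size_t n)
{
    uint32_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += data[i];
    return total;
}

/* Times the routine `runs` times, reading the clock before and after
   each run. The first run is a warm-up (code and data are not cached
   yet) and is ignored; the smallest remaining tick count is returned.
   Returns 0 if fewer than two runs are requested. */
clock_t best_ticks(const uint32_t *data, size_t n, int runs,
                   uint32_t *result)
{
    clock_t best = 0;
    int have_best = 0;

    for (int run = 0; run < runs; run++) {
        clock_t start = clock();
        *result = sum_u32(data, n);
        clock_t elapsed = clock() - start;

        if (run == 0)
            continue;  /* skip the cold first run */
        if (!have_best || elapsed < best) {
            best = elapsed;
            have_best = 1;
        }
    }
    return best;
}
```

`clock()` is far coarser than the cycle counter, so a very short routine would need to be repeated inside each run before the numbers mean anything.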

That's it for now, I'll write more later.

Tessa
 
I guess this is my next attempt to get a more narrowed-down answer, so here goes. Assuming that you had the best transistors possible on 2 different systems, and one of those systems was optimized for 32-bit and the other for 16-bit assembly, with both of them running the most "efficient" operating system compatible with them, which one would have the fastest run-time for a program that rotates a 3-D object? I suppose the same condition would apply to the different modes, such as real mode flat.

Dang, I'm tired of the ambiguity of assembly knowledge. If anyone can come up with a question that seems to be what I'm asking, and answer it, I would be much obliged.
 
Mixed use of 16-bit and 32-bit registers under plain DOS.

Tessa

p.s. I thought you wanted to know why?
 
I guess that just about answers my question. Although I don't know how to mix them, I'll find out. It does disappoint me that I will have to pretty much broaden all of my knowledge instead of being able to focus on a single mode. The most I can narrow it down to is NASM assembly. Gotta love open source code. Pity that code optimization is going to be a pain with all the knowledge I have yet to acquire.
 
I suspect the answer to your question is that no one really knows anymore.

In the days of the 286 things were relatively simple.

Now with multilevel caches and out of order execution who can tell anymore?

That's why there are so many different compiler switches on C++ compilers... you fiddle about until 1) it's stable 2) it's faster.

Then 3) take the debug code out & watch it fall over in a heap.
rgds
Zeit.
 
(1) From your question a few posts back, I think you're interested independent of operating system and processor in what data size is fastest.

The most efficient system is the one that holds exactly the data you need, no more, no less. If you need 32 bits to remember the coordinates of the thing you are rotating (or you need the 32 bits to hold the results of a multiplication) then a 32 bit system would be best.

A larger data size wouldn't necessarily make things slower.

But you haven't mentioned floating point yet...

(2) You mention C++ running very slowly. If you have a piece of C++ code running much slower than you'd reasonably expect, then you probably have a bad choice of algorithm, or other bad bit of coding that needs sorting out before going to assembler. An optimised version of slow code is just going to be slightly less slow (if you're lucky).
 
Oh no, not floating point for 3D rotating and resizing.
Just use a simple algorithm and you will not need it.

3D rotating, shadowing and resizing are done by
handling a lot of vectors, which, when optimized, only
needs 32-bit adding and dividing.
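
One common way to keep rotation in integer arithmetic is 16.16 fixed point. The sketch below is a generic illustration and not necessarily the algorithm meant above: it uses multiplies as well as adds and one divide by a constant, all names (`fx`, `fx_mul`, `rotate_z`) are made up for the example, and the sine and cosine values are assumed to come from a precomputed table.

```c
#include <stdint.h>

#define FX_ONE 65536  /* the value 1.0 in 16.16 fixed point */

typedef int32_t fx;   /* 16 integer bits, 16 fraction bits */

/* Multiply two 16.16 values: widen to 64 bits, then scale back. */
fx fx_mul(fx a, fx b)
{
    return (fx)(((int64_t)a * b) / FX_ONE);
}

struct vec3 {
    fx x, y, z;
};

/* Rotate v about the z axis. cos_a and sin_a are 16.16 values for
   the angle, assumed to be looked up in a precomputed table. */
struct vec3 rotate_z(struct vec3 v, fx cos_a, fx sin_a)
{
    struct vec3 r;
    r.x = fx_mul(v.x, cos_a) - fx_mul(v.y, sin_a);
    r.y = fx_mul(v.x, sin_a) + fx_mul(v.y, cos_a);
    r.z = v.z;  /* unchanged by a rotation about z */
    return r;
}
```

Rotating the point (3, 4, 5) by 90 degrees, where the cosine is 0 and the sine is `FX_ONE`, gives (-4, 3, 5) with no floating-point instruction involved.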

If you want to know more about it, let me know and I'll
try to explain it.
It takes a lot of typing for me, that's why I hope you
already know how to approach this problem.

Mixing 16-bit and 32-bit registers is as simple as using
al and ax.

- al is the low byte
- ax is the low word
- eax is the total register

This works for all general purpose registers, except that
there is no byte version for: si, di, bp and sp
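
The way the three names overlap can be modelled in C with masks and shifts. This is just an illustration: the function names are made up, and a plain 32-bit integer stands in for the eax register.

```c
#include <stdint.h>

/* al: the low byte of eax. */
uint8_t view_al(uint32_t eax)
{
    return (uint8_t)(eax & 0xFFu);
}

/* ah: the second byte of eax (the high byte of ax). */
uint8_t view_ah(uint32_t eax)
{
    return (uint8_t)((eax >> 8) & 0xFFu);
}

/* ax: the low word of eax. */
uint16_t view_ax(uint32_t eax)
{
    return (uint16_t)(eax & 0xFFFFu);
}

/* Writing ax replaces the low 16 bits of eax and leaves the
   upper 16 bits as they were. */
uint32_t write_ax(uint32_t eax, uint16_t value)
{
    return (eax & 0xFFFF0000u) | value;
}
```

With eax holding 12345678 hex, al is 78, ah is 56 and ax is 5678; writing ABCD to ax leaves eax at 1234ABCD.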

Good luck, Tessa
 
I'm definitely getting closer and closer to the answer that I want. And C++ doesn't run very slowly, it's mostly Java and VB (but you gotta appreciate their simplicity).

Also, I would like to run other mathematics in the background (for instance, electron cloud coordinate algorithms and other quantum equations) concurrently with the visual change. Assuming the most simplicity here, however, what would graph planes or quadratic equations the fastest? I assume 32-bit would. I would almost undoubtedly need to use floating point when it gets too complicated.

And thanks for your answers, they definitely get me closer to what I want.
 
When I'm driving my car from A to B, I want to visit the point B on the other highway (concurrently;)...
 