
should code and data be separated?


lionelhill
Last week I saw an article (New Scientist I think) about hardware solutions to the buffer overflow weakness (as exploited by hackers etc. to gain control of other people's computer equipment).
The idea seemed to be to separate code and data by hardware means, such that memory containing code could not physically be written to (in this case by an overflow from a preceding block of data).
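
As far as I can tell, the weakness being targeted is the classic unchecked copy. Something like this, I gather (a made-up C fragment just to fix ideas, not from the article):

#include <string.h>

/* If 'input' is longer than 15 characters plus the terminator, the copy
   runs past 'buf' and tramples whatever sits after it in memory; on the
   stack that can include the function's return address. */
void parse(const char *input)
{
    char buf[16];
    strcpy(buf, input);   /* no length check at all */
}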

Does anyone else, like me, feel that we're losing a baby with the bath-water? One of the greatest realisations in the history of computing was surely that code is just a specialised form of data, and not a distinct special thing. And now we're undoing that realisation?

And incidentally, what's the difference between self-modifying code and "just in time compiled" code? I admit self-modifying code is a niche interest found in only a few special applications, but it's still one of the useful tools available to the constructive programmer.
I'm interested in other viewpoints and would greatly appreciate finding out more from anyone who has time and knows a bit more than I gleaned from a short article.
 
Yes, we would be losing a baby with the bath-water.

Attempts to dumb down computers in order to defeat hackers are the dumbest initiative we could take.

First of all, there is no guarantee that it would work. Second, these pests will simply find a new way to be annoying.

Let's learn from other branches of science: biology, for instance, where a battle is raging against, what else, viruses.

Dimandja
 
The nature of what gets into the CS:IP register is such that it would take a profound hardware change to meet the spec of separating code from data. Intel and AMD won't go for it, and I doubt the market would either.

And as noted, it may not work. It's not going to happen--too much code out there to be recompiled and there are other ways to defeat hackers. Not the least of which is demanding an OS written by something better than the gang of mediocre-at-best coders from Redmond.
--jsteph
 
Von Neumann, Harvard, and other machine architectures have been around for a long time.

Another architecture that is a sort of "hybrid" of the two in many ways is a tagged-memory architecture. Tags can be used to "type" memory at a hardware level, and an executable type is just one of many that can be supported. There are lots of these around, and they've been in regular commercial use for many decades.
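
To make the idea concrete, here's a rough software model of what the tag check buys you (a toy C sketch of my own, not how any real tagged machine works internally):

#include <stdio.h>
#include <stdlib.h>

/* Toy model: every memory word carries a tag saying what kind of thing it is. */
enum tag { TAG_DATA, TAG_CODE, TAG_DESCRIPTOR };

typedef struct {
    enum tag      tag;
    unsigned long value;
} word;

/* A "store" in this model refuses to touch anything not tagged as data,
   so a runaway copy can never overwrite code words. */
void store(word *mem, size_t addr, unsigned long v)
{
    if (mem[addr].tag != TAG_DATA) {
        fprintf(stderr, "memory protect interrupt at word %zu\n", addr);
        exit(1);   /* the real hardware raises an interrupt instead */
    }
    mem[addr].value = v;
}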

As a matter of fact I've heard that both Intel and AMD are looking at this for their 64-bit processor refinements. Microsoft has even talked about support for tagged-memory machines in future releases of Windows. Intel's 80960 used tagged memory for example, so we know this concept isn't new to the microprocessor world. I wouldn't be surprised at all to see things move in this direction.

Pretty old hat stuff, and long overdue.

I doubt we'll be seeing much in the way of true Harvard Architecture machines produced for general use in PCs though. That is, outside of DSPs where it is very common for things like audio and video processing.

Geeze, what do they teach in Computer Science anymore if they don't cover this stuff?
 
You're absolutely correct dilettante, tagged memory architectures have been around for quite a while. In a similar vein, ringed architectures were in use almost 30 years ago. With respect to your last question, I think most accredited CS degree programs do cover these topics. It goes to show how few of those who work in the IT profession have formal education in IT, and thus know about IT. This is one of the many things that they don't even know that they don't know about. After all, no vendor, at least that I know of, offers a certificate in Computer Architecture. The two-year Associate degree programs simply don't have the time to address this topic, nor any of the other advanced topics in CS that you'll discover at the junior and senior levels of a quality program.

The difference between "just in time" compiled code and self-modifying code has to do with when compilation takes place, and what is actually executed. When a program is compiled has no bearing on the instructions produced by the compiler. With the possible exception of some optimizations, whether a program is compiled one day, one week, or one second before execution does not change the output of the compiler; that output is the program's instructions. A self-modifying program is one which, after compilation and while it is executing, alters its own instructions.
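
If a concrete picture helps, here is a minimal sketch (my own, assuming a POSIX system on x86-64, error checking omitted) of code being generated at run time and then modified in place; the machine code bytes are just mov eax, imm32 followed by ret:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    /* mov eax, 42 ; ret  -- hand-assembled x86-64 machine code */
    unsigned char code[] = { 0xb8, 42, 0, 0, 0, 0xc3 };

    /* While it is being produced, the "program" is just data in a
       read/write page, exactly like the output of a JIT compiler. */
    unsigned char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memcpy(buf, code, sizeof code);

    /* Flip the page to read+execute: now it is code. */
    mprotect(buf, 4096, PROT_READ | PROT_EXEC);
    int (*fn)(void) = (int (*)(void))buf;   /* non-portable but conventional */
    printf("%d\n", fn());                   /* prints 42 */

    /* Self-modification: make it writable again and patch the immediate. */
    mprotect(buf, 4096, PROT_READ | PROT_WRITE);
    buf[1] = 99;                            /* imm32 operand of the mov */
    mprotect(buf, 4096, PROT_READ | PROT_EXEC);
    printf("%d\n", fn());                   /* prints 99 */

    munmap(buf, 4096);
    return 0;
}

Either way, what finally executes is ordinary machine instructions; the only difference is who wrote them, and when.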

Good Luck
--------------
As a circle of light increases so does the circumference of darkness around it. - Albert Einstein
 
Thanks people; I admit my knowledge is very shaky. IT is just such a big field and many of us aren't really IT people at all, just users up against problems for which solutions aren't available off the shelf. A lot of what we learnt came from playing about with a ZX80 or whatever, and there wasn't much complicated memory architecture in that (i.e. we're more familiar with arguments between cpu and video over who owns the screen memory!).
Incidentally I remember a hideous discussion I had with a genuine, trained programmer who felt I was totally unreasonable to expect her to know about how a linked list is implemented: "that stuff's for amateurs. Real programmers don't have time for that sort of thing. You just use a library". Hm, maybe she's right, but I couldn't live that way myself.

Cajun, one of my reasons for asking about "just in time" code was an amateurish uncertainty about how anti-hacker measures controlling memory use would deal with that. Code that's being compiled is presumably data, and overwritable by data operations, and therefore hackable by buffer overflows etc. (provided you hit the little moment before it's declared done and is now code, not data).
What I've never understood is that I thought this sort of memory "buffer overflow" was what the protected mode of the 386 and upwards was supposed to stop already?
 
Yes, I foolishly overreacted to what looked like a lot of negativity. My mistake, I should be wary of slinging "slashdot" myself.

There are a lot of places in most "386" operating systems where buffer overflows can occur. One of the most common is in things like device drivers operating in kernel mode. Others include global heaps, stackpool buffers, and code written to permit self-modification. ;-)

Part of the problem is that i386 processors have some pretty baroque memory models available. Since i386 memory segmentation was difficult to deal with (especially for the assembly-level application programmer, but also for compiler/linker writers), a "flat" mode was embraced with open arms. I may be out of date on this, but at one point Linux completely eschewed everything but the flat model and made essentially no use of the 386's segment-based protection for user programs and most system functions.

Many ringed architectures can fall into a trap when bugs in context-switching code grant elevated privileges to user code as well.

Sadly, I haven't had the luxury of looking at these matters myself in... wow, at least 7 years. Amazing how your brain can rust up when you don't use something.
 
"that stuff's for amateurs. Real programmers don't have time for that sort of thing. You just use a library".

Which would make all the library authors amateurs... Yikes!
 
Didn't you get the memo?
Code and Data are separate on Intel chips:
It's called Protected Mode.
And no program or OS has crashed since the 80286 was
released in 1982 with the infamous MOV CR0 instruction!
:)

start discussion as to why MS code still crashes and year after year more code is put INTO ring 0......

Linus, Drew: make it efficient!
Gates: cache it!
 
Thanks again, especially dilettante for the explanation! Anyone know any reasonably accessible introductions to this sort of thing for the uninitiated but interested? I'd like to know more for my general education, though I have to admit it's not really relevant to my work (i.e. most of us just use the memory model provided and have been rusting up ever since we stopped having to decide whether to go 'tiny' or 'small', let alone 'huge').
 
Data should be separate from code. It was thus in the old days when I started programming (e.g. DEC System-10). You put your code in one segment and your variables in another. If anything tried to write to the code segment, the operating system caught it as an exception. Computers were multi-state, so that only the operating system could carry out certain instructions, e.g. allocating memory outside your currently allocated segment. If a virus tried to interfere with another process, it would be bumped within milliseconds.
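
For anyone who never saw it, the effect described there is easy to demonstrate even now (a throwaway C sketch of mine, relying on the loader mapping code pages read+execute only):

#include <stdio.h>

static int add(int a, int b) { return a + b; }

int main(void)
{
    unsigned char *p = (unsigned char *)add;     /* non-portable, but illustrative */
    printf("first byte of add(): 0x%02x\n", *p); /* reading code is allowed */
    /* *p = 0x90;  <- uncomment this and the write to the code segment is
       caught as an exception (SIGSEGV / access violation), because the
       pages holding code are not writable. */
    return 0;
}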

Modern chips like Pentiums are extremely sophisticated, so I can only assume it's Bill Gates' craven naivety that has led to this mess of viruses and so on that is quite incredible to me. There is no need for it whatsoever.

DEC of course no longer exists. Isn't life strange...

 
Hmm... it's tougher than I thought to find much on processor architecture. A lot of the information on the web consists of book reviews, lecture descriptions, or adverts.

Part of it may be that computer architecture at this level is now the province of "the back room boys" at mega-corporations. I suspect that both the commoditization of processors and their reduction to micron-level traces in silicon mean that a lot less research is done in universities any longer: bigger fish to fry on the one hand, and too hard and expensive on the other.

Here are a very few links though. Finding much at an entry level might be hard, so one may need to dig up more "historical" items first in order to find a context to understand the others in. Maybe these will provide other keywords to do your own searches on:

Processor Architecture

Lecture: EPIC, IA-64 and Merced

“PIC”king the right µC

John von Neumann and von Neumann Architecture for Computers (1945)

Microcontroller Design Tradeoffs
 
Hey dilettante,
there's LOTS of CPU arch stuff out there.
check the Intel Developer site
 
Thanks for the suggested reading, all.

(1) About hardware solution to virus attack (buffer overrun weakness)

I'm still a bit sceptical about how effective code and data separation would be if people expect a hardware solution. BNPMike's answer makes the point that in the end the operating system decides who writes where. All code is data briefly when it's being put into memory (unless you want machines where programs are bought on ROM and plugged in; with modern huge addressing spaces it's a possibility...). Unless the operating system is on the ball, programs will always be vulnerable when first being loaded into memory.

And if the operating system IS on the ball, then there's no need to enforce code/data separation. There is only a need to enforce separation between different processes. If a particular process decides to write over its own code, good luck to it. It can't affect anything else.

It remains the application software's job to be "on the ball" itself and not allow itself to overwrite its own code with viruses. Surely that's possible? What sort of amateurish programmer fails to check that a block of memory is big enough for what he/she intends to write?
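
The check in question is nothing cleverer than, say, this (an invented C fragment):

#include <string.h>

/* Copy 'src' into a fixed-size buffer only if it actually fits. */
int safe_store(char *dst, size_t dst_size, const char *src)
{
    size_t needed = strlen(src) + 1;   /* include the terminator */
    if (needed > dst_size)
        return -1;                     /* refuse rather than overflow */
    memcpy(dst, src, needed);
    return 0;
}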

The operating system cannot, in any case, prevent application-level viruses. It can't do anything about an interpreter handling a macro language! A macro virus in an application program can be written without recourse to writing in a code segment.

(2) Self-modifying code.

But if you do ban writing to code segments you lose some good things too. Imagine the situation where you need a shift instruction shifting a variable number of bits. It might be drawing an image in a bit-planed memory map (e.g. 16-colour graphics on VGA for a simple, rather old example). You've passed the necessary parameters to an assembly procedure/function which now calculates the necessary shift. But you need to use this shift a bit further down inside a loop whose speed is critical (otherwise you wouldn't be writing assembler!). The classic way to do this is to load cl with the necessary shift, but good gracious, there aren't that many registers, and it might be necessary to use ecx/cx/cl for other things in the meantime. The temptation to write the calculated shift in place in the instruction is huge! It is a single mov, and you'd need that anyway even if you saved it locally or in ds. But if you use ds you've used a global variable (or had to set up ds), and if you save it locally you need to set up a stack frame, which is another slow item. And even if you do that, you still need a mov cl, "thing" in the middle of your inner loop...
In a case like this, self modifying code saves you worrying about where to store an item which is essentially data anyway, is easy to understand, and takes instructions out of a time-critical inner loop. All these are good things, aren't they?
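
Put in C-ish terms, the trade-off looks something like this (made-up names, just to show where the shift count lives):

/* Variable shift: the count has to live in a register (cl on x86) or in
   memory for the whole inner loop. */
void draw_row(unsigned char *dst, const unsigned char *src,
              int n, unsigned shift)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] >> shift;
}

/* Constant shift: the count is encoded as an immediate inside the shift
   instruction itself.  Patching that immediate at run time, once the real
   shift is known, is exactly the self-modifying trick described above. */
void draw_row_fixed(unsigned char *dst, const unsigned char *src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] >> 3;
}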
 
Did you know that Sun's SPARC architecture already has this facility?

You can turn it on with a system flag at boot in Solaris... however, the same flag when booting Solaris x86 doesn't do anything.
 
Ada has this sort of protection built in (i.e. convert all code to Ada). Simple.
 
I think there is still a bit more to be clarified.

Here's another short page about a processor family I have worked on extensively, the former Burroughs Large Systems (now called ClearPath NX and LX, and in a second they'll be renamed something else):

Design of the B 5000 System

Burroughs' B6500/B7500 Stack Mechanism

Click on the Contents link to get to more pages on architectural topics.

Descendants of the B5000/6000/7000 are still in production (and use) today. In one of my workplaces there are three of these in use (pretty darned big mainframes).


To stay on topic, these machines use a tagged-memory architecture that isn't as wimpy as that used in IBM mainframes and "Joe Microprocessor." Memory is tagged or typed at the word level (48-bit word, 96-bit double word - ignoring the extra tag bits).

User programs can't reach the tag bits. Most system software can't even get to the tags. A program that can create an executable program (such as a compiler or linker) has to be specially marked to get this capability. To mark a program as a "compiler" requires very elevated user privileges and can optionally be further restricted.

The tags define whether a word contains data, a descriptor, valid code, etc. All arrays are allocated by the operating system, and arrays are bounded at each end by "guard words" with non-writeable tags. Trying to run off the end of an array results in a hardware interrupt. Programs can arm traps for these exceptions, but even then they can only process the exception by performing legal actions. You can't overflow a buffer on these machines.
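
A crude software model of what the indexing hardware enforces (my own sketch, nothing like how these machines are actually programmed):

#include <stdio.h>
#include <stdlib.h>

/* The descriptor records the length the array was actually allocated with. */
typedef struct {
    double *words;
    size_t  length;
} descriptor;

double fetch(const descriptor *d, size_t i)
{
    if (i >= d->length) {
        fprintf(stderr, "invalid index interrupt\n");
        exit(1);           /* the hardware raises an interrupt here */
    }
    return d->words[i];
}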

Short arrays can be allocated in-stack, but of course even the stack words are tagged. Even if you run off the end of an in-stack array, all you can do is corrupt writable user data words in the stack. You can't write into things like array descriptors in the stack; that causes another hardware exception. Fall off the end of an array and you hit another guard word, and the process terminates that way.

Array indexing also relies on hardware protection. Those array "guard words" are more than that, as are the array descriptors you must index against: they contain the array length as actually allocated.

There isn't even any normal way to write machine instructions in a program on these machines. Even the special compiler used for the OS isn't able to create programs the normal application loader will load and run. And high-level constructs that are hazardous need to be explicitly marked "unsafe" to compile, and they aren't accepted if you don't have the proper user rights. A normal installation doesn't even include a compiler that can compile such unsafe constructs into an executable program, since compiling the OS doesn't even require a compiler like that. The OS code can only be booted, not run as a normal program.


You don't see these babies hacked much. You'd have to actually gain console access or "root" privileges to load something malicious. As on any platform you need to guard the keys to the kingdom, but this level of hardware protection means you can forget about buffer overflows to "root" these machines.

They had to go to extreme lengths to get a C compiler to work on these platforms. Even so, it ends up having to emulate a lot of crap a C programmer would want (like pointers that are merely integers). It's all pretty well sandboxed - and runs like crap too. I don't think anybody uses it except to port some sorts of command-line programs from the C world, and to offer POSIX compliance at some level.

"Real programmers" on these boxes write in Algol. Even the OS is written in a dialect of Algol. No, not Pascal: Algol. Extended Algol-60, the real deal. They have Pascal compilers but there hasn't been the demand to try to extend them like Borland did to make it a useful language.
 
Hmm... it looks like the Extended PowerPC architecture used in AS/400s may also tag memory at a word level.

I can't find any details though. Anyone else? I know we have some AS/400 fans on this site.
 
As you imply, these protection schemes are somewhat undermined by the predilection of Microsoft et al for deliberately unsafe languages, e.g. C and its derivatives.


 