Misunderstanding Computers

Why do we insist on seeing the computer as a magic box for controlling other people?
Why do we want so much to control others when we won't control ourselves?

Computer memory is just fancy paper, CPUs are just fancy pens with fancy erasers, and the network is just a fancy backyard fence.
コンピュータの記憶というものはただ改良した紙ですし、CPU 何て特長ある筆に特殊の消しゴムがついたものにすぎないし、ネットワークそのものは裏庭の塀が少し拡大されたものぐらいです。

(original post/元の投稿 -- defining computers site/コンピュータを定義しようのサイト)

Friday, June 30, 2017

Keeping the Return Address Stack Separate

So far, in this extended rant, I have
Now I want to show how a CPU could completely prevent an entire subclass of these attacks, without a whole lot of loss in processor speed.

I first came on these ideas twenty, maybe thirty years ago, when trying to figure out what made the M6809 such a magical microprocessor. (That's a subject for another day.) The M6809 was (and still is) an 8-bit microprocessor with a 16-bit address bus. That means it can address 64Kbytes of memory.

Motorola specified a memory management part called the 6829 which supported designs up to 1 megabyte of memory. It was essentially a block of fast RAM that would be used to translate the upper bits of memory, plus a latch that would select which parts of the RAM would be used to translate the upper bits of the address bus, something like this:

(This is from memory, and not really complete. Hopefully, it's enough to get the concept from.)

Memory management control would provide functions like write protect and read protect, so you could keep the CPU from overwriting program code and set parts of the address space up as guard pages.

With 32 bits and more of address, memory management doesn't quite work this simply, but this is enough to get some confidence that memory management can actually and meaningfully be done.

Now, if you are familiar with the 8086, you may wonder what the difference between the 8086's segment registers and this would be. This kind of scheme provides fairly complete control over the memory map.

The 8086 segment registers only moved 64k address windows around in the physical map, and provided no read or write control. Very simple, but no real management. The 80286 provided write protect and such, but the granularity was still abysmal, mostly based on guesses about the usage of certain registers, guesses which sort of worked with some constrained C programming language run-time models. And these guesses were frozen into silicon before they were tested. It should go without saying, that such guesses miss the mark for huge segments of the industry, but Intel's salescrew has always been trained in the art of smooth-talking.

(Intel are not the only badguys in the industry, they are just the ones who played this particular role.)

Now I knew about both of these approaches, and I knew about the split stack in Forth. And it occurred to me that, if a 6829-like MMU could talk to the CPU, and select a different task latch on accesses through the return stack pointer (S in the 6809), you could make it completely impossible to crash the return stack and overwrite return addresses.

I'm not talking about guard pages, I'm talking about the return addresses just simply can't be accesses by any means except call and return. They're outside the range of addresses that application code can generate.

Of course, the OS kernel can access the stack regions by mapping them in, but we expect the OS to behave itself.

(We would provide system calls to allow an application to have the OS adjust a return address when such is necessary.

Also, since we are redesigning the CPU, we might add instructions for exceptional return states, but I would really rather not do that. It seems redundant, since split stacks make multiple return values so much easier.)

Another thing that occurred to me is that the stack regions could be mapped to separate RAM from the main memory. This would allow calls and returns that would take no more time than regular branches or jumps.

At this point in my imaginings, I'm thinking about serious redesign of the CPU. So I thought about adding one more stack register to the 6809, a dedicated call/return stack. It would never be indexed, so it would be a very simple bit of circuitry. That would free up registers for other use, including additional stacks and such.

(Well, if we allow frame pointers to be pushed with the instruction pointers, and provide instructions for walking the stack, there would be one kind of indexing -- an instruction to fetch a frame pointer at a specific level above the current one. I'll explain how this would work, to aid understanding what's going on here:

There would be a couple of bits in the processor status area, which the OS would set before calling the application startup. The application must not be allowed to modify these bits, but, since the application must be able to confirm that the frames are present, it should be able to read them.

These bits would tell the processor which stack pointers to save with the instruction pointer on calls. The return instructions would have a bit field to determine whether to restore or discard each saved frame pointer.

"Walking the stack" would be simply a load of a specified saved frame pointer at a specified level of calling routine.

In the example shown here, the instruction GETFP sees from the status register that both LP and SP are being recorded, and multiplies the index argument by 3, then adds 2 to point to LP, checks against the return stack base register, and loads LP0 into X.

But GETFP SP,3,Y after pointing into stack that isn't there, checks the return stack base register and refuses to load the frame pointer that isn't there.

Another flag in the status register might select between generating an exception on failure and recording the failure in a status bit.

Maybe. :-/)

Could we do such a thing with the 68000 or other 32-bit CPUs? Add a dedicated call/return stack and free the existing stack pointer for use as a parameter stack in a split-stack architecture? 64-bit CPUs?


But if we intend to completely separate the return addresses, we have to add at least one bit of physical address, or we have to treat at least one of the existing address bits (the highest bit) in a special way. I think I'd personally want to lean towards adding a physical address bit, even for the 64-bit CPUs, to keep the protection simple. But, of course, there are interesting possibilities with keeping the physical and logical addresses the same size, but filtering the high bit in user mode.

And that would provide us with a new kind of level-1 cache -- 8, 16, or maybe 32 entries of spill-fill cache attached to the call/return stack, operating in parallel with a (modified) generalized level-1 cache. The interface between memory management and cache would need a bit of redesign, of course.

I'm not sure it would mix well with register renaming. At bare minimum, the call/return stack pointer would have to be completely separate from the rename-able registers.

Would this require rewriting a lot of software?

Some, but mostly just the programming language compilers would need to be worked on.

And most of the rewrite would focus on simplification of code designed to work around the bottleneck of having the return addresses mixed in with parameters.

There you have a way to completely protect the return addresses on stack.

What about other regions of memory? Can we separate them meaningfully?

That kind of thing is already being done in software used on real mainframes, so, yes. But it does have a much larger impact on existing software and on run-time speed, and it is not as simply accomplished.

But that's actually a question I want to visit when I start ranting about the ideal processor that I want to design but will probably never get a chance to. Later.

Wednesday, June 28, 2017

Proper Use of CPU Address Space

I referred to this in my overview about re-inventing the industry, but I was not very specific. Now that I've been motivated to write a rant or two about memory maps and how they can be exploited:
I can write about the ideal, perfect CPU (that may be too perfect for this world), and how it works with memory.

First, these are the general addressable regions of memory that you want to be able to separate out. I'll put them in the order I've been using in the other two rants:

  stack (dynamic variables, stack frames, return pointers)
0x7FFFxxxxxxxxxxxx ← SP
  guard page (Access to this page triggers OS responses.)
  heap (malloc()ed variables, etc.)
  statically allocated variables
  application code
  operating system code, variables, etc.

The regions we see are
  • Stack (dynamic variables, stack frames, return pointers)
  • Heap (malloc()ed variables, etc.)
  • application code (including object code, constants, linkage tables, etc.)
Operating system code should include the same sort of regions. But that should not really be visible in the application code map. Only the linkage to the OS should be visible, and that would be clumped with the application code.

Memory management hardware provides the ability to move OS code out of the application map. Let's see how that would look:

  stack (dynamic variables, stack frames, return pointers)
0x7FFFxxxxxxxxxxxx ← SP
  guard page (Access to this page triggers OS responses.)
  heap (malloc()ed variables, etc.)
  statically allocated variables
  application code

We used to talk about the problems of accidentally using small integers as pointers. Basically, when pointer variables get overwritten with random integers, the overwriting integers tend to be relatively small integers. Then when those integers are used as pointers, they access arbitrary stuff in low memory. We can notice that and refrain from allocating small integer space. And we realize that we have already dealt with small negative integers by buffering the wraparound into highest memory:

  gap (wraparound and small negative integers)
  stack (dynamic variables, stack frames, return pointers)0x7FFFxxxxxxxxxxxx ← SP
  guard page (Access to this page triggers OS responses.)
  heap (malloc()ed variables, etc.)
  statically allocated variables
  application code
  gap (small integers)

I've posted a rant about using a split stack, with a little of the explanation for why at the end. Basically, that would allow us to move those local buffers that can oerflow, crash, and/or smash the stack way away from the return address stack.

Thus, even if the attacker could muck in the local variables, he would still be at least one step from overwriting a return address. That means he has to use some harder method to get control of the instruction pointer.

Stack usage patterns actually point us to using a third stack, or a stack-organized heap separate from the random allocation heap. Parameters and small local variables could be on one stack, and large local variables on the other.

In other words, scalar local/dynamic variables would be on the second stack and vector/structure local/dynamic variables on the third. This would be especially convenient for Forth and C run-times, virtually eliminating all need of function preamble and cleanup, and simplifying stack management.

Another way to use the third stack would be to just put all the local variables on it. It might be easier to understand it this way, and I'll use the parameter/locals division below. As far as the discussion below goes, the two divisions can be interchanged. (The run-time details are significant, but I'll leave that for another day. Besides, there is no reason for a single computer to limit itself to one or the other. With a little care, the approaches could even be mixed in a running process.)

But the third stack could be optional, and its use determined by the language run time support. The OS run-time support really doesn't need to see it other than as a region to be separated from the others. Here is a possible general map, using 64 bit addressing:

  gap (wraparound and all negative integers)
  gap (large positive integers)
  return stack ← RP
  guard page (240 addresses)
  parameter stack ← SP
  guard page (240 addresses)
  local stack ← LP  gap
  guard page (really huge)
  heap (malloc()ed variables, etc.)
  guard page (240 addresses)
  statically allocated variables
  application code
  gap (small positive integer pointer guard)

If we choose to have stack frames, we could manage them very simply on the return stack by just pushing the local and/or pointer stack pointer when we push the IP. And we just discard them when we pop the IP. Or we can pop them, to force-balance the stack. This gets rid of pretty much all the complexity of walking the stack.

The gaps should be randomized, to make it harder for attacker code to find anything to abuse.

The regions we now have are
  • Return Stack (return address and maybe frame pointers)
  • Parameter stack (parameters only)
  • Locals Stack (dynamically allocated local variables)
  • Heap (malloc()ed variables, etc.)
  • Statically allocated process variables (globally and locally visible)
  • application code (including object code, constants, linkage tables, etc.)
And we have large guard regions between each.

What's missing?

Multiprocessing requires a region of memory dedicated to process (or thread) shared variables, semaphores, resource monitor counters and such. This is a separate topic, but basically the statically allocated variable area would have a section which could be protected from bare writes, with only reads and locked read-modify-write cycle instructions allowed. These would be a separate region, so their addresses could be somewhat randomized.

I'm not sure that it makes sense to manage allocation of shared variables in the malloc() sense, but there is room with this kind of scheme, and modern processors should support that many different regions of memory.

Also, regions of memory shared mmap-style would be in a separate region, or perhaps a guarded region for each. I'm not sure whether the would be protected in the same way as semaphores and monitor counters. It would seem, rather that the CPU instructions would be ordinary instructions, and the mmap region would be a resource protected by semaphore- or monitor- controlled access.

We can do the same sort of thing with 32-bit addressing, although, instead of guard pages 240 or so in size, we would be looking at guard pages between 220
and maybe 224 in size. This would be more appropriate for some controller applications.

We could do the same thing with 16-bit addressing, but it wouldn't leave us much room for the variables and code. On the other hand, looking twice at 16-bit addressing will give us clues for further refinement of these ideas. But I think I'll save that for another rant, probably another day. I have burned up enough of today on this prolonged rant.

Monday, June 12, 2017

Reinventing computers.

I mention my bad habits a bit, but I don't really go into much detail:
One of these days I'll get someone to pay me to design a language that combines the best of Forth and C.
Then I'll be able to leap wide instruction sets with a single #ifdef, run faster than a speeding infinite loop with a #define,
and stop all integer size bugs with my bare cast.
I recall trying to get a start ages ago, on character encoding, CPUs, and programming languages, and, more recently, more on character encoding.

These are areas in which I think we have gone seriously south with our current technology.

First and foremost, we tend to view computers too much as push-button magic boxes and not enough as tools.

Early PCs came with a bit of programmability in them, such as the early ROMmed BASIC languages, and, more extensively, toolsets like the downloadable Macintosh Programmers' Workbench. Office computers also often came with the ability to be programmed. Unix (and Unix-like) minicomputers and workstations generally came with, at minimum, a decent C compiler and several desktop calculator programs.

Modern computers really don't provide such tools any more. It's not that they are not available, it's that they are presented as task-specific tools, and you often have to pay extra for them. And they are not nearly as flexible (MSExcel macros?).

Computers were not given to us to use as crutches. They were given to us to help us communicate and to help us think.

I'm not alone in my interest in retro-computing, but I think I have a little bit unusual ultimate goal in my interests.

I want to go back and re-open certain paths of exploration that the industry has lopped off as being too unprofitable (or, really, too profitable for someone else).

One is character encoding. Unicode is too complicated. Complicated is great for big companies who want to offer a product that everyone must buy. The more complicated they can make things, the harder it is for ordinary customers to find alternatives. And that is especially true if they can use patents and copyrights on the artificial complexities that they invent, to scare the customer away from trying to solve his or her own problems -- or their own corporate problems, in the case of the corporate customer.

Computer are supposed to help us solve our own problems, not to impose our own solutions on unsuspecting other people, while making them pay for the solutions that really don't solve their problems.

Now, producing something simpler than Unicode is going to be hard work, harder even than putting the original Unicode together was.

Incidentally, for all that I seem to be disparaging Unicode, the Unicode Consortium has done an admirable job, and Unicode is quite useful. They just made a conscious decision to try not to induce changes on the languages they are encoding. It's a worthy and impossible goal.

And they should keep it up. Even though it's an impossible goal, their pursuing that goal is enabling us to communicate in ways we couldn't before.

But we must begin to take the next step.

  • The encoding needs to include the ability to encode a single common set of international characters/glyphs in addition to all the national encodings.
  • It needs to include
    • characters,
    • non-character numerics,
    • bitmap and vector image,
    • and other arbitrary (binary/blob) data.
  • It needs to be easily parsed with a simple, regular encoding grammar.
  • And it needs to be open-ended, allowing new words and characters to be coined on-the-fly.
Another path involves CPUs. Intel wants us all to believe that they own the pinnacle of CPU design, but, of course, that is just corporate vanity.

In the embedded world, lots of CPUs that the rest of the world has forgotten are still very much in use, because their designs are optimal in specific engineering contexts. Tradition is also influential, but there are real, tangible engineering reasons that certain non-mainstream CPUs are more effective in certain application areas. The complexity and temporal patterns of the input will favor certain interrupt architectures. The format of the data will favor specific register set constructions. Etc.

Many engineers will acknowledge the old Motorola M6809 as the most advanced 8-bit CPU ever, but it seems to have been a dead-end. ("Seems." It is still in use.) "Bits of it lived on in the 68HC12 and 68HC16." But the conventional wisdom is now that, if you need such an advanced CPU, it's cheaper to go with a low-end 32-bit ARM processor.

What got left behind was the use of a split stack.

The stack is where the CPU keeps a record of where it has been as it chases the branching trails of a problem's solution. When the CPU reaches a dead end, the stack provides an organized structure for backtracking and starting back down new branches in the trail.

Even "stackless" run-time environments tend to imitate stacks in the way they operate, because of a principle called problem context, in addition to the principle of backing out of a non-workable solution.

But the stack doesn't just track where the CPU has been. It also keeps the baggage the CPU carries with it, stuff called local (or context-local) variables. Without the data in that baggage, it does no good for the CPU to try to back up. The data is part and parcel of where it has been.

Most "modern" CPUs keep the code location records in the same memory as the context-local data. It seems more efficient, but it also means that a misbehaving program can easily lose track of both the context data and the code location at once. When that happens, there is no information to analyze about what went wrong. The machine ends up in a partially or completely undefined state.

Worse, in a hostile environment, such a partially defined state provides a chance for attacking the machine and the persistent data that it keeps on the hard disk. (Stack crashes are most effective when the state of the program has already become partially undefined.)

[JMR201706301711: I've recently written rather extensively on stack vulnerabilities and using a split stack to reduce the vulnerabilities:

Splitting the stack allows for more controlled recovery from error states that haven't been provided for. In the process, it reduces the surface area susceptible to attack.

The split stack also provides a more flexible run-time architecture, which can help engineers reduce errors in the code, which means fewer partially-defined states.

There are a couple of other areas in which so-called modern CPUs in use in desktop computers and portable data devices are not well matched to their target application areas, and the programming languages (and operating systems), reflecting the hardware, are likewise not well-matched. This is especially true of the sort of problems we find ourselves trying to solve, now that we think most of the easy ones have been solved.

In order to flesh out better CPU architectures, I want to build a virtual machine for the old M6809, then add some features like system/user separation, and then design an expanded address space and data register CPU following the same principles.

I'm pretty sure it will end up significantly different from the old 68K family. (The M6809 is often presented as a "little brother" to the 68K, but they were developed separately, with separate design goals. And Motorola management never really seemed to understand what they had in the 6809.)

Once I have an emulator for the new CPU, I want to develop a language that takes advantage of the CPU's features, allowing for a richer but cleaner procedural API that becomes less of a data and time bottleneck. It should also allow for a more controlled approach to multiprocessing.

And then I want to build a new operating system on what this language and CPU combination would allow, one which would allow the user to be in control of his tools, instead of the other way around.

This is what I mean when I say I am trying to re-invent the industry.

Saturday, May 27, 2017

A New Kind of SPAM (Unsolicited Advertising)

First, let's get one thing straight. The true meaning of "SpAM" as an epithet for unsolicited advertising is
SPanning Advertisement Message
I know this because I wasn't there. ;-)

This was told to me by a fellow student when I was an undergraduate at BYU during the days before the Internet was the Internet. The semi-apocryphal story referenced the cool idea that wasn't so cool after all, of taking a "span" of the usenet newsgroups and crossposting an announcement to them all.

For instance, if you wanted to advertise a conference on religious studies, you might send a message to "soc.religion.*", meaning all the newsgroups under soc.religion.

Since mailing lists have pretty much replaced the usenet newsgroups, the number of groups in general, and the number under soc.religion, in particular, has fallen drastically. Indeed, it's almost hard to find access to usenet itself, any more.

(Google hasn't helped, really, in trying to merge usenet with their Google Groups. Maybe that merger wasn't what they were trying to do, but that has been the result. And the independent usenet is disappearing.)

Discussing the pros and cons of usenet, both as a concept, and as it really happened, is not what I want to do today.

It used to be that there were more than fifty newsgroups under soc.religion:
soc.religion.atheists, ... soc.religion.bahai, soc.relgion.catholic, ... soc.religion.christian, ... soc.religion.lutheran, ... soc.religion.mormon, soc.religion.universalist, ...
And some of the groups had subgroups, like soc.religion.mormon.meetinghouse. (In this particular case, the parent soc.religion.mormon group was supposed to be for "serious" discussion, and the meetinghouse group for the kind of talk you'd hear after the meetings were over. Each subgroup had it's own reasons and rules.)

Each group had tens, hundreds, and even thousands of subscribers. One spanning message could result in an avalanche of hundreds of thousands, or even millions of messages flooding the lines in those good old days of slow network access. (Our lines were tens of thousands times slower back then.)

People wanted their bandwidth for other things, of course.

But the business types wanted their pipes into people's mailboxes.

Because it takes real money to maintain the internetwork connections that compose the Internet, the business aspect of the Internet has not disappeared, and I guess it won't disappear any time soon. (I haven't enabled ads in this blog yet, but several other blogs have ads enabled, as a kind of moral nod to Google's largess. If my novels were selling, ...)

I've blogged about the archetypical spam before. Most of it is pretty transparent, if you can look at yourself with any semblance of detachment, and if you have any sense of whom you know and whom you don't know.

Messages from people you don't know are to be suspect, first and foremost. They are especially to be suspect if they make some sort of offer.

If the offer is unreasonable, such as money or love for nothing, it's going to be scarier if it is real. Seriously.

A million dollars from someone you've never met?

You'd better believe, if the money is real, it doesn't come without strings attached, burdensome strings, strings that are very hard to get rid of.

Money is like that anyway -- it always comes with strings attached. Big money comes with ropes.

Sex? P0rn? Drugs? It's the same. If it claims to come at no cost, you should be running away from it at full speed.


Here's part of why p0rn is bad, BTW: it seems to offer intimacy at no cost. It isn't real intimacy, and there is definitely a cost.


The Love of Jesus? Well, that's different. Except that, even Jesus expects you to follow His teachings if you want to benefit from being saved.

Good things take some sort of effort, if you really want the good things to be good.

If everyone in the world could understand that, no one would respond to fraudulent junkmail, and the fraudulent stuff would disappear.

(Well, if everyone understood, the need for sending the SPAM would disappear
pretty quickly, too.)

But we have SPAM. Some of the come-ons are getting more sophisticated. Here's one I got today:

How Secure is your Wi-Fi ?

Sender: anonymity@privacyissafety.com
(I changed the URL so I'm not advertising a company I know nothing of, but it was similarly buzzword-cool.)

Does that sound suspicious to you? It sounds suspicious to me. At least, any website domain name composed of buzzwords bears investigation. Do not take them at face value.

Let's look at what it says:
Wondered if anyone could be sniffing on every bit of your oline activities and data? , yes its just started happening..The US senate just recently passed a bill (S.J.RES 34) which allows your ISP (Internet Service Provider) to sell your web browsing histories and geolocation data to advertisers and partner companies. Yes, you read that right, your privacy is up for sale with the support of the US Senate.

Not Cool right?

Yes so uncool!
If you are a native English speaker, you may get a feeling as if someone who doesn't know how to be cool is trying to sound cool. But that is not the big problem.

What is SJ res 34?

Here is something a search brings back:
This doesn't make sense, but it appears that the collected Congresscritters have indeed sold your Internet-related private information down the river. It doesn't make sense, but, if it hasn't really happened yet, it's going to happen. When you tell people to make your laws and then take the leash off them, they will make all sorts of laws that help keep them and their friends in power. Expecting them to refrain from doing so makes even less sense than the things they do.

(This is the real danger in allowing NSA to do what they do. If they can do it, why can't Verizon? Especially since it's Verizon and the other ISPs who do the actual eavesdropping? And Google. Who can disobey King Money?)

I'll have to do more research on SJ resolution 34, but, in the meantime, let's look some more at this unrequested advertisement.

We cant let them do this, so here is how to easily take charge of your privacy from the deep root.

If you want to encrypt the online activity on all your business or family’s devices, set up a VPN router. A VPN router encrypts internet traffic at the source by default—so you won’t have to remember to switch on your VPN each time you start a device.
It's an ad for a VPN router.

If you think you are interested, you have to understand. Your provider can't send a request out for you without knowing what request it is sending out and where it is going. The problem being referred to here is how long they can keep the logs of your requests, and whether they can sell them.

VPN doesn't change that. Even if the VPN company proxies all your requests, the only thing changed is that you are now trusting two companies -- your provider and your VPN provider. And you have no promise, no reason to believe that they will keep each other in check.

VPN is a legitimate service, even if the ads are borderline and unrequested. But you really would prefer to have someone you know personally run your VPN.

Without investigating the company who ostensibly sent this, I can't say they are either legitimate or fraudulent. And I don't particularly want to, because then I wouldn't have the marginal junkmail example that I want. For now, I can call this a
marginal junkmail
And that is the thing I want to rant about today.

Jon Postel was a wonderful guy. He was also an idealist.

He set up a relatively clean foundation for the Internet we have, in which all participants could be equal.

The problem with equality is that we don't all behave as if we are equal.

Apple (the computer company) has to have their cachet. So does Microsoft.


Because they don't know how to be equal.

Google used to know how to be equal, but that only lasted until they had to start turning a profit. (Hey, that's what happened to Microsoft and Apple, too, isn't it?)

If you want to play as an equal with the "big boys", you have to learn how to play by their rules. That means you have to be able to read advertisements and figure out whether they are legitimate or not, and then start figuring out whether what they are advertising is something you need. And you have to have money and technical expertise to set up your own infrastructure. And you have the time, or you have to be able to hire someone who can do it for you.

In the fields where you can do that, you can be okay as a peer. In the fields where you can't, you are not so equal.

The egalitarian Internet has to be there.

But we do need closed networks, as well.

The question I want to try to make obvious here is this --

Who do you trust to build your closed networks?

And the answer to that actually lies in something Jon Postel and his buddies tried to build into the Internet, something that the current crop of providers are trying to dismantle:


Distributed networks, distributed control.

Distributed control means that you control your part

And you can't control your part when you don't have some sort of understanding how it works. Nor can you when you can't control the hardware that runs your part of it. Nor when you can't control how it connects with the rest.

Which is really the reason for this blog.

(And the rant I posted several days back about the modern Danegeld.)

Friday, April 28, 2017

LSB 1st vs. MSB 1st -- Ignore Gulliver

I was working on a programming problem for a novel I'm trying to write, and got this bug under my skin again.

Something like fifty years ago, an argument raged among computer engineers over the order in which numbers should be stored in computer memory.

During the arguments, some (mostly among the least-significant-first camp, IIRC) pointed out the Lilliputian argument between the "Little-Endians" and "Big-Endians" in Gulliver's Travels. The least-significant-first camp claimed the position of Little-Endian-ness, which left the most-significant-first camp associated with High Church once the allegory was commonly accepted.

In Gulliver's Travels, the arguments between Lilliput and Blefuscu, including the endianness argument, are depicted by Swift as mere matters of fashion.

Most of the least-significant-first camp took the same approach: In memory, endianness doesn't matter.

This was a bit of duplicitous implicit poisoning of the well, similar to the habit Intel salescrew had a decade or two later of claiming that Turing complete CPUs were all equivalent, therefore we should all buy their self-proclaimed most popular CPU -- false analogy and a lot of logical steps being skipped among other things.

To summarize the question of byte order, we need to take a general look at data in computer memory. Computer memory is organized as an array of sequentially addressable elements, which implies an ordering to the contents of memory:

Example 1, no number.
00010203040506070809 10111213141516171819 202122
Data in memor y. 0 0 0 0 0 0 0 0

Let's put the number 123456 (one hundred twenty three thousand four hundred fifty-six) encoded as text after the description:

Example 2, number 123456 as text.
00010203040506070809 10111213141516171819 202122
Data in memor y: 123456 0

Note that text is naturally recorded most significant digit first in English.

(Thousands group separators just get in the way in computers, so I just left the comma out.)

If we wrote 123456 textually least significant digit first, it would look like this:

Example 3, number 123456
as least significant digit first text.
00010203040506070809 10111213141516171819 202122
Data in memor y: 654321 0

You may be wondering why someone would write numbers backwards, but there are actually language/numeric contexts in which least significant digit first is the common order. (They may be useful to refer to later.) Even in English, we have contexts like dates where the order is not most significant first:
  • September 17, 1787 (mixed order) and
  • 17 September 1787 (least significant first)

So we know that it is not completely unthinkable to do such a thing.

Now, text is actually encoded in computer memory as strings of numeric codes. Let's look at the data in the second example, reading it as hexadecimal numbers that represent the characters of the text instead of interpreting it as text:

Example 2, view 2, (ASCI/Unicode UTF-8),
raw contents displayed in hexadecimal.
00010203040506070809 10111213141516171819 202122
4461746120 696E20 6D656D6F72 793A20 313233 343536 00

That's interesting, isn't it?


Okay, let's leave everything but the number interpreted as text:

Example 2, view 3,
numeric text "123456" uninterpreted and shown in hexadecimal.
00010203040506070809 10111213141516171819 202122
Data in memor y: 313233 343536 00

Now, we haven't actually been changing what is in memory in Example 2. We're just changing how we look at it. We are trying to get a sense of what is actually stored in memory. (If you have a decent OS, you have command line tools like hexdump that allow you to look at files this way. You should try it some time.)

So, now let's try changing the form of the number. Instead of putting it in as text, let's put it in as a number -- an integer. (It's convenient that the address where it will go is 16, for something we call alignment, but we won't really talk about that just yet.)

First, we need to rewrite 123456 (one hundred twenty-three thousand four hundred fifty-six) as a hexadecimal number:
123456 ÷ 164 = 1 rem 57920
57920 ÷ 163 = 14 (E16) rem 576
576 ÷ 162 = 2 rem 64
64 ÷ 161 = 4 rem 0
123456 == 1E24016
Two hexadecimal digits take one byte in memory on a computer with an 8 bit byte.

(Numbers up to 4294967295 (FFFFFFFF16) can be stored in four bytes on computers with an 8 bit byte.)

Let's look at 123456 (1E24016) stored at location 16, most significant byte first:

Example 4 MSB,
123456 (1E24016) directly in memory, MSB first.
00010203040506070809 10111213141516171819 202122
Data in memor y: 0001E2 4000 0

Now let's look at the same number, stored least significant byte first:

Example 4 LSB,
123456 (1E24016) directly in memory, LSB first.
00010203040506070809 10111213141516171819 202122
Data in memor y: 40E201 0000 0

For a CPU that is MSB first, it will always store and read MSB first (as in example 3), so there's no problem.

And an LSB first CPU will always store and read LSB first, so, again, no problem.

The CPU is built to do it one way or the other, and it will always do it the way it's built, so there's no problem here.

That's the essence of the argument.

It's no longer true, and it really was never true. All bus masters have to agree on how they store numbers in a particular chunk of data or the numbers get turned upside down. (Or in the case of mixed mode integers, inside out and upside down, etc.)

Back then, however, CPUs did not usually have the ability to switch byte order without a bit of work. And alternate bus masters were not as common as now, and usually tended to be built specifically for the CPU.

These days, with intelligent I/O devices, alternate bus masters are rather common. (Graphics co-processors, network interfaces, disk drive interfaces, etc.) If one is going to be a bad boy and do byte order backwards from the rest, unless you isolate the bad boy somehow, things tend to fall apart.

But even the ability to switch byte order does not materially change the arguments.

On a CPU that can switch byte order natively, byte order becomes just another property of the integer stored in memory, which the programmer must keep track of, along with the address, size, presence of sign, etc. As long as the software and hardware respect the properties of the integer in memory, there is no problem.

Well, no problem in isolation.

But there is one virtual bus master that tends, in most of the world, to be most significant first when looking at numbers -- the human who might debug the program by looking at the raw contents of memory without access to the detail of the compiled program.

No number exists in isolation.

There it is, the fatal assumption of the argument:
... in isolation ...
Nothing in this world exists in isolation.

Why am I going around in circles on this subject?

In modern hardware, we have multiple CPUs and other devices on the computer bus.

Even in the past, the programmer often had to look at what was in memory in order to tell what the program was doing. He was, in effect, another CPU on the bus, as I said above.

Before we take a look at the reasons not to use least significant first byte order, let's look at the primary argument in favor: It theoretically speeds up some hardware processes and made the 8080 and 6502 (among other CPUs) cheaper to produce.

To figure out why, when you perform math on numbers, you start at the least significant end. Let's do a subtraction of two moderately large numbers:

 - 98765
You started with the column on the right,
6 - 5 = 1

CPUs have to point at what they work on, and the idea is that, if they are pointing at the number already, it's handy to be pointing at the first byte to do the math on.

It sounds reasonable, now that you think of it, right?

There are some other issues, like aligning the number before you start, which also appear to have some advantage when the small end is what you point at.

Sounds like maybe the Little-Endian engineers know what they are onto?.

Oh, dear. Maybe the Big-Endians should just shut up.

Well, let's put those arguments aside for a moment and talk about what the human who is trying to debug a program is going to see when he or she looks at a number stored least significant byte first. I'm pretty sure I can show you some problems with the Little-Endian attitude here.

Simple tools are the ones that are usually available. We'll make use of hexdump. If you are working on a Microsoft Windows workstation, you can install Cygwin to get Unix tools, and Cygwin can give you access to hexdump and the gnu C compiler, gcc, and gforth (and lots of other good stuff like bc).

We'll also make use of a little programming in C:

/* Program to demonstrate the effect of LSB1st vs. MSB1st integers
// by Joel Matthew Rees, Amagasaki, Japan
// Copyright 2017 Joel Matthew Rees
// All rights reserved.
// Permission granted to use for personal purposes.
// See http://defining-computers.blogspot.com/2017/04/lsb-1st-vs-msb-1st-ignore-gulliver.html
// Can be downloaded here:
// https://osdn.net/users/reiisi/pastebin/5027

#include <stdio.h>
#include <stdlib.h>


#include <limits.h>
#  define byteWidth ( (size_t) CHAR_BIT )
#  define byteMask ( (unsigned long) (unsigned char) ( (unsigned long) -1 ) )
#  define ulBytes ( sizeof (unsigned long) ) /* a run-time constant */
unsigned byteWidth = 8; /* Not depending on limits.h . */
unsigned long byteMask = 0xFF;
unsigned ulBytes = 4; /* Sane throw-away initial values. */

void setULbytes( void )
{  int i = 0;
   unsigned char chroller = 1;
   unsigned char chMask = 1;
   unsigned long ulroller = 1;
   while ( chroller != 0 )
   {  chroller <<= 1;
      chMask = ( chMask << 1 ) | 1;
   byteMask = chMask;
   byteWidth = i;
   i = 0;
   while ( ulroller != 0 )
   {  ulroller <<= 1;
   ulBytes = i / byteWidth;

int putLSB( unsigned long ivalue, int early )
{  int i = 0;
   {  putchar( ivalue & byteMask );
      ivalue >>= 8;
   } while ( ( i < ulBytes ) && !( early && ( ivalue == 0 ) ) );
   return i;

int putMSB( unsigned long ivalue, int early )
{  int i = 0;
   {  putchar( ( ivalue >> ( ( ulBytes - 1 ) * byteWidth ) ) & byteMask );
      ivalue <<= byteWidth;
   } while ( ( i < ulBytes ) && !( early && ( ivalue == 0 ) ) );
   return i;

void fillch( int count, char ch )
{  while ( count-- > 0 )
   {  putchar( ch );

int printInteger( unsigned long ivalue, unsigned base )
{  char buffer[ 65 ];
   char * cpt = buffer + 65;
   * --cpt = '\0';
   if ( base > 36 )
   { base = 10;
   {  int ch = ivalue % base;
      ivalue /= base;
      ch += '0';
      if ( ch > '9' )
      {  ch += 'A' - '9' - 1;
      * --cpt = ch;
   } while ( ivalue > 0 );
   fputs( cpt, stdout );
   return 64 - ( cpt - buffer );

int main( int argc, char *argv[] )
   unsigned long my_integer = 123456;
   int index = 1;
   int length;

   if ( argc > 1 )
   {  char * endpt = argv[ 1 ];
      my_integer = strtoul( argv[ 1 ], &endpt, 0 );
      if ( endpt > argv[ 1 ] )
      {  ++index;
      {  my_integer = 123456;

   printf( "Data in memory: " );
   length = printInteger( my_integer, 10 );
   fillch( 32 - length, '\0' );
   length = printInteger( my_integer, 16 );
   fillch( 32 - length, '\0' );

   printf( "LSB1st early:   " );
   length = putLSB( my_integer, 1 );
   fillch( 16 - length, '-' );

   printf( "LSB1st full:    " );
   length = putLSB( my_integer, 0 );
   fillch( 16 - length, '-' );

   printf( "MSB1st early:   " );
   length = putMSB( my_integer, 1 );
   fillch( 16 - length, '-' );

   printf( "MSB1st full:    " );
   length = putMSB( my_integer, 0 );
   fillch( 16 - length, '-' );
   putchar( '\n' );

   return EXIT_SUCCESS;


This can be downloaded at
A previous version at

will eventually be taken off line.


Compile it with the usual
cc -Wall -o lsbmsb lsbmsb.c
and run it with something like
  • ./lsbmsb | hexdump -C
  • ./lsbmsb 1234567890 | hexdump -C
  • ./lsbmsb 0x12345 | hexdump -C
  • ./lsbmsb 0x12345 | hexdump # look at it two-byte.
  • ./lsbmsb $(( 123456 * 256 )) | hexdump -C
  • # etc.
Be sure to leave the -C off a few times, to see what happens when it tries to interpret memory as a series of sixteen bit words instead of a series of eight bit bytes.


me@mycomputer:~/work/mathgames/eco101$ ./lsbmsb | hexdump -C
00000000  44 61 74 61 20 69 6e 20  6d 65 6d 6f 72 79 3a 20  |Data in memory: |
00000010  31 32 33 34 35 36 00 00  00 00 00 00 00 00 00 00  |123456..........|
00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000030  31 45 32 34 30 00 00 00  00 00 00 00 00 00 00 00  |1E240...........|
00000040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000050  4c 53 42 31 73 74 20 65  61 72 6c 79 3a 20 20 20  |LSB1st early:   |
00000060  40 e2 01 2d 2d 2d 2d 2d  2d 2d 2d 2d 2d 2d 2d 2d  |@..-------------|
00000070  4c 53 42 31 73 74 20 66  75 6c 6c 3a 20 20 20 20  |LSB1st full:    |
00000080  40 e2 01 00 00 00 00 00  2d 2d 2d 2d 2d 2d 2d 2d  |@.......--------|
00000090  4d 53 42 31 73 74 20 65  61 72 6c 79 3a 20 20 20  |MSB1st early:   |
000000a0  00 00 00 00 00 01 e2 40  2d 2d 2d 2d 2d 2d 2d 2d  |.......@--------|
000000b0  4d 53 42 31 73 74 20 66  75 6c 6c 3a 20 20 20 20  |MSB1st full:    |
000000c0  00 00 00 00 00 01 e2 40  2d 2d 2d 2d 2d 2d 2d 2d  |.......@--------|
000000d0  0a                                                |.|
me@mycomputer:~/work/mathgames/eco101$ ./lsbmsb | hexdump
0000000 6144 6174 6920 206e 656d 6f6d 7972 203a
0000010 3231 3433 3635 0000 0000 0000 0000 0000
0000020 0000 0000 0000 0000 0000 0000 0000 0000
0000030 4531 3432 0030 0000 0000 0000 0000 0000
0000040 0000 0000 0000 0000 0000 0000 0000 0000
0000050 534c 3142 7473 6520 7261 796c 203a 2020
0000060 e240 2d01 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d
0000070 534c 3142 7473 6620 6c75 3a6c 2020 2020
0000080 e240 0001 0000 0000 2d2d 2d2d 2d2d 2d2d
0000090 534d 3142 7473 6520 7261 796c 203a 2020
00000a0 0000 0000 0100 40e2 2d2d 2d2d 2d2d 2d2d
00000b0 534d 3142 7473 6620 6c75 3a6c 2020 2020
00000c0 0000 0000 0100 40e2 2d2d 2d2d 2d2d 2d2d
00000d0 000a                                  
me@mycomputer:~/work/mathgames/eco101$ ./lsbmsb 0x1E24000 | hexdump -C
00000000  44 61 74 61 20 69 6e 20  6d 65 6d 6f 72 79 3a 20  |Data in memory: |
00000010  33 31 36 30 34 37 33 36  00 00 00 00 00 00 00 00  |31604736........|
00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000030  31 45 32 34 30 30 30 00  00 00 00 00 00 00 00 00  |1E24000.........|
00000040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000050  4c 53 42 31 73 74 20 65  61 72 6c 79 3a 20 20 20  |LSB1st early:   |
00000060  00 40 e2 01 2d 2d 2d 2d  2d 2d 2d 2d 2d 2d 2d 2d  |.@..------------|
00000070  4c 53 42 31 73 74 20 66  75 6c 6c 3a 20 20 20 20  |LSB1st full:    |
00000080  00 40 e2 01 00 00 00 00  2d 2d 2d 2d 2d 2d 2d 2d  |.@......--------|
00000090  4d 53 42 31 73 74 20 65  61 72 6c 79 3a 20 20 20  |MSB1st early:   |
000000a0  00 00 00 00 01 e2 40 2d  2d 2d 2d 2d 2d 2d 2d 2d  |......@---------|
000000b0  4d 53 42 31 73 74 20 66  75 6c 6c 3a 20 20 20 20  |MSB1st full:    |
000000c0  00 00 00 00 01 e2 40 00  2d 2d 2d 2d 2d 2d 2d 2d  |......@.--------|
000000d0  0a                                                |.|
me@mycomputer:~/work/mathgames/eco101$ ./lsbmsb 0x1E24000 | hexdump
0000000 6144 6174 6920 206e 656d 6f6d 7972 203a
0000010 3133 3036 3734 3633 0000 0000 0000 0000
0000020 0000 0000 0000 0000 0000 0000 0000 0000
0000030 4531 3432 3030 0030 0000 0000 0000 0000
0000040 0000 0000 0000 0000 0000 0000 0000 0000
0000050 534c 3142 7473 6520 7261 796c 203a 2020
0000060 4000 01e2 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d
0000070 534c 3142 7473 6620 6c75 3a6c 2020 2020
0000080 4000 01e2 0000 0000 2d2d 2d2d 2d2d 2d2d
0000090 534d 3142 7473 6520 7261 796c 203a 2020
00000a0 0000 0000 e201 2d40 2d2d 2d2d 2d2d 2d2d
00000b0 534d 3142 7473 6620 6c75 3a6c 2020 2020
00000c0 0000 0000 e201 0040 2d2d 2d2d 2d2d 2d2d
00000d0 000a                                  

Now you may be saying you'd rather not be looking at any of that, but if you really had to, if you had no choice but to look at one or the other, which would you rather look at? LSB1st or MSB1st? Remember, the numbers you are looking for will usually be mixed with text, and the text will likely help you find what you are looking for. If the text gets byte-swapped on you, it's going to be just that much harder.

The salesman says he has tools to let you look at the data, so you don't have to worry. That's all well and good, but it makes you dependent on the vendor, even when the vendor has time and budget to help you.

When the vendor doesn't have time or budget, wouldn't rather be able to use simple tools, at any rate? -- as a start before you set to making your own tools?

Somebody usually pipes up with, "Well, if you guys would all join us Little-Endians, if everybody did it all the same, there'd be no problems!"

So. From now on, everyone does Little-Endian. Blogs? News aggregators? Textbooks? Novels? Are we going to go back and reprint all the classics with Little-Endian numbers?
  • 71 September 7871?
No, of course not. Much easier to just become dependent on our vendor. I mean, we trust them, right? And they deserve a guaranteed revenue stream, too.

Somebody pipes up about now saying everything I'm talking about is human stuff, not technical at all.

The Unicode Consortium determined that they did not want to be caught up in the argument. So they decided that Unicode characters could be encoded either direction. They even figured out how to put a flag called the Byte Order Mark at the beginning of a stream of Unicode text, to warn the consumer of the stream what order to expect the characters in.

Characters, you see, are not integers after all, contrary to the opinions of many a respected computer scientist. Little-Endian byte order enforces this factoid.

Well, the folks who defined the Internet decided they did not want to be reading data streams and crossing their eyes to read the IP addresses and other numeric data buried in the stream. So network byte order is the one that is easy to read, most significant first. If one hundred twenty-three thousand four hundred fifty-six is in the data stream, it shows up as 123456, not 654321.

In setting up the demonstrations of byte order differences, I went to some pain to demonstrate one big difference between the byte orders. If you are looking carefully at the dashes, you may see how least significant first allows you to optimize math. If you can track the presence of carries, you can stop adding small numbers to big ones as soon as the carries disappear.

Looks interesting, right?

Tracking the carries takes more processor time and resources than just simply finish the addition out. This is one of those false early optimizations that has historically killed a lot of software projects.

Worse, the programmer can look at one of these and think a particular case will never generate carries. This is almost always self-deception. The assumptions required to keep the carries from happening are almost always not valid in the end-user's context just often enough to cause hidden bugs of the integer overflow variety.

Isn't that strongly enough stated?

When we humans look at numbers, we perceive them as text. That allows us to do many things without consciously thinking of them, like move to the end or align them. CPUs have to do the same things with numberical text, as we can intuit by looking back at example 2.
When CPUs work with numbers, they have to figure out all sorts of things about the number which we subconsciously read from the text --
Is there a sign? 
Is there a decimal point?
How big is the number?
If there is no text, they have no clue ... unless the programmer has already told them.

Here is perhaps the strongest argument against least significant first: It induces bugs into software.

Some people think it's a good thing to induce bugs into software. They think it guarantees their after-market revenue stream.

I think there will always be enough bugs without practicing dodgey false optimizations, but what do I know? I've wasted two days I didn't have tilting at this, erm, rainbow. (Or chasing this windmill, maybe?)

One of these days I'll get someone to pay me to design a language that combines the best of Forth and C. Then I'll be able to leap wide instruction sets with a single #ifdef, run faster than a speeding infinite loop with a #define, and stop all integer size bugs with a bare cast. And the first processor it will target will be a 32/64 bit version of the 6809 which will not be least significant bit first.

Friday, April 21, 2017

The Problem with Full Unicode Domain Names -- apple.com vs. appІe.com

Well, this one is taking longer to boil over than I expected. I've been watching for the storm for over fifteen years, and convoluted fixes on fixes have dodged the bullet this long.

[JMR201704221101: addendum]

I should note that the primary danger comes from clicking links given you by untrusted sources. The best solution here is not to do that. Abstain. Don't click on the links.

Copy them out, look at them in a text editor using a technical font that shows differences between I, 1, and l, and between 0 and O, etc.

Plug the URL into the search field of a web search engine -- Not into the URL bar of your browser, that takes you straight there. Let the search engine tell you what it knows about the site before you go there.

Then type in the domain name part by hand. If you have the URL

the domain name part is
(There's more that can be said, but I don't want to confuse you about controlling domains, so just type the whole domain name.)

If that's too much trouble, maybe you didn't want to go there anyway. But at least click on something the search engine shows you instead of the link in the e-mail.

[JMR201704221101: end addendum]

The problem:

Depending on your default fonts, you may be able to see a difference between the following two domain names:

apple.com vs. appІe.com
It's similar to the problem with
apple.com vs. appIe.com
but with a twist. The first one uses a Cyrillic (as in Russian) character to potentially cause confusion, where the second one keeps the trickiness all in the Latin (as in English) alphabet.

Let's look at both of those again, and I'll try to specify a font where there will be problems. First, we'll try the Ariel font (if it's on your computer):
apple.com vs. appІe.com
(Latin little "l" -- Cyrillic capital "І")
and next the Courier font (if it's on your computer):
apple.com vs. appІe.com
(Latin little "l" -- Cyrillic capital "І")
And we'll look at the Latin-only domain names, first in Arial:
apple.com vs. appIe.com
(Latin little "l" -- Latin capital "I")
and then in Courier:
apple.com vs. appIe.com
(Latin little "l" -- Latin capital "I")

Do you see what's happening?

Someone could grab the domain with the visual spoof and trick you into giving them your Apple login and password and maybe even your credit card number.

When domain names were all lower case Latin, we had fewer problems. In other words,
was properly spelled
and the browser would display it in the latter form.

Well, there was still the problem with
substituting the number "1" for the little "l". But the registrars tended to try to help by refusing to register confusing domain names. And browsers were careful to use fonts that would show the differences in the URL bar.

Some time ago, pretty much all Unicode language scripts became allowed in domain names. This was strongly pushed by China, where they did not want {sarcasm-alert} to have all their loyal subjects surfing the Internet in Latin. That would let everyone see how superior English is, and that would never do.{end sarcasm-alert.}

(I shouldn't be sarcastic. They do need Chinese URLs. Otherwise, there would be too many companies competing for bai.com and ma.com.)

Apparently, non-Latin scripts seem to be even allowed to use capitals. Or, at least, unscrupulous or careless registrars seem to be allowing them in some cases. I'm not sure why.

(Here's the RFC. What am I missing?)

If the Cyrillic visual spoof I am using as an example were coerced to lower case in the URL bar, here's what it would look like in the Ariel font:
apple.com vs. appіe.com
(Latin little "l" -- Cyrillic lower case "і"
That would solve a lot of problems.

If you are worried about this, one thing that can help if you are using Firefox, type
in the URL bar. (That's where URLs like
show up, and you can type them in by hand to go there.)

You'll get a warning that tells you that the Mozilla Foundation is not going to take the blame if you use non-default settings. (They won't anyway, but don't check the box that says you don't want to be warned. And remember that you have done this.)

Use the search bar to search for
and you'll find
Double-click the "false" and it will turn to "true". And then URLs like
will be displayed in the status bar as URLs like
Now, that's ugly, don't you think? Anyway, you won't be mistaking it for
(This is called punycode. Hmm. Actually, the Japanese page on punycode shows what's happening a little better than the English page.)

Then again, you will be wondering what that URL means. So I don't really know if I want to recommend it.

If I were a Mozilla developer involved with this, I would take a clue from what I've done above and do it like this:
www.apple.com (all Latin)
www.appІe.com (Cyrillic "І")
In other words, all the characters in URLs from languages other than the browser's default language would be displayed with colored backgrounds to make them stand out. And I might even add a warning bubble or something that said,
Warning! Mixed language URL contains Cyrillic "І"!
floating over the URL. This approach would mitigate a lot, including
  • Іds.org (Cyrillic)
  • аррӏе.com (Cyrillic)
  • perl.org (zenkaku, or full-width)
and so forth.

(I thought this was in the RFCs, but I'm not seeing it. Maybe I'm remembering my own thoughts on how to mitigate this particular semantic attack.)

I have advocated improving Unicode by reconstructing the encoding and including an international character set where such visual doublings could be eliminated. And separating Chinese and Japanese language encodings, and the three different Chinese encodings from each other, as well.

Nobody seems to like the idea.

It's a lot of work.

I'd be willing to do it relatively cheap! (Relatively.)

Model Boot-up Process Description, with Some References to Logging

This is a description of a model boot-up process 
for a device that contains a CPU,
with Some References to Logging.

(This is a low-level follow-up to theses posts:
which may provide more useful information.)

This is just a rough model, a rough ideal, not a specification. Real devices will tend to vary from this model. It's just presented as a framework for discussion, and possibly as a model to refer to when documenting real hardware.

(1) Simple ALU/CPU test.

The first thing the CPU should do on restart is check the Arithmetic-Logic Unit, not in the grand sense, but in a limited sense.

Something like (assuming an 8 bit binary ALU) adding 165 to 90 and checking that the result comes out 255 (A5sixteen + 5Asixteen == FFsixteen), and then adding 1 to the result to see if the result is 0 with a carry, would be a good, quick check. This would be roughly equivalent to trying to remember what day it is when you wake up, then checking to see that you remember what the day before and the day after are.

It doesn't tell you much, but it at least tells you that your brains are trying to work.

* If the ALU appears to give the wrong result, there likely won't be much that can be done -- maybe set a diagnostic flag and halt safely.

* In some devices, halting itself is not safe, and an alternative to simply halting such as having the device securely self-destruct may be safer. Halting safely may have non-obvious meanings.

Now, it's very likely that this test can be made a part of the next step, but we need to be conscious of it.

(2) Initial boot ROM test.

There should be an initial boot ROM that brings the CPU up. The size should be in the range of 1,000 instructions to 32,768 instructions.

Ideally, I would strongly suggest that it contain a bare-metal Forth interpreter as a debugger/monitor, but it may contain some other kind of debug/monitor. It may just contain a collection of simple Basic Input-Output library functions, but I personally do not recommend that. It needs to have some ability to interact with a technician.

And, of course, it contains the machine instructions to carry out the first several steps of the boot-up process.

This second step would then be to perform a simple, non-cryptographic checksum of the initial boot ROM.

Which means that the ROM contains its own test routines. This is clearly an example of chicken-and-egg logical circularity. It is therefore not very meaningful.

This is not the time for cryptographic checksums.

* Success does not mean that the CPU is secure or safe. Failure, on the other hand, gives us another opportunity to set a diagnostic flag for a technician to find, and halt safely, whatever halting safely means.

On modern, highly integrated CPUs, this ROM is a part of the physical CPU package. It should not be re-programmable in the field.

(That's one reason it should be small -- making it small helps reduce the chance for serious bugs that can't be fixed. This smallest part of the boot process cannot be safely re-written and cannot safely be allowed to be overridden.)

For all that it should not be re-programmable in the field, the source should be available to the end-administrator, and there should be some means of verifying that the executable object in the initial boot ROM matches the source that the vendor says should be there.

(3) Internal RAM check.

Most modern CPUs will have some package internal RAM, distinct from CPU registers. It is a good idea to check these RAM locations at this point, to see that what is written out can be read back, using bit patterns that can catch short and open circuits in the RAM.

Just enough RAM should be tested to see that the initial boot-up ROM routines can run safely. If the debug/monitor is a Forth interpreter, it should have enough return stack for at least 8 levels of nested call, 16 native integers on the parameter stack, and 8 native integers of per-user variable space. That's 32 cells of RAM, or room for 32 full address words, in non-Forth terminology.

(I'm speaking roughly, more complex integrated packages will need more than that, much more in some cases. Very simple devices might actually need only half that. The engineers should be able to determine actual numbers from their spec. If they can't, they should raise a management diagnostic flag and put the project in a wait state.)

* Again, if there are errors, there is not much we can do but set a diagnostic flag and do its best to halt safely, whatever halting safely means.

(4) Lowest level diagnostic firmware.

At this point, we can be moderately confident that the debug/monitor can safely be entered, so it should be entered and initialize itself.

The next several steps should run under the control of the debug/monitor.

* Again, if the debug/monitor fails to came up in a stable state, the device should set a diagnostic flag and halt itself as safely as possible.

** This means that the debug/monitor needs a resident watchdog cycle that will operate at this level.

(5) First test/diagnostic device.

We want a low-level serial I/O (port) device of high reliability, through which the technician can read error messages and interact with the debug/monitor.

(Parallel port could work, but it would usually be a waste of I/O pins for no real gain.)

* This is the last point where we want to just set a diagnostic flag and halt as safely as possible on error. Any dangerous side-effects of having started the debug port should be addressed before halting safely at this stage.

(6) Full test of CPU internal devices.

This step can be performed somewhat in parallel with the next step. Details are determined by the internal devices and the interface devices. Conceptually, however, this is a separate step.

All internal registers should be tested to the extent that it is safe to test them without starting external devices. This includes being able to write and read any segment base and limit/extent registers, but not does not actually include testing their functionality.

If the CPU provides automatic testing, this is probably the stage where it should be performed (which may require suspending or shutting down, then restarting the monitor/debug processes).

Watchdog timers should be checked to the extent possible and started during this step.

If there is internal low-level ROM that remains to be tested, or if management requires cryptographic checksum checks on the initial boot ROM, this is the stage to do those.

Note that the keys used here are not, repeat, not the manufacturer's update keys. Those are separate.

However, for all that management might require cryptographic self-checks at this stage, engineers should consider such checks to be exercising the CPU and looking for broken hardware, and not related to security. There should be a manufacturer's boot key, and the checksums should be performed with the manufacturer's boot key, since the initial boot ROM is the manufacturer's code.

How to hide the manufacturer's boot key should be specified in the design, but, if the test port enabled in step (5) allows technician input at this step, such efforts to hide the manufacturer's key can't really prevent attack, only discourage attack.

Even if the device has a proper system/user separation, the device is in system state right now, and the key has to be readable to be used.

The key could be encrypted and hidden, spread out in odd corners of the ROM. There could be two routines to read it, and the one generally accessible through the test port could be protected by security switch/strap and/or extra password. But the supervisor, by definition, allows the contents of ROM to be read and dumped through the test port at this stage. A determined engineer would be able to analyze the code and find the internal routine, and jump to it. Therefore, this raises the bar, but does not prevent access.

Another approach to raising the bar is the provision of a boundary between system/supervisor mode and key-access mode. The supervisor could use hardware to protect the key except when in key-access mode, and could use software to shut down the test port when key-access mode is entered. This would make it much more difficult to get access to supervisor commands while the key is readable, but there are probably going to be errors in the construction that allow some windows of opportunity. It is not guaranteed that every design will be able to close off all windows of opportunity.

Such efforts to protect the boot key may be useful. They do raise the bar. But they do not really protect the boot key, only discourage access.

And legal proscriptions such as that epitome of legal irony called DMCA do not prevent people who ignore the law from getting over the bar.

Thus, the key used to checksum the initial boot ROM must not be assumed to be unknown to attackers. (And, really, we really don't need to assume it is unknown, if we don't believe in fairy tales about protecting intellectual property at a distance. As long as this initial boot ROM can't be re-written. As long as the update keys are separate.)

The extra ROM, if it exists, should not be loaded yet, only tested.

If extra RAM is required to do the checksums, the RAM should be checked first, enough to perform the checksums

All remaining internal RAM should be checked at this stage.

(7) Low-level I/O subsystems.

Finally, the CPU package is ready to check its own fundamental address decode, data and address buffers, and so forth. Not regular I/O devices, but the devices that give it access to low-level flash ROM, cache, working RAM, and the I/O space, in that order.

They should be powered up and given rudimentary tests.

Note that the flash ROM, cache, working RAM, and I/O devices themselves should not yet be powered up, much less tested.

Only the interfaces are powered up and tested at this step, and they must be powered up in a state that keeps the devices on the other side powered down.

* On errors here, any devices enabled to the point of error should be powered down in whatever order is safe (often in reverse order of power-up), diagnostic messages should be sent through the diagnostic port, and the device should set a diagnostic flag and enter as safe a wait state as possible.

** It may be desirable to enter a loop that repeats the diagnostic messages.

It would seem to be desirable to provide some way for a technician to interrogate the device for active diagnostic messages.

** But security will usually demand that input on the diagnostic port be shut down unless a protected hardware switch or strap for this function is set to the enabled position/state. This is one of several such security switch/straps, and the diagnostic message will reflect the straps state to some extent.

This kind of security switch or strap is not perfect protection, but it is often sufficient, and is usually better than nothing. (Better than nothing if all involved understand it is not perfect, anyway.)

** In some cases, the security switch/straps should not exist at all, and attempts to find or force them should be met with the device's self-destruction. In other cases, lock and key are sufficient. In yet other cases, such as in-home appliance controllers, a screw panel may be sufficient, and the desired level of protection.

Straps are generally preferred to switches, to discourage uninformed users from playing with them.

*** However, attempts to protect the device from access by the device's legal owner or lawfully designated system administrator should always be considered highly suspect, and require a much higher level of engineering safety assurance. If the owner/end-admin user must be prevented from maintenance access, it should be assumed that the device simply cannot be maintained -- thus, quite possibly should self-destruct on failure.

(8) Supervisor, extended ROM, internal parameter store.

The initial boot ROM may actually be the bottom of a larger boot ROM, or there may be a separate boot ROM containing more program functions, such as low-level supervisor functions, to be loaded and used during initial boot up. This additional ROM firmware, if it exists, should be constructed to extend, but not replace the functionality in the initial boot ROM.

This extra initial boot ROM was tested in step (6), it should be possible to begin loading and executing things from it now. It would contain the extensions in stepped modules, starting modules necessary to support the bootstrap process as it proceeds.

Considering the early (classic) Macintosh, a megabyte of ROM should be able to provide a significant level of GUI interface for the supervisor, giving end-admins with primarily visual orientation improved ability to handle low-level administration. But we don't have display output at this point, such functionality should be oriented toward the technician's serial port at this stage.

This supervisor would also contain the basic input/output functionality, so it could be called, really, a true "Basic Input/Output Operating System" -- BIOOS. But that would be confusing, so let's not do that. Let's just call it a supervisor.

It could also contain "advanced" hooks and virtual OS support such as a "hypervisor" has, but we won't give in to the temptation to hype it. It's just a supervisor. And most of it will not be running yet.

This remaining initial boot ROM is not an extension boot ROM such as I describe below, but considered part of the initial boot ROM.

There should be internal persistent store that is separate from the extension boot (flash) ROM, to keep track of boot parameters such as the owner's cryptographic keys and the manufacturer's update cryptographic keys for checksumming the extension flash ROM, passwords, high-level boot device specification, etc. It should all be write protected under normal operation. The part containing the true cryptographic keys for the device and such must be both read- and write-protected under normal operation, preferably requiring a security switch/strap to enable write access.

Techniques for protecting these keys have been partially discussed above. The difference is that these are the owner's keys and update keys, and those are the manufacturer's boot keys.

This parameter store should be tested and brought up at this point.

Details such as how to protect it, how to enable access, and what to do on errors are determined by the engineers' design specification.

In the extreme analysis, physical access to a device means that anything it contains can be read and used. The engineering problem is the question of what kinds of cryptological attacks are expected, and how much effort should be expended to defend the device from unauthorized access.

Sales literature and such should never attempt to hide this fact, only assert the level to which they are attempting to raise the bar.

Again, attempts to protect the device from access by the legitimate owner/end-admin should be considered detrimental to the security of the device.

* At this point, reading the owner's keys and update keys from the test port should be protected by security switch/strap and password. But, again, until the boot process has proceeded far enough to be able to switch between system and user mode, the protections have to be assumed to be imperfect.

Providing a key-access mode such as described above for the manufacturer's key should mitigate the dangers and raise the bar to something reasonable for some applications, but not for all.

Some existing applications really should never be produced and sold as products.

(As an example, consider the "portable digital purse" in many cell phones. That is an abomination. Separated from the cell phone, it might be workable, but only with specially designed integrated packages, and only if the bank always keeps a copy of the real data. Full discussion of that is well beyond the scope of this rant.)

(9) Private cache.

If there is private cache RAM local to the first boot CPU, separate from the internal RAM, it should be tested now. Or it could be schedule and set to run mostly in a lesser privileged mode after lesser privileged modes are available.

If there are segment base and limit/extent registers, their functionality may be testable against the local cache.

In particular, if the stack register(s) have segment base and limit, and can be pointed into cache, it might be possible to test them and initialize the stacks into such cache here, providing some early stack separation.

If dedicated stack caches are provided in the hardware, they should be tested here. If they can be used in locked mode (no spills, deep enough), the supervisor should switch to them now.

* Errors at this point will be treated similarly to errors in step 7.

(10) Exit low-level boot and enter intermediate level boot process.

At this point, all resources owned by the boot-up CPU should have been tested.

Also, at this point, much of the work can and should be done in less secure modes of operation. The less time spent in system/supervisor mode, the better.

(10.1) Testing other CPUs.

If there are multiple CPUs, this is the step where they should be tested. The approach to testing the CPUs depends on their design, whether they share initial boot ROMs or are under management of the initial boot CPUs, etc.

From a functional point of view, it is useful if the first boot CPU can check the initial boot ROMs of the other CPUs before powering them up, if those ROMs are not shared. It may also be useful for the first boot CPU to initiate internal test routines on the others, and monitor their states as they complete.

At any rate, as much as possible should be done in parallel here, but care should be exercised to avoid one CPU invalidating the results of another.

* Again, errors at this point will be treated similarly to errors in step 7.

(10.2) Testing shared memory management hardware access, if it exists.

While waiting for the other CPUs to come up, any true memory management hardware should be tested and partially initialized.

At this point, only writing and reading registers should be tested, and enough initialization to allow un-mapped access.

* Again, errors at this point will be treated similarly to errors in step 7. MMU is pretty much vital, if it exists.

(10.3) Finding and testing shared RAM.

Shared main RAM should be searched for before shared cache.

As other CPUs come up, they can be allocated to test shared main RAM. (Really, modern designs should go to multiple CPUs before going to larger address space or faster CPUs, any more.) If there are multiple CPUs, testing RAM should be delegated to CPUs other than the first boot CPU.

This also gets tangled up in testing MMU.

Tests should be run first without address translation, then spot-checked with address translation.

As soon as enough good RAM has been found to support the return address stack and local variable store (one stack in the common case now, but preferably two in the future, a thread heap and a process heap) the supervisor OS, to the extent it exists, should be started now if it has not already been started. (See next step.)

Otherwise, parallel checks on RAM should proceed without OS support.

Either way, the boot ROM should support checking RAM in the background as long as the device is operational. RAM which is currently allocated would be left alone, and RAM which is not currently allocated would have test patterns written to them and read, helping erase data that programs leave behind.

Such concurrent RAM testing would be provided in the supervisor in the initial boot up ROM, but should run in a privilege-reduced state (user mode instead of system/supervisor).

* Usually, errors in RAM can be treated by slowing physical banks down until they work without errors, or by mapping physical banks out. Again, a log of such errors must be kept, and any errors in RAM should initiate a RAM checking process that will continue in the background as long as the device is running.

** If there are too many errors at this point, they may be treated similarly to errors in step 7.

*** Any logs kept in local RAM should be transferred to main RAM once enough main RAM is available (and known good).

(10.4) Testing shared cache.

As other CPUs come up, they can also be allocating to testing shared cache. As with testing main RAM, testing cache should be delegated to CPUs other than the first boot CPU. Also, main RAM comes before cache until there is enough known-good RAM to properly support multiple supervisor processes.

And this also gets tangled up in testing MMU.

Tests should be run first without being assigned to RAM, then again with RAM assigned.

* If there are errors in the cache, it might be okay to disable or partially disable the cache. Engineers must make such decisions.

** Errors at this point errors may still be treated similarly to errors in step 7, depending on engineering decisions. If it is acceptable to run with limited cache, or without cache, some logging mechanism that details the availability of cache must be set up. Such logging would be temporarily kept in internal RAM.

*** The decision about when to enable cache is something of an engineering decision, but, in many cases, once cache is known to be functional, and main RAM has also been verified, the cache can be put into operation.

In some designs, caches should not be assigned to RAM that is still being tested.

(11) Fully operational supervisor.

At this point, most of the remaining functionality of the supervisor (other than GUI and other high-level I/O) should be made available. Multi-tasking and multi-processing would both be supported (started in the previous step), with process management and memory allocation.

One additional function may become available at this point -- extending the supervisor via ROM or flash ROM.

If there is an extension ROM, the initial boot ROM knows where it is. If it is supposed to exist, the checksum should be calculated and confirmed at this point.

The key to use depends on whether the extension has been provided by the manufacturer or the end-user/owner. Manufacturer's updates should be checked with the update key (not the boot key), and owner's extensions should be checked with the owner's key.

Failure would result in a state such as in step (7).

Testing the extension proceeds as follows:

There are at least two banks of flash ROM. In the two bank configuration, one is a shadow bank and the other is an operational bank.

If the checksum of the operational bank is the same as the unwritable extension ROM, the contents are compared. If they are different, the operational bank is not loaded, and the error is logged and potentially displayed on console.

If the checksum of the operational bank is different from the unwritable ROM, it is checked against the shadow bank. If the shadow bank and the operational bank have the same checksum, the contents of the two are compared. If the contents are different, the operational bank is not loaded and the error is logged and potentially displayed on console.

If the contents are identical, the cryptographic checksum is checked for validity. If it is not valid, the operational bank is not loaded, and the error is logged and potentially displayed on console.

* If the operational bank verifies, it is loaded and boot proceeds.

** If the operational bank fails to verify, a flag in the boot parameters determines whether to continue or to drop into to a maintenance mode.

If the device drops into a maintenance mode, the test port becomes active, and a request for admin password is sent out it. A flag is set, and boot proceeds in a safe mode, to bring up I/O devices safely.

(When the operational bank is updated, the checksum checked and verified, and committed, the operational bank is copied directly onto the shadow bank. But that discussion is not part of this rant.)

Other approaches can be taken to maintain a valid supervisor. For instance, two shadow copies can be kept to avoid having to restore the factory extensions and go through the update process again from scratch.

The extensions can override much of the initial boot ROM, but the monitor/debugger must never be overridden. It can be extended in some ways, but it must not be overridden.

There should be no way to write to this flash ROM except by setting another protected hardware switch or strap which physically controls the write-protect circuit for the flash. This switch or strap should not be the same as mentioned in step (7), but may be physically adjacent to it, depending on the engineers' assessment of threat.

*** The initial boot ROM should not proceed to the flash ROM extensions unless said switches or straps are unset.

(12) I/O devices.

(12.1) Locating and testing normal I/O device controllers.

As known good main RAM becomes available, the boot process can shift to locating the controllers for normal I/O devices such as network controllers, rotating disk controllers, flash RAM controllers, keyboards, printers, etc.

There may be some priority to be observed when testing normal I/O device controllers, as to which to initiate first.

It also may be possible to initiate controller self-tests or allocate another CPU to test the controllers, so that locating the controllers and testing them can be done somewhat it parallel.

Timers and other such hardware resources would be more fully enabled at this point.

* Errors for most controllers should be logged, and should not cause the processor to halt. 

(12.2) Identifying and testing devices.

As controllers become available and known good, the devices attached to them should be identified, initialized, and tested.

This might also occur in parallel with finding and testing other controllers.

* Errors for most devices should be logged, and should not cause the processor to halt. 

** Some intelligence about the form and number of logs taken at this point can and should be exercised. We don't want RAM filled with messages that, for example, the network is unavailable. One message showing when problems began, and a count of error events, with a record of the last error, should be sufficient for most such errors.

(12.3) Low-level boot logging.

As video output and persistent store become available, error events should be displayed on screen and recorded in an error message partition. Again, there should be a strategy to avoid filling the error message partition, and to allow as many error notifications as possible to remain on screen.

If the device is booting to maintenance mode, and an admin has not logged in via the test port at this point, the video device may present a console login prompt/window, as well.  Or it may present one for other reasons, such as a console request from the keyboard.

The video display could also have scrolling windows showing current system logs.

Also, parameter RAM flags may prevent console login to a local video device/keyboard pair, requiring admin login at the test port via some serial terminal device.

(12.4) High-level boot.

The supervisor would have hooks and APIs to present walled virtual run-time sessions to high-level OSses, including walled instances of itself and walled instances of Linux or BSD OSses, or Plan Nine, etc., to the extent the CPU can support such things, and to the extent the device is designed to support such things.

And parameter RAM would have flags to indicate whether a boot menu should be provided, or which high-level OSses available should boot.

If walled instances are not supported, only a single high level OS would be allowed to boot, and the supervisor would still map system calls from the high-level OS into device resources.

This is my idea of what should happen in the boot-up process. Unfortunately, most computers I am familiar with do a lot of other stuff and not enough of this.