Misunderstanding Computers

Why do we insist on seeing the computer as a magic box for controlling other people?
人はどうしてコンピュータを、人を制する魔法の箱として考えたいのですか?
Why do we want so much to control others when we won't control ourselves?
どうしてそれほど、自分を制しないのに、人をコントロールしたいのですか?

Computer memory is just fancy paper, CPUs are just fancy pens with fancy erasers, and the network is just a fancy backyard fence.
コンピュータの記憶というものはただ改良した紙ですし、CPU 何て特長ある筆に特殊の消しゴムがついたものにすぎないし、ネットワークそのものは裏庭の塀が少し拡大されたものぐらいです。

(original post/元の投稿 -- defining computers site/コンピュータを定義しようのサイト)

Wednesday, December 13, 2017

Facebook and LinkedIn are out of control again.

I suppose it's an inherent problem of letting the metaphorical postal roads all be run for profit.

Facebook Messenger will not boot on this ancient tablet. It gets stuck in a loop demanding that I let it see my telephone contact list. (But the tablet is not a phone.)

I suppose the proximate cause of the loop is the age of the OS. It's well beyond life support.

But there's the rub of things.

Why must we let important things like telephones be put at the mercy of the insane technical marketing update cycle?

It's this insane update cycle that is driving the human side of global warming.

Who still can't see this?

Now Facebook itself is no longer satisfied with just being Facebook. It has not shut down safely in more than a week.

And LinkedIn seems to be doing similar things, insisting, simply demanding that I turn my human network over to their engines of marketing.

Anyone willing to front me a cool USD hundred million or so to build a networking service that treats people like people and tech like tech?

[18 Dec. -- Latest updates, still can't get on the messenger app. (I can use an ancient Google Chrome browser to sort-of access it, sometimes. Irony on irony.)

Facebook app still freezes and dies regularly, sometimes taking the computer down with it.]

Saturday, October 7, 2017

Languages in a Common Character Code for Information Interchange

Having said a bit about why I want to re-invent Unicode (so to speak), I want to rant a little about the overall structure, relative to languages, that I propose for this Common Code for Information Interchange, as I am calling it.

I've talked a little about the goals, and the structure, in the past. Much of what I said there I still consider valid, but I want to take a different approach here, look from the outside in a bit.

First, I plan the encoding to be organized in an open-ended way, the primary reason being that language is always changing.

Second, there will be a small subset devoted primarily to the technical needs of encoding and parsing, which I will describe in more detail in a separate rant.

Third, there will be an international or interlocality context or subset, which will be relatively small, and will attempt to include enough of each current language for international business and trade. This will appear to be a subset of Unicode, but will not be a proper subset. I have not defined much of this, but I will describe what I can separately.

Parsing rules for this international subset will be as simple as possible, which means that they will depart, to some extent at least, from the rules of any particular local context.

Third, part two, there will be spans allocated for each locality within which context-local parsing and construction rules will operate.

Fourth, there will be room in each span for expansion, and rules to enable the expansion. Composition will be one such set of rules, and there will be room for dynamically allocating single code points for composed characters used in a document.

The methods of permanently allocating common composed characters should reflect the methods of temporary allocation.

Fifth, as much as possible, existing encodings will be included by offset. For instance, the JIS encoding will exist as a span starting at some multiple of 65536, which I have not yet determined, and the other "traditional" encodings will also have spans at offsets of some multiple of 65536. The rules for parsing will change for each local span.
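To make that concrete, here is a minimal sketch in C of how a parser might split a code point into a locality span and an offset within the span, assuming the multiple-of-65536 layout sketched above; the span numbers here are placeholders, since none have actually been assigned.

#include <stdio.h>

#define SPAN_SIZE 65536UL

enum span { SPAN_TECHNICAL = 0, SPAN_INTERNATIONAL = 1, SPAN_JIS = 16 };  /* placeholders */

static unsigned long span_of( unsigned long codepoint )
{  return codepoint / SPAN_SIZE;   /* which locality owns this code point */
}

static unsigned long offset_in_span( unsigned long codepoint )
{  return codepoint % SPAN_SIZE;   /* index within the locality's own table */
}

int main( void )
{  unsigned long cp = (unsigned long) SPAN_JIS * SPAN_SIZE + 0x3042;
   printf( "span %lu, offset %04lX\n", span_of( cp ), offset_in_span( cp ) );
   return 0;
}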

I've thought about giving Unicode a span, but am not currently convinced it is possible.

Of course, this means that the encoding is assumed to require more than will fit comfortably in four bytes after UTF-8 compression.

And thinking of UTF-8 brings me to the next rant.

A Common Code for Information Interchange

I've been thinking about this topic since I first heard of plans for what became Unicode, back in the mid 1980s.

At the time, there were many in the industry who still thought that 64K of RAM should be enough for general personal computing, and there were many people who thought 65,536 should be enough characters to cover at least all modern languages. I tried to tell as many people as I could that Japanese alone had more than 20,000 and Chinese had an estimated count in the range of a hundred thousand, and no one believed me. I also tried to tell people that we shouldn't conflate Japanese and Chinese characters, but I had a hard time convincing even myself of that.

I also tried to tell people that the Kanji radicals should be encoded first, but the Japanese standards organization wasn't doing that, so why should anyone believe it?

As I noted recently, some of the problems of the current approach to unifying the world's character sets are becoming obvious.

Each natural human language is its own context. And each language has its own set of sub-contexts which we call dialects. But neither the contexts nor sub-contexts nor the sets of sub-contexts are well-defined, mathematically speaking, which means that mathematical methods cannot be used to perfectly parse any natural language.

Therefore, programs which work with human languages are necessarily buggy. That is, we know that, no matter how carefully they are constructed, they will always contain errors.

When we combine language contexts, we combine error rates with error rates, and the result is at best multiplicative. It is not simply additive. So we really should not want to do that. But that's what Unicode tries to do -- combine character codes for all relevant languages into one over-arching set.

Actually, for all my pessimism, Unicode works about as well as we should expect it to. I just want something better, but it's hard to describe exactly what that something is. This rant is an attempt to do so.

With just US English, it's fairly easy to construct a text editor. Parsing the entered text requires around ten simple functions, and visual formatting less than ten more. Word processors are fairly straightforward, as well.
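To give a sense of what I mean, here is a rough sketch of the kind of per-character decisions a US-ASCII-only editor makes; the function names are mine, and a real editor would maintain a buffer rather than just echoing.

#include <stdio.h>

static int is_printable( int c ) { return c >= 0x20 && c < 0x7F; }
static int is_newline( int c )   { return c == '\n' || c == '\r'; }
static int is_erase( int c )     { return c == 0x08 || c == 0x7F; }
static int is_tab( int c )       { return c == '\t'; }

int main( void )
{  int c;
   while ( ( c = getchar() ) != EOF )
   {  if ( is_printable( c ) )    putchar( c );              /* insert in buffer */
      else if ( is_newline( c ) ) putchar( '\n' );           /* break the line */
      else if ( is_erase( c ) )   fputs( "\b \b", stdout );  /* back up one cell */
      else if ( is_tab( c ) )     fputs( "        ", stdout );  /* expand */
      /* anything else is a command or can be ignored */
   }
   return 0;
}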

With Unicode, a simple text editor requires more like a hundred functions, interacting in ways that are anything but obvious.

And if you need to rely on what you read in the text, as I noted in the rant linked above, you find that displaying the text reliably adds significantly more complexity.

Actually, US English is almost unnaturally simple to parse (relatively speaking). That's why it has been adopted over French, Spanish, Russian, and German, why you don't hear much about Japanese plans to make Japanese the international language, and why the Chinese Communist Party's dreams of making Chinese the international language will just never fly, no matter how significant a fraction of the world's population ostensibly speaks Chinese as a native or second language.

Memorizing 9000+ characters for basic literacy requires starting at the age of two, I hear.

The Chinese may claim a full third, but the other two thirds are not going to happily and willingly accept being forced to propagandize their children (or themselves) with that many characters just to be literate. That alone is oppressive enough to prevent a productive peace.

Even the Japanese subset of two thousand for school literacy basically requires all twelve years of the primary grades to complete.

If we could reduce that burden by teaching the radicals first (we westerners call the sub-parts of Kanji "radicals"), we might have hope of addressing the difficulty, but the radicals themselves are an added layer of parsing. That's multiplicative complexity, which is one of the reasons that approach has not been successful as a general approach. (It is taught, I understand, in some schools of Japanese calligraphy, but that is not a large fraction of the population.)

And the rules for assembling and parsing the radicals are anything but simple.

Now, you may be wondering why I think the radicals should be prioritized in the encoding, but the dirty secret of Kanji is that they are not a closed set, any more than English vocabulary is a closed set. Every now and then someone invents a new one.

Methods to address new coinage must be part of the basic encoding.

This is getting long, and I think I'll wrap up my rant on my motivations for considering something to supersede Unicode here.

I wrote up a summary list of overall goals about three years back, here.

As I've said elsewhere, Unicode has served a purpose until now, and will continue to do so for a few more years, but we need something better.

It needs to provide better separation for the contexts of languages.

Tuesday, September 26, 2017

A 16/32-bit Extension for the Venerable 6809

This is somewhat in the vein of my last several posts, but wandering a little, I have for several years (decades, even) been thinking of ways to extend my favorite microprocessor, the 6809, to make it useful in applications that use Unicode text and decent-sized displays.

I don't have the tools to do HTML tables right now, so I'm going to assume you have tables of 6809 instructions, registers, and addressing modes handy to refer to. Otherwise, this may not make a lot of sense.

Before I begin, I do not intend to do anything with the 6309 extensions. Don't ask me to explain much beyond the fact that success in this kind of endeavor requires keeping things as simple as possible.

[JMR20190309:
Greg Miller, who implemented a true cycle-accurate 6809 HDL core, has some opinions on extending the 6309 which might be relevant here, in his github repository of the HDL core.

(This was brought up yesterday/today in the Facebook group for the TRS-80 Color Computer, in the course of a conversation on extending the 6809 along the same lines as the 65816 extending the 6502, along with a chart of one possible way to fill out the instruction set holes in the 6309, and some hand-waving at reasons not to go down that rabbit hole. And, speaking of the 65816, too bad it didn't have one more stack register. I'd have been really interested.)
]

I'm thinking of calling this the X61609, BTW.

The first step in this project is to double the register sizes.

The accumulators will be 16 bits wide, and the indexable registers will be 32 bits wide:

    A and B 16 bits wide.

    X, Y, U, S and PC 32 bits wide.

The direct page register will be widened on both ends, to contain a full width 32-bit direct page base address instead of just the high-order part.

   DP 32 bits wide -- full address.

[JMR20190309:
In a 16 bit address space, requiring the base address of the process-local statically allocatable variable segment/block to be set on a 256-byte boundary was a rather severe limitation to the usefulness of the DP register. Looking back on it now, it seems a bit surprising that Motorola themselves never corrected this structural error in the design.

I suppose, if we expand the addressing range to 32 bits, the limitations of having the bottom 8 bits of the DP register be zero is not as severe as with only 16 bits of addressing. There could even be an argument made for leaving it at 8 bits, keeping the resultant direct page within the bottom 64K of memory space. But I doubt it saves enough time on address cycles, and the number of gates it saves in internal registers is no longer relevant.
]

The condition codes will be augmented by status bits, and will be 32 bits wide.

    CC has extra condition code and status bits.

If you are familiar with the processor and the instruction set, you are likely wondering what magic will make this work. Sign extension plays a big role.

We will need more instructions. We'll borrow a play from the original 6809 and use page escapes to encode them. That means that, in addition to page2 ($10) and page3 ($11) pre-bytes, we'll have pagex1 ($14), pagex2 ($15), and pagex3 ($18). To simplify, the codes will be mapped out the same as the 8-bit instructions.
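Here is a sketch of how the prebytes might drive decode in an emulator for this hypothetical CPU; the page names come from the paragraph above, and everything else is placeholder.

#include <stdint.h>
#include <stdio.h>

enum page { PAGE1, PAGE2, PAGE3, PAGEX1, PAGEX2, PAGEX3 };

/* Examine the first code byte; report which opcode page it selects
   and whether it was a prebyte that decode should skip. */
static enum page decode_page( const uint8_t *code, int *skip )
{  *skip = 1;
   switch ( code[ 0 ] )
   {  case 0x10: return PAGE2;    /* existing 6809 prebytes */
      case 0x11: return PAGE3;
      case 0x14: return PAGEX1;   /* proposed 16/32-bit pages */
      case 0x15: return PAGEX2;
      case 0x18: return PAGEX3;
      default:   *skip = 0; return PAGE1;  /* unprefixed 8-bit map */
   }
}

int main( void )
{  uint8_t code[] = { 0x14, 0x8E };   /* a pagex1 prebyte, then an opcode */
   int skip;
   enum page p = decode_page( code, &skip );
   printf( "page %d, opcode %02X\n", (int) p, (unsigned) code[ skip ] );
   return 0;
}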

For immediate (constant) operands, the constant portion of the instructions will be 16 bits for accumulators A and B, and 32 bits for the indexable registers X, Y, U, S, and PC, and also 32 bits for the double accumulator, D.

The double accumulator D for 16-bit instructions will be the concatenation of A:B in their full width, with A being the top 16 bits.

For 8-bit instructions, the 16-bit result will generally be the sign extension of the 8-bit result. For example,

    LDA #-1

will load the 8 bits of the argument and sign extend it, so that the entire 16 bits of A will be -1. (Bit rotate and shift instructions will be the exceptions that prove this rule.)

For 8-bit instructions involving D, when D is the target of the instruction, D will be the concatenation of the low 8 bits of the results in A and B, sign extended.

Likewise, when loading a sixteen bit value into an indexable register, it will be sign extended.

    LDX #$8000

will result in X containing $FFFF8000. Appending "W" to instructions to specify their 16-bit forms,

    LDW X #$8000

is the form required to get $8000 into X, and it will consume 6 bytes of instruction space:

    $14 $8E $00 $00 $80 $00

This may feel strange, but it should yield the best emulation of the 6809 in existing hardware designs, by decoding the bottom 32K of memory as the low half of 6809 address space and the top 32K as the high half.
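A little emulator-style C may make the sign-extension rules easier to see. This is only my model of the rules above, not a specification:

#include <stdint.h>
#include <stdio.h>

static int16_t regA;   /* accumulator, now 16 bits wide */
static int32_t regX;   /* indexable register, now 32 bits wide */

static void lda_imm8( int8_t operand )     /* LDA #n : 8-bit form */
{  regA = operand;     /* C sign-extends int8_t to int16_t for us */
}

static void ldx_imm16( int16_t operand )   /* LDX #n : old 16-bit immediate */
{  regX = operand;     /* sign-extended to the full 32 bits */
}

static void ldw_x_imm32( int32_t operand ) /* LDW X #n : full-width form */
{  regX = operand;
}

int main( void )
{  lda_imm8( -1 );
   printf( "A = %d\n", regA );   /* -1: all 16 bits set */
   ldx_imm16( (int16_t) 0x8000 );
   printf( "X = %08lX\n", (unsigned long) (uint32_t) regX );   /* FFFF8000 */
   ldw_x_imm32( 0x8000 );
   printf( "X = %08lX\n", (unsigned long) (uint32_t) regX );   /* 00008000 */
   return 0;
}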

Shifts and rotates require a bit of special attention. The 6809 carry bit will be retained, and will always reflect the carry result as defined by the 6809 -- in other words, carry from bit 7 mostly, and from bit zero in right rotates, etc. There will be an additional 16-bit carry that will be set according to the 16-bit results, mostly carry from bit 15, etc., and it will provide the carry source for sixteen-bit instructions, too.
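In C, keeping both carries on a 16-bit add might look like this sketch (my reading of the rule, not a spec):

#include <stdint.h>
#include <stdio.h>

struct flags { unsigned c8 : 1; unsigned c16 : 1; };

static uint16_t add16( uint16_t a, uint16_t b, struct flags *f )
{  uint32_t wide = (uint32_t) a + b;
   f->c8  = ( ( ( a & 0xFF ) + ( b & 0xFF ) ) >> 8 ) & 1;  /* carry out of bit 7 */
   f->c16 = ( wide >> 16 ) & 1;                            /* carry out of bit 15 */
   return (uint16_t) wide;
}

int main( void )
{  struct flags f;
   uint16_t r = add16( 0x00FF, 0x0001, &f );
   printf( "%04X c8=%u c16=%u\n", r, (unsigned) f.c8, (unsigned) f.c16 );  /* 0100 c8=1 c16=0 */
   return 0;
}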

For register exchange and transfer instructions, whole registers will always be exchanged or transferred, except for DP and CC. Bits 8-15 of DP will be involved in 8-bit transfers and exchanges, clearing the low byte of DP and extending the sign into the high word as necessary. DP will be allowed in exchanges and transfers with indexable registers, in which case all 32 bits will be affected.

The status register will be allowed in transfers and exchanges with D, in which case all 32 bits will be affected (at least in system mode).

Stack operations may seem to be a cause for concern, but PSH and PUL will be in both the 8- and 16-bit maps, and the 8-bit push and pull will behave as they do in the 6809, where the 16-bit forms will do the logical thing. The one point of concern is that 8-bit PULs will sign-extend their results, to maintain compatibility.

8-bit SWIs will behave as in the 6809, and 16-bit SWIs will save entire registers.

The traditional interrupt and reset vectors will be sign extended before being used. Additional 16-bit (thus 32-bit address) interrupt and reset inputs and vectors will be provided.

Finally, the index postbytes will remain almost as they are, with the exception that post-inc/dec modes will increment or decrement by 2 or 4 when used by 16-bit instructions. (I considered multiplying all offsets by two for 16-bit instructions, but it looks like that would complicate, rather than simplify things.)

Also, there will be at least three new indexing modes --

    1RRI1010 will provide 32-bit offsets from X, Y, U, and S.
    1XXI1110 will provide 32-bit offsets for PC
    101I1111 will provide 32-bit absolute addresses for indirection. (Tentative.)

And I guess a DP relative indexed mode would be useful, to allow using DP variables as extra index registers (6502 envy). There should be room in the undefined postbyte values. But the slickest solution, slipping this mode into the don't care assignments for PC relative, depends on how much ROMmed software there is that used something besides 0s in the don't care bits. It could be switched on and off with a (system-mode-only) status bit, perhaps.

[JMR20190309:
Specifically, I want to be able to use these two currently illegal addressing modes:

    LDB [<CURRENT_BYTE]  ; Think of input buffer operations.

and

    LEAY <CACHED_RECORD

The former currently requires something like

    PSHS X  ; X probably has something important in it.
    LDX <CURRENT_BYTE
    LDB ,X   ; Just want to check it.
    PULS X

Sure, unless I can also use auto-inc or offsets on DP variables, that's a bit limited in usefulness. But the latter currently requires something like

    PSHS D,Y
    CLRB
    TFR DP,A
    TFR D,Y
    PULS D
* Do stuff, and don't forget to
    PULS Y

And popping Y back off the stack is sure easy to forget.
]

From this point, the next priority would be memory management and support for isolating instances of emulated systems. That will be for another time.

And, if there is room left over after that, it would be nice to add bit-level primitives for multiply and divide, such that a 32-bit multiply or divide could just be a sequence of 32 of the primitives. (The primitive would conditionally add or subtract according to the bit being examined and then shift the result one, ready to examine the next.)
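As a sketch of what one such primitive might do, here is the conditional-add-and-shift step in C; thirty-two of these steps make an unsigned 32-by-32-to-64-bit multiply. The names are mine:

#include <stdint.h>
#include <stdio.h>

/* One primitive step: examine the low multiplier bit, conditionally add
   the multiplicand into the high half, then shift carry:hi:lo right one,
   ready to examine the next bit. */
static void mulstep( uint32_t multiplicand, uint32_t *hi, uint32_t *lo )
{  uint32_t carry = 0;
   if ( *lo & 1 )
   {  uint64_t sum = (uint64_t) *hi + multiplicand;
      *hi = (uint32_t) sum;
      carry = (uint32_t) ( sum >> 32 );
   }
   *lo = ( *lo >> 1 ) | ( *hi << 31 );
   *hi = ( *hi >> 1 ) | ( carry << 31 );
}

int main( void )
{  uint32_t hi = 0, lo = 123456;   /* multiplier starts in the low half */
   int i;
   for ( i = 0; i < 32; ++i )
      mulstep( 654321, &hi, &lo );
   printf( "%llu\n", ( (unsigned long long) hi << 32 ) | lo );  /* 80779853376 */
   return 0;
}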


Friday, June 30, 2017

Keeping the Return Address Stack Separate

So far, in this extended rant, I have looked at how stack vulnerabilities can be exploited. Now I want to show how a CPU could completely prevent an entire subclass of these attacks, without a whole lot of loss in processor speed.

I first came on these ideas twenty, maybe thirty years ago, when trying to figure out what made the M6809 such a magical microprocessor. (That's a subject for another day.) The M6809 was (and still is) an 8-bit microprocessor with a 16-bit address bus. That means it can address 64Kbytes of memory.

Motorola specified a memory management part called the 6829 which supported designs of up to 1 megabyte of memory. It was essentially a block of fast RAM used to translate the upper bits of the address bus, plus a task latch that selected which part of that RAM would be used for the translation, something like this:



(This is from memory, and not really complete. Hopefully, it's enough to get the concept from.)
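In C, the translation path would look something like this sketch; the page size and entry counts here are from my memory of the 6829 and may well be off:

#include <stdint.h>
#include <stdio.h>

#define PAGE_BITS 11                 /* 2K pages: the low 11 bits pass through */
#define PAGES     32                 /* a 16-bit space holds 32 such pages */
#define TASKS     8

static uint16_t map[ TASKS ][ PAGES ];  /* fast RAM holding physical page numbers */
static unsigned task_latch;             /* selects which task's mappings apply */

static uint32_t translate( uint16_t logical )
{  unsigned page = logical >> PAGE_BITS;
   uint32_t physical_page = map[ task_latch ][ page ];   /* up to 1 megabyte */
   return ( physical_page << PAGE_BITS ) | ( logical & ( ( 1u << PAGE_BITS ) - 1 ) );
}

int main( void )
{  task_latch = 1;
   map[ 1 ][ 0x1F ] = 0x1FF;   /* top logical page -> top of the megabyte */
   printf( "%05lX\n", (unsigned long) translate( 0xFFFF ) );   /* FFFFF */
   return 0;
}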

Memory management control would provide functions like write protect and read protect, so you could keep the CPU from overwriting program code and set parts of the address space up as guard pages.

With 32 bits and more of address, memory management doesn't quite work this simply, but this is enough to get some confidence that memory management can actually and meaningfully be done.

Now, if you are familiar with the 8086, you may wonder what the difference between the 8086's segment registers and this would be. This kind of scheme provides fairly complete control over the memory map.

The 8086 segment registers only moved 64k address windows around in the physical map, and provided no read or write control. Very simple, but no real management. The 80286 provided write protect and such, but the granularity was still abysmal, mostly based on guesses about the usage of certain registers, guesses which sort of worked with some constrained C programming language run-time models. And these guesses were frozen into silicon before they were tested. It should go without saying, that such guesses miss the mark for huge segments of the industry, but Intel's salescrew has always been trained in the art of smooth-talking.

(Intel are not the only badguys in the industry, they are just the ones who played this particular role.)

Now I knew about both of these approaches, and I knew about the split stack in Forth. And it occurred to me that, if a 6829-like MMU could talk to the CPU, and select a different task latch on accesses through the return stack pointer (S in the 6809), you could make it completely impossible to crash the return stack and overwrite return addresses.


I'm not talking about guard pages; I'm talking about return addresses that simply can't be accessed by any means except call and return. They're outside the range of addresses that application code can generate.

Of course, the OS kernel can access the stack regions by mapping them in, but we expect the OS to behave itself.

(We would provide system calls to allow an application to have the OS adjust a return address when such is necessary.

Also, since we are redesigning the CPU, we might add instructions for exceptional return states, but I would really rather not do that. It seems redundant, since split stacks make multiple return values so much easier.)

Another thing that occurred to me is that the stack regions could be mapped to separate RAM from the main memory. This would allow calls and returns that would take no more time than regular branches or jumps.

At this point in my imaginings, I'm thinking about serious redesign of the CPU. So I thought about adding one more stack register to the 6809, a dedicated call/return stack. It would never be indexed, so it would be a very simple bit of circuitry. That would free up registers for other use, including additional stacks and such.

(Well, if we allow frame pointers to be pushed with the instruction pointers, and provide instructions for walking the stack, there would be one kind of indexing -- an instruction to fetch a frame pointer at a specific level above the current one. I'll explain how this would work, to aid understanding what's going on here:

There would be a couple of bits in the processor status area, which the OS would set before calling the application startup. The application must not be allowed to modify these bits, but, since the application must be able to confirm that the frames are present, it should be able to read them.

These bits would tell the processor which stack pointers to save with the instruction pointer on calls. The return instructions would have a bit field to determine whether to restore or discard each saved frame pointer.

"Walking the stack" would be simply a load of a specified saved frame pointer at a specified level of calling routine.

In the example shown here, the instruction GETFP sees from the status register that both LP and SP are being recorded, and multiplies the index argument by 3, then adds 2 to point to LP, checks against the return stack base register, and loads LP0 into X.

But GETFP SP,3,Y, whose computed address points into stack that isn't there, checks the return stack base register and refuses to load a frame pointer that isn't there.

Another flag in the status register might select between generating an exception on failure and recording the failure in a status bit.

Maybe. :-/)
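Here is a C model of the GETFP bounds check, assuming three-cell frames (IP, SP, LP) on the protected return stack; the check against the return stack base register becomes a simple depth check in this model, and the names are illustrative.

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define FRAME_CELLS 3   /* IP, SP, LP recorded per call, per the status bits */

/* Returns 0 and leaves *out untouched when the requested frame isn't there. */
static int getfp_lp( const uint32_t *stack, size_t depth_cells,
                     unsigned level, uint32_t *out )
{  size_t index = (size_t) level * FRAME_CELLS + 2;   /* times 3, plus 2 for LP */
   if ( index >= depth_cells )
      return 0;          /* refuses to load the frame pointer that isn't there */
   *out = stack[ index ];
   return 1;
}

int main( void )
{  uint32_t stack[ 9 ] = { 100, 10, 20, 101, 11, 21, 102, 12, 22 };
   uint32_t lp;
   if ( getfp_lp( stack, 9, 1, &lp ) )
      printf( "LP at level 1 = %lu\n", (unsigned long) lp );   /* 21 */
   if ( !getfp_lp( stack, 9, 3, &lp ) )
      printf( "no such frame\n" );
   return 0;
}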

Could we do such a thing with the 68000 or other 32-bit CPUs? Add a dedicated call/return stack and free the existing stack pointer for use as a parameter stack in a split-stack architecture? 64-bit CPUs?

Sure.

But if we intend to completely separate the return addresses, we have to add at least one bit of physical address, or we have to treat at least one of the existing address bits (the highest bit) in a special way. I think I'd personally want to lean towards adding a physical address bit, even for the 64-bit CPUs, to keep the protection simple. But, of course, there are interesting possibilities with keeping the physical and logical addresses the same size, but filtering the high bit in user mode.

And that would provide us with a new kind of level-1 cache -- 8, 16, or maybe 32 entries of spill-fill cache attached to the call/return stack, operating in parallel with a (modified) generalized level-1 cache. The interface between memory management and cache would need a bit of redesign, of course.

I'm not sure it would mix well with register renaming. At bare minimum, the call/return stack pointer would have to be completely separate from the rename-able registers.

Would this require rewriting a lot of software?

Some, but mostly just the programming language compilers would need to be worked on.

And most of the rewrite would focus on simplification of code designed to work around the bottleneck of having the return addresses mixed in with parameters.

There you have a way to completely protect the return addresses on stack.

What about other regions of memory? Can we separate them meaningfully?

That kind of thing is already being done in software used on real mainframes, so, yes. But it does have a much larger impact on existing software and on run-time speed, and it is not as simply accomplished.

But that's actually a question I want to visit when I start ranting about the ideal processor that I want to design but will probably never get a chance to. Later.

Wednesday, June 28, 2017

Proper Use of CPU Address Space

I referred to this in my overview about re-inventing the industry, but I was not very specific. Now that I've been motivated to write a rant or two about memory maps and how they can be exploited, I can write about the ideal, perfect CPU (that may be too perfect for this world), and how it works with memory.

First, these are the general addressable regions of memory that you want to be able to separate out. I'll put them in the order I've been using in the other two rants:


0x7FFFFFFFFFFFFFFF
  stack (dynamic variables, stack frames, return pointers)
0x7FFFxxxxxxxxxxxx ← SP
  gap
  guard page (Access to this page triggers OS responses.)
  gap
  heap (malloc()ed variables, etc.)
  statically allocated variables
0x4000000000000000
  application code
0x2000000000000000
  operating system code, variables, etc.
0x0000000000000000



The regions we see are
  • Stack (dynamic variables, stack frames, return pointers)
  • Heap (malloc()ed variables, etc.)
  • application code (including object code, constants, linkage tables, etc.)
Operating system code should include the same sort of regions. But that should not really be visible in the application code map. Only the linkage to the OS should be visible, and that would be clumped with the application code.

Memory management hardware provides the ability to move OS code out of the application map. Let's see how that would look:



0x7FFFFFFFFFFFFFFF
  stack (dynamic variables, stack frames, return pointers)
0x7FFFxxxxxxxxxxxx ← SP
  gap
  guard page (Access to this page triggers OS responses.)
  gap
  heap (malloc()ed variables, etc.)
  statically allocated variables
0x4000000000000000
  application code
0x0000000000000000


We used to talk about the problems of accidentally using small integers as pointers. Basically, when pointer variables get overwritten with random integers, the overwriting integers tend to be relatively small integers. Then when those integers are used as pointers, they access arbitrary stuff in low memory. We can notice that and refrain from allocating small integer space. And we realize that we have already dealt with small negative integers by buffering the wraparound into highest memory:


0xFFFFFFFFFFFFFFFF
  gap (wraparound and small negative integers)
0x8000000000000000
  stack (dynamic variables, stack frames, return pointers)
0x7FFFxxxxxxxxxxxx ← SP
  gap
  guard page (Access to this page triggers OS responses.)
  gap
  heap (malloc()ed variables, etc.)
  statically allocated variables
0x4000000000000000
  application code
0x0000000100000000
  gap (small integers)


I've posted a rant about using a split stack, with a little of the explanation for why at the end. Basically, that would allow us to move those local buffers that can overflow, crash, and/or smash the stack way away from the return address stack.

Thus, even if the attacker could muck in the local variables, he would still be at least one step from overwriting a return address. That means he has to use some harder method to get control of the instruction pointer.

Stack usage patterns actually point us to using a third stack, or a stack-organized heap separate from the random allocation heap. Parameters and small local variables could be on one stack, and large local variables on the other.

In other words, scalar local/dynamic variables would be on the second stack and vector/structure local/dynamic variables on the third. This would be especially convenient for Forth and C run-times, virtually eliminating all need of function preamble and cleanup, and simplifying stack management.

Another way to use the third stack would be to just put all the local variables on it. It might be easier to understand it this way, and I'll use the parameter/locals division below. As far as the discussion below goes, the two divisions can be interchanged. (The run-time details are significant, but I'll leave that for another day. Besides, there is no reason for a single computer to limit itself to one or the other. With a little care, the approaches could even be mixed in a running process.)

But the third stack could be optional, and its use determined by the language run time support. The OS run-time support really doesn't need to see it other than as a region to be separated from the others. Here is a possible general map, using 64 bit addressing:


0xFFFFFFFFFFFFFFFF
  gap (wraparound and all negative integers)
0x8000000000000000
  gap (large positive integers)
0x7FFFFF0000000000
  gap
  return stack ← RP
  gap
0x7FFFFE0000000000
  guard page (2^40 addresses)
0x7FFFFD0000000000
  gap
  parameter stack ← SP
  gap
0x7FFFFC0000000000
  guard page (2^40 addresses)
0x7FFFFB0000000000
  gap
  local stack ← LP
  gap
0x7FFFFA0000000000
  guard page (really huge)
0x7000000000000000
  gap
  heap (malloc()ed variables, etc.)
  gap
0x4000020000000000
  guard page (2^40 addresses)
0x4000010000000000
  gap
  statically allocated variables
  gap
0x4000000000000000
  gap
  application code
  gap
0x0000010000000000
  gap (small positive integer pointer guard)
0x0000000000000000


If we choose to have stack frames, we could manage them very simply on the return stack by just pushing the locals and/or parameter stack pointer when we push the IP. And we just discard them when we pop the IP. Or we can pop them, to force-balance the stack. This gets rid of pretty much all the complexity of walking the stack.
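As a sketch of that discipline, here is the call/return pair in C, modeling the protected return stack as an array the application cannot otherwise address; restoring LP on return is what force-balances the locals stack. The names are mine.

#include <stdint.h>
#include <stdio.h>

static uint32_t rstack[ 1024 ];   /* protected: call and return only */
static unsigned rsp = 1024;       /* grows downward */
static uint32_t LP;               /* locals stack pointer */

static void call( uint32_t *ip, uint32_t target )
{  rstack[ --rsp ] = LP;          /* saved frame pointer */
   rstack[ --rsp ] = *ip;         /* return address */
   *ip = target;
}

static void ret( uint32_t *ip, int restore_lp )
{  *ip = rstack[ rsp++ ];
   if ( restore_lp ) LP = rstack[ rsp++ ];   /* pop: force-balance the stack */
   else              rsp++;                  /* discard the saved LP */
}

int main( void )
{  uint32_t ip = 0x1000;
   LP = 0x4000;
   call( &ip, 0x2000 );   /* enter a routine that allocates some locals ... */
   LP -= 64;
   ret( &ip, 1 );         /* ... and returning restores LP automatically */
   printf( "ip=%lX LP=%lX\n", (unsigned long) ip, (unsigned long) LP );
   return 0;
}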

The gaps should be randomized, to make it harder for attacker code to find anything to abuse.

The regions we now have are
  • Return Stack (return address and maybe frame pointers)
  • Parameter stack (parameters only)
  • Locals Stack (dynamically allocated local variables)
  • Heap (malloc()ed variables, etc.)
  • Statically allocated process variables (globally and locally visible)
  • application code (including object code, constants, linkage tables, etc.)
And we have large guard regions between each.

What's missing?

Multiprocessing requires a region of memory dedicated to process (or thread) shared variables, semaphores, resource monitor counters and such. This is a separate topic, but basically the statically allocated variable area would have a section which could be protected from bare writes, with only reads and locked read-modify-write cycle instructions allowed. These would be a separate region, so their addresses could be somewhat randomized.
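C11 atomics can stand in for those locked read-modify-write cycles in a sketch, though the hardware region protection itself is below anything C can express; the names here are illustrative.

#include <stdatomic.h>

static _Atomic int semaphore = 1;   /* lives in the protected shared region */

static int try_acquire( void )
{  int expected = 1;
   /* A single locked read-modify-write cycle. Only this kind of access,
      plus plain reads, would be allowed on the protected region. */
   return atomic_compare_exchange_strong( &semaphore, &expected, 0 );
}

static void release( void )
{  atomic_store( &semaphore, 1 );
}

int main( void )
{  return try_acquire() ? ( release(), 0 ) : 1;
}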

I'm not sure that it makes sense to manage allocation of shared variables in the malloc() sense, but there is room with this kind of scheme, and modern processors should support that many different regions of memory.

Also, regions of memory shared mmap-style would be in a separate region, or perhaps a guarded region for each. I'm not sure whether they would be protected in the same way as semaphores and monitor counters. It would seem, rather, that the CPU instructions would be ordinary instructions, and the mmap region would be a resource protected by semaphore- or monitor-controlled access.

We can do the same sort of thing with 32-bit addressing, although, instead of guard pages 2^40 or so addresses in size, we would be looking at guard pages between 2^20 and maybe 2^24 in size. This would be more appropriate for some controller applications.

We could do the same thing with 16-bit addressing, but it wouldn't leave us much room for the variables and code. On the other hand, looking twice at 16-bit addressing will give us clues for further refinement of these ideas. But I think I'll save that for another rant, probably another day. I have burned up enough of today on this prolonged rant.

Monday, June 12, 2017

Reinventing computers.

I mention my bad habits a bit, but I don't really go into much detail:
One of these days I'll get someone to pay me to design a language that combines the best of Forth and C.
Then I'll be able to leap wide instruction sets with a single #ifdef, run faster than a speeding infinite loop with a #define,
and stop all integer size bugs with my bare cast.
I recall trying to get a start ages ago, on character encoding, CPUs, and programming languages, and, more recently, more on character encoding.

These are areas in which I think we have gone seriously south with our current technology.

First and foremost, we tend to view computers too much as push-button magic boxes and not enough as tools.

Early PCs came with a bit of programmability in them, such as the early ROMmed BASIC languages, and, more extensively, toolsets like the downloadable Macintosh Programmer's Workshop. Office computers also often came with the ability to be programmed. Unix (and Unix-like) minicomputers and workstations generally came with, at minimum, a decent C compiler and several desktop calculator programs.

Modern computers really don't provide such tools any more. It's not that they are not available, it's that they are presented as task-specific tools, and you often have to pay extra for them. And they are not nearly as flexible (MSExcel macros?).

Computers were not given to us to use as crutches. They were given to us to help us communicate and to help us think.

I'm not alone in my interest in retro-computing, but I think I have a somewhat unusual ultimate goal in my interests.

I want to go back and re-open certain paths of exploration that the industry has lopped off as being too unprofitable (or, really, too profitable for someone else).

One is character encoding. Unicode is too complicated. Complicated is great for big companies who want to offer a product that everyone must buy. The more complicated they can make things, the harder it is for ordinary customers to find alternatives. And that is especially true if they can use patents and copyrights on the artificial complexities that they invent, to scare the customer away from trying to solve his or her own problems -- or their own corporate problems, in the case of the corporate customer.

Computers are supposed to help us solve our own problems, not to impose our own solutions on unsuspecting other people, while making them pay for the solutions that really don't solve their problems.

Now, producing something simpler than Unicode is going to be hard work, harder even than putting the original Unicode together was.

Incidentally, for all that I seem to be disparaging Unicode, the Unicode Consortium has done an admirable job, and Unicode is quite useful. They just made a conscious decision to try not to induce changes on the languages they are encoding. It's a worthy and impossible goal.

And they should keep it up. Even though it's an impossible goal, their pursuing that goal is enabling us to communicate in ways we couldn't before.

But we must begin to take the next step.

Rehashing,
  • The encoding needs to include the ability to encode a single common set of international characters/glyphs in addition to all the national encodings.
  • It needs to include
    • characters,
    • non-character numerics,
    • bitmap and vector image,
    • and other arbitrary (binary/blob) data.
  • It needs to be easily parsed with a simple, regular encoding grammar.
  • And it needs to be open-ended, allowing new words and characters to be coined on-the-fly.
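As one rough illustration of the easy-parsing and open-endedness points, consider an encoding in which every element is a self-describing tag-length-value frame, so that a reader can walk or skip anything, including data types it does not understand. The tags here are invented for illustration, not a proposal:

#include <stdint.h>
#include <stddef.h>

enum tag { TAG_CHARS = 1, TAG_NUMERIC = 2, TAG_BITMAP = 3, TAG_BLOB = 4 };

struct element { uint8_t tag; uint32_t length; const uint8_t *value; };

/* Parse one element; returns bytes consumed, or 0 on truncated input. */
static size_t parse( const uint8_t *p, size_t avail, struct element *out )
{  if ( avail < 5 ) return 0;
   out->tag = p[ 0 ];
   out->length = (uint32_t) p[ 1 ] << 24 | (uint32_t) p[ 2 ] << 16
               | (uint32_t) p[ 3 ] << 8  | p[ 4 ];
   if ( avail < 5 + (size_t) out->length ) return 0;
   out->value = p + 5;
   return 5 + out->length;   /* a reader can always skip an unknown tag */
}

int main( void )
{  static const uint8_t data[] = { TAG_CHARS, 0, 0, 0, 2, 'h', 'i' };
   struct element e;
   return parse( data, sizeof data, &e ) == 7 ? 0 : 1;
}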
Another path involves CPUs. Intel wants us all to believe that they own the pinnacle of CPU design, but, of course, that is just corporate vanity.

In the embedded world, lots of CPUs that the rest of the world has forgotten are still very much in use, because their designs are optimal in specific engineering contexts. Tradition is also influential, but there are real, tangible engineering reasons that certain non-mainstream CPUs are more effective in certain application areas. The complexity and temporal patterns of the input will favor certain interrupt architectures. The format of the data will favor specific register set constructions. Etc.

Many engineers will acknowledge the old Motorola M6809 as the most advanced 8-bit CPU ever, but it seems to have been a dead-end. ("Seems." It is still in use.) "Bits of it lived on in the 68HC12 and 68HC16." But the conventional wisdom is now that, if you need such an advanced CPU, it's cheaper to go with a low-end 32-bit ARM processor.

What got left behind was the use of a split stack.

The stack is where the CPU keeps a record of where it has been as it chases the branching trails of a problem's solution. When the CPU reaches a dead end, the stack provides an organized structure for backtracking and starting back down new branches in the trail.

Even "stackless" run-time environments tend to imitate stacks in the way they operate, because of a principle called problem context, in addition to the principle of backing out of a non-workable solution.

But the stack doesn't just track where the CPU has been. It also keeps the baggage the CPU carries with it, stuff called local (or context-local) variables. Without the data in that baggage, it does no good for the CPU to try to back up. The data is part and parcel of where it has been.

Most "modern" CPUs keep the code location records in the same memory as the context-local data. It seems more efficient, but it also means that a misbehaving program can easily lose track of both the context data and the code location at once. When that happens, there is no information to analyze about what went wrong. The machine ends up in a partially or completely undefined state.

Worse, in a hostile environment, such a partially defined state provides a chance for attacking the machine and the persistent data that it keeps on the hard disk. (Stack crashes are most effective when the state of the program has already become partially undefined.)

[JMR201706301711: I've recently written rather extensively on stack vulnerabilities and using a split stack to reduce the vulnerabilities:
]

Splitting the stack allows for more controlled recovery from error states that haven't been provided for. In the process, it reduces the surface area susceptible to attack.

The split stack also provides a more flexible run-time architecture, which can help engineers reduce errors in the code, which means fewer partially-defined states.

There are a couple of other areas in which so-called modern CPUs in use in desktop computers and portable data devices are not well matched to their target application areas, and the programming languages (and operating systems), reflecting the hardware, are likewise not well-matched. This is especially true of the sort of problems we find ourselves trying to solve, now that we think most of the easy ones have been solved.

In order to flesh out better CPU architectures, I want to build a virtual machine for the old M6809, then add some features like system/user separation, and then design an expanded address space and data register CPU following the same principles.

I'm pretty sure it will end up significantly different from the old 68K family. (The M6809 is often presented as a "little brother" to the 68K, but they were developed separately, with separate design goals. And Motorola management never really seemed to understand what they had in the 6809.)

Once I have an emulator for the new CPU, I want to develop a language that takes advantage of the CPU's features, allowing for a richer but cleaner procedural API that becomes less of a data and time bottleneck. It should also allow for a more controlled approach to multiprocessing.

And then I want to build a new operating system on what this language and CPU combination would allow, one which would allow the user to be in control of his tools, instead of the other way around.

This is what I mean when I say I am trying to re-invent the industry.

Saturday, May 27, 2017

A New Kind of SPAM (Unsolicited Advertising)

First, let's get one thing straight. The true meaning of "SpAM" as an epithet for unsolicited advertising is
SPanning Advertisement Message
I know this because I wasn't there. ;-)

This was told to me by a fellow student when I was an undergraduate at BYU during the days before the Internet was the Internet. The semi-apocryphal story referenced the cool idea that wasn't so cool after all, of taking a "span" of the usenet newsgroups and crossposting an announcement to them all.

For instance, if you wanted to advertise a conference on religious studies, you might send a message to "soc.religion.*", meaning all the newsgroups under soc.religion.

Since mailing lists have pretty much replaced the usenet newsgroups, the number of groups in general, and the number under soc.religion, in particular, has fallen drastically. Indeed, it's almost hard to find access to usenet itself, any more.

(Google hasn't helped, really, in trying to merge usenet with their Google Groups. Maybe that merger wasn't what they were trying to do, but that has been the result. And the independent usenet is disappearing.)

Discussing the pros and cons of usenet, both as a concept, and as it really happened, is not what I want to do today.

It used to be that there were more than fifty newsgroups under soc.religion:
soc.religion.atheists, ... soc.religion.bahai, soc.religion.catholic, ... soc.religion.christian, ... soc.religion.lutheran, ... soc.religion.mormon, soc.religion.universalist, ...
And some of the groups had subgroups, like soc.religion.mormon.meetinghouse. (In this particular case, the parent soc.religion.mormon group was supposed to be for "serious" discussion, and the meetinghouse group for the kind of talk you'd hear after the meetings were over. Each subgroup had its own reasons and rules.)

Each group had tens, hundreds, and even thousands of subscribers. One spanning message could result in an avalanche of hundreds of thousands, or even millions of messages flooding the lines in those good old days of slow network access. (Our lines were tens of thousands of times slower back then.)

People wanted their bandwidth for other things, of course.

But the business types wanted their pipes into people's mailboxes.

Because it takes real money to maintain the internetwork connections that compose the Internet, the business aspect of the Internet has not disappeared, and I guess it won't disappear any time soon. (I haven't enabled ads in this blog yet, but several other blogs have ads enabled, as a kind of moral nod to Google's largess. If my novels were selling, ...)

I've blogged about the archetypical spam before. Most of it is pretty transparent, if you can look at yourself with any semblance of detachment, and if you have any sense of whom you know and whom you don't know.

Messages from people you don't know are to be suspect, first and foremost. They are especially to be suspect if they make some sort of offer.

If the offer is unreasonable, such as money or love for nothing, it's going to be scarier if it is real. Seriously.

A million dollars from someone you've never met?

You'd better believe, if the money is real, it doesn't come without strings attached, burdensome strings, strings that are very hard to get rid of.

Money is like that anyway -- it always comes with strings attached. Big money comes with ropes.

Sex? P0rn? Drugs? It's the same. If it claims to come at no cost, you should be running away from it at full speed.

[JMR201707050803:

Here's part of why p0rn is bad, BTW: it seems to offer intimacy at no cost. It isn't real intimacy, and there is definitely a cost.

]

The Love of Jesus? Well, that's different. Except that, even Jesus expects you to follow His teachings if you want to benefit from being saved.

Good things take some sort of effort, if you really want the good things to be good.

If everyone in the world could understand that, no one would respond to fraudulent junkmail, and the fraudulent stuff would disappear.

(Well, if everyone understood, the need for sending the SPAM would disappear
pretty quickly, too.)

But we have SPAM. Some of the come-ons are getting more sophisticated. Here's one I got today:

How Secure is your Wi-Fi ?

Sender: anonymity@privacyissafety.com
(I changed the URL so I'm not advertising a company I know nothing of, but it was similarly buzzword-cool.)

Does that sound suspicious to you? It sounds suspicious to me. At least, any website domain name composed of buzzwords bears investigation. Do not take them at face value.

Let's look at what it says:
Wondered if anyone could be sniffing on every bit of your oline activities and data? , yes its just started happening..The US senate just recently passed a bill (S.J.RES 34) which allows your ISP (Internet Service Provider) to sell your web browsing histories and geolocation data to advertisers and partner companies. Yes, you read that right, your privacy is up for sale with the support of the US Senate.

Not Cool right?

Yes so uncool!
If you are a native English speaker, you may get a feeling as if someone who doesn't know how to be cool is trying to sound cool. But that is not the big problem.

What is SJ res 34?

Here is something a search brings back:
https://www.congress.gov/bill/115th-congress/senate-joint-resolution/34
This doesn't make sense, but it appears that the collected Congresscritters have indeed sold your Internet-related private information down the river. It doesn't make sense, but, if it hasn't really happened yet, it's going to happen. When you tell people to make your laws and then take the leash off them, they will make all sorts of laws that help keep them and their friends in power. Expecting them to refrain from doing so makes even less sense than the things they do.

(This is the real danger in allowing NSA to do what they do. If they can do it, why can't Verizon? Especially since it's Verizon and the other ISPs who do the actual eavesdropping? And Google. Who can disobey King Money?)

I'll have to do more research on SJ resolution 34, but, in the meantime, let's look some more at this unrequested advertisement.

We cant let them do this, so here is how to easily take charge of your privacy from the deep root.

If you want to encrypt the online activity on all your business or family’s devices, set up a VPN router. A VPN router encrypts internet traffic at the source by default—so you won’t have to remember to switch on your VPN each time you start a device.
It's an ad for a VPN router.

If you think you are interested, you have to understand. Your provider can't send a request out for you without knowing what request it is sending out and where it is going. The problem being referred to here is how long they can keep the logs of your requests, and whether they can sell them.

VPN doesn't change that. Even if the VPN company proxies all your requests, the only thing changed is that you are now trusting two companies -- your provider and your VPN provider. And you have no promise, no reason to believe that they will keep each other in check.

VPN is a legitimate service, even if the ads are borderline and unrequested. But you really would prefer to have someone you know personally run your VPN.

Without investigating the company who ostensibly sent this, I can't say they are either legitimate or fraudulent. And I don't particularly want to, because then I wouldn't have the marginal junkmail example that I want. For now, I can call this a
marginal junkmail
And that is the thing I want to rant about today.

Jon Postel was a wonderful guy. He was also an idealist.

He set up a relatively clean foundation for the Internet we have, in which all participants could be equal.

The problem with equality is that we don't all behave as if we are equal.

Apple (the computer company) has to have their cachet. So does Microsoft.

Why?

Because they don't know how to be equal.

Google used to know how to be equal, but that only lasted until they had to start turning a profit. (Hey, that's what happened to Microsoft and Apple, too, isn't it?)

If you want to play as an equal with the "big boys", you have to learn how to play by their rules. That means you have to be able to read advertisements and figure out whether they are legitimate or not, and then start figuring out whether what they are advertising is something you need. And you have to have money and technical expertise to set up your own infrastructure. And you have the time, or you have to be able to hire someone who can do it for you.

In the fields where you can do that, you can be okay as a peer. In the fields where you can't, you are not so equal.

The egalitarian Internet has to be there.

But we do need closed networks, as well.

The question I want to try to make obvious here is this --

Who do you trust to build your closed networks?

And the answer to that actually lies in something Jon Postel and his buddies tried to build into the Internet, something that the current crop of providers are trying to dismantle:

Decentralization.


Distributed networks, distributed control.

Distributed control means that you control your part

And you can't control your part when you don't have some sort of understanding how it works. Nor can you when you can't control the hardware that runs your part of it. Nor when you can't control how it connects with the rest.

Which is really the reason for this blog.

(And the rant I posted several days back about the modern Danegeld.)

Friday, April 28, 2017

LSB 1st vs. MSB 1st -- Ignore Gulliver

I was working on a programming problem for a novel I'm trying to write, and got this bug under my skin again.

Something like fifty years ago, an argument raged among computer engineers over the order in which numbers should be stored in computer memory.

During the arguments, some (mostly among the least-significant-first camp, IIRC) pointed out the Lilliputian argument between the "Little-Endians" and "Big-Endians" in Gulliver's Travels. The least-significant-first camp claimed the position of Little-Endian-ness, which left the most-significant-first camp associated with High Church once the allegory was commonly accepted.

In Gulliver's Travels, the arguments between Lilliput and Blefuscu, including the endianness argument, are depicted by Swift as mere matters of fashion.

Most of the least-significant-first camp took the same approach: In memory, endianness doesn't matter.

This was a bit of duplicitous implicit poisoning of the well, similar to the habit Intel salescrew had a decade or two later of claiming that Turing complete CPUs were all equivalent, therefore we should all buy their self-proclaimed most popular CPU -- false analogy and a lot of logical steps being skipped among other things.

To summarize the question of byte order, we need to take a general look at data in computer memory. Computer memory is organized as an array of sequentially addressable elements, which implies an ordering to the contents of memory:


Example 1, no number.
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22
 D  a  t  a     i  n     m  e  m  o  r  y  .  0  0  0  0  0  0  0  0

Let's put the number 123456 (one hundred twenty three thousand four hundred fifty-six) encoded as text after the description:

Example 2, number 123456 as text.
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22
 D  a  t  a     i  n     m  e  m  o  r  y  :     1  2  3  4  5  6  0

Note that text is naturally recorded most significant digit first in English.

(Thousands group separators just get in the way in computers, so I just left the comma out.)

If we wrote 123456 textually least significant digit first, it would look like this:

Example 3, number 123456
as least significant digit first text.
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22
 D  a  t  a     i  n     m  e  m  o  r  y  :     6  5  4  3  2  1  0

You may be wondering why someone would write numbers backwards, but there are actually language/numeric contexts in which least significant digit first is the common order. (They may be useful to refer to later.) Even in English, we have contexts like dates where the order is not most significant first:
  • September 17, 1787 (mixed order) and
  • 17 September 1787 (least significant first)

So we know that it is not completely unthinkable to do such a thing.

Now, text is actually encoded in computer memory as strings of numeric codes. Let's look at the data in the second example, reading it as hexadecimal numbers that represent the characters of the text instead of interpreting it as text:

Example 2, view 2 (ASCII/Unicode UTF-8),
raw contents displayed in hexadecimal.
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22
44 61 74 61 20 69 6E 20 6D 65 6D 6F 72 79 3A 20 31 32 33 34 35 36 00

That's interesting, isn't it?

No?

Okay, let's leave everything but the number interpreted as text:

Example 2, view 3,
numeric text "123456" uninterpreted and shown in hexadecimal.
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22
 D  a  t  a     i  n     m  e  m  o  r  y  :     31 32 33 34 35 36 00

Now, we haven't actually been changing what is in memory in Example 2. We're just changing how we look at it. We are trying to get a sense of what is actually stored in memory. (If you have a decent OS, you have command line tools like hexdump that allow you to look at files this way. You should try it some time.)

So, now let's try changing the form of the number. Instead of putting it in as text, let's put it in as a number -- an integer. (It's convenient that the address where it will go is 16, for something we call alignment, but we won't really talk about that just yet.)

First, we need to rewrite 123456 (one hundred twenty-three thousand four hundred fifty-six) as a hexadecimal number:
123456 ÷ 16^4 = 1 rem 57920
57920 ÷ 16^3 = 14 (E in hexadecimal) rem 576
576 ÷ 16^2 = 2 rem 64
64 ÷ 16^1 = 4 rem 0
So,
123456 == 1E240 in hexadecimal
Two hexadecimal digits take one byte in memory on a computer with an 8 bit byte.

(Numbers up to 4294967295 (FFFFFFFF in hexadecimal) can be stored in four bytes on computers with an 8-bit byte.)

Let's look at 123456 (1E240 in hexadecimal) stored at location 16, most significant byte first:

Example 4 MSB,
123456 (1E240 in hexadecimal) directly in memory, MSB first.
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22
 D  a  t  a     i  n     m  e  m  o  r  y  :     00 01 E2 40  0  0  0

Now let's look at the same number, stored least significant byte first:

Example 4 LSB,
123456 (1E240 in hexadecimal) directly in memory, LSB first.
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22
 D  a  t  a     i  n     m  e  m  o  r  y  :     40 E2 01 00  0  0  0
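You can run the same experiment as these two Example 4 views on your own machine with a few lines of C. On a typical Intel or AMD machine you should see 40 E2 01 00 followed by zeros; on a big-endian machine the 1E240 reads in the order you would write it.

#include <stdio.h>
#include <string.h>

int main( void )
{  unsigned long number = 123456;   /* 1E240 in hexadecimal */
   unsigned char bytes[ sizeof number ];
   size_t i;
   memcpy( bytes, &number, sizeof number );   /* copy out the raw bytes */
   for ( i = 0; i < sizeof number; ++i )
      printf( "%02X ", bytes[ i ] );
   putchar( '\n' );
   return 0;
}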

For a CPU that is MSB first, it will always store and read MSB first (as in Example 4 MSB), so there's no problem.

And an LSB first CPU will always store and read LSB first, so, again, no problem.

The CPU is built to do it one way or the other, and it will always do it the way it's built, so there's no problem here.

That's the essence of the argument.

It's no longer true, and it really was never true. All bus masters have to agree on how they store numbers in a particular chunk of data or the numbers get turned upside down. (Or in the case of mixed mode integers, inside out and upside down, etc.)

Back then, however, CPUs did not usually have the ability to switch byte order without a bit of work. And alternate bus masters were not as common as now, and usually tended to be built specifically for the CPU.

These days, with intelligent I/O devices, alternate bus masters are rather common. (Graphics co-processors, network interfaces, disk drive interfaces, etc.) If one is going to be a bad boy and do byte order backwards from the rest, unless you isolate the bad boy somehow, things tend to fall apart.

But even the ability to switch byte order does not materially change the arguments.

On a CPU that can switch byte order natively, byte order becomes just another property of the integer stored in memory, which the programmer must keep track of, along with the address, size, presence of sign, etc. As long as the software and hardware respect the properties of the integer in memory, there is no problem.

Well, no problem in isolation.

But there is one virtual bus master that tends, in most of the world, to be most significant first when looking at numbers -- the human who might debug the program by looking at the raw contents of memory without access to the detail of the compiled program.

No number exists in isolation.

There it is, the fatal assumption of the argument:
... in isolation ...
Nothing in this world exists in isolation.

Why am I going around in circles on this subject?

In modern hardware, we have multiple CPUs and other devices on the computer bus.

Even in the past, the programmer often had to look at what was in memory in order to tell what the program was doing. He was, in effect, another CPU on the bus, as I said above.

Before we take a look at the reasons not to use least significant first byte order, let's look at the primary argument in favor: it theoretically speeds up some hardware processes, and it made the 8080 and 6502 (among other CPUs) cheaper to produce.

To figure out why, consider that when you perform math on numbers, you start at the least significant end. Let's do a subtraction of two moderately large numbers:

  123456
 - 98765
 -------
   24691
You started with the column on the right,
6 - 5 = 1
right?

CPUs have to point at what they work on, and the idea is that, if they are pointing at the number already, it's handy to be pointing at the first byte to do the math on.

It sounds reasonable, now that you think of it, right?

There are some other issues, like aligning the number before you start, which also appear to have some advantage when the small end is what you point at.
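
To make it concrete, here is a sketch of subtraction done the way an 8-bit CPU does it: point at the least significant byte, walk upward, and propagate the borrow. The numbers are the same ones from the long subtraction above, and the helper name subLSB1st is mine, just for illustration:

#include <stdio.h>

/* Subtract two multi-byte numbers stored LSB first, the way an
// 8-bit CPU would: start at the low byte and carry the borrow up.
*/
void subLSB1st( unsigned char * diff,
                const unsigned char * a, const unsigned char * b,
                int count )
{  unsigned borrow = 0;
   int i;
   for ( i = 0; i < count; ++i )
   {  unsigned t = a[ i ] - b[ i ] - borrow;
      diff[ i ] = t & 0xFF;
      borrow = ( t >> 8 ) & 1;
   }
}

int main( void )
{  /* 123456 (1E240) and 98765 (181CD), stored LSB first: */
   unsigned char a[ 4 ] = { 0x40, 0xE2, 0x01, 0x00 };
   unsigned char b[ 4 ] = { 0xCD, 0x81, 0x01, 0x00 };
   unsigned char diff[ 4 ];
   subLSB1st( diff, a, b, 4 );
   printf( "%02X %02X %02X %02X\n",
           diff[ 0 ], diff[ 1 ], diff[ 2 ], diff[ 3 ] );
   /* Prints 73 60 00 00 -- that's 6073 (24691) read back to front. */
   return 0;
}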

Sounds like maybe the Little-Endian engineers are onto something, right?

Oh, dear. Maybe the Big-Endians should just shut up.

Well, let's put those arguments aside for a moment and talk about what the human who is trying to debug a program is going to see when he or she looks at a number stored least significant byte first. I'm pretty sure I can show you some problems with the Little-Endian attitude here.

Simple tools are the ones that are usually available. We'll make use of hexdump. If you are working on a Microsoft Windows workstation, you can install Cygwin to get Unix tools, including hexdump, the GNU C compiler gcc, and gforth (and lots of other good stuff like bc).

We'll also make use of a little programming in C:



/* Program to demonstrate the effect of LSB1st vs. MSB1st integers
// by Joel Matthew Rees, Amagasaki, Japan
// Copyright 2017 Joel Matthew Rees
// All rights reserved.
// Permission granted to use for personal purposes.
// See http://defining-computers.blogspot.com/2017/04/lsb-1st-vs-msb-1st-ignore-gulliver.html
// Can be downloaded here:
// https://osdn.net/users/reiisi/pastebin/5027
*/

#include <stdio.h>
#include <stdlib.h>

/* #define NO__DEPENDENCY_ON_LIMITS_H */


#if !defined NO__DEPENDENCY_ON_LIMITS_H
#include <limits.h>
#  define byteWidth ( (size_t) CHAR_BIT )
#  define byteMask ( (unsigned long) (unsigned char) ( (unsigned long) -1 ) )
#  define ulBytes ( sizeof (unsigned long) ) /* a run-time constant */
#else
unsigned byteWidth = 8; /* Not depending on limits.h . */
unsigned long byteMask = 0xFF;
unsigned ulBytes = 4; /* Sane throw-away initial values. */

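/* Measure the byte at run time: shift a bit off the top of an unsigned char
// to count its width and build its mask, then count the bits of an
// unsigned long the same way and divide to get its size in bytes.
*/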
void setULbytes( void )
{  int i = 0;
   unsigned char chroller = 1;
   unsigned char chMask = 1;
   unsigned long ulroller = 1;
   while ( chroller != 0 )
   {  chroller <<= 1;
      chMask = ( chMask << 1 ) | 1;
      ++i;
   }
   byteMask = chMask;
   byteWidth = i;
   i = 0;
   while ( ulroller != 0 )
   {  ulroller <<= 1;
      ++i;
   }
   ulBytes = i / byteWidth;
}
#endif


/* Emit ivalue to stdout one byte at a time, least significant byte first.
// If early is non-zero, stop as soon as the remaining value is zero.
*/
int putLSB( unsigned long ivalue, int early )
{  int i = 0;
   do
   {  putchar( ivalue & byteMask );
      ++i;
      ivalue >>= byteWidth; /* not a bare 8, in case the byte isn't 8 bits */
   } while ( ( i < ulBytes ) && !( early && ( ivalue == 0 ) ) );
   return i;
}

/* Emit ivalue to stdout one byte at a time, most significant byte first.
// If early is non-zero, stop once the bytes not yet emitted are all zero.
*/
int putMSB( unsigned long ivalue, int early )
{  int i = 0;
   do
   {  putchar( ( ivalue >> ( ( ulBytes - 1 ) * byteWidth ) ) & byteMask );
      ++i;
      ivalue <<= byteWidth;
   } while ( ( i < ulBytes ) && !( early && ( ivalue == 0 ) ) );
   return i;
}

void fillch( int count, char ch )
{  while ( count-- > 0 )
   {  putchar( ch );
   }
}

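/* Convert ivalue to digit characters in the given base, building the text
// from the least significant digit up, and return the count of digits.
*/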
int printInteger( unsigned long ivalue, unsigned base )
{  char buffer[ 65 ];
   char * cpt = buffer + 65;
   * --cpt = '\0';
   if ( base > 36 )
   { base = 10;
   }
   do
   {  int ch = ivalue % base;
      ivalue /= base;
      ch += '0';
      if ( ch > '9' )
      {  ch += 'A' - '9' - 1;
      }
      * --cpt = ch;
   } while ( ivalue > 0 );
   fputs( cpt, stdout );
   return 64 - ( cpt - buffer );
}

int main( int argc, char *argv[] )
{
   unsigned long my_integer = 123456;
   int index = 1;
   int length;

#if defined NO__DEPENDENCY_ON_LIMITS_H
   setULbytes();
#endif
   if ( argc > 1 )
   {  char * endpt = argv[ 1 ];
      my_integer = strtoul( argv[ 1 ], &endpt, 0 );
      if ( endpt > argv[ 1 ] )
      {  ++index;
      }
      else
      {  my_integer = 123456;
      }
   }

   printf( "Data in memory: " );
   length = printInteger( my_integer, 10 );
   fillch( 32 - length, '\0' );
   length = printInteger( my_integer, 16 );
   fillch( 32 - length, '\0' );

   printf( "LSB1st early:   " );
   length = putLSB( my_integer, 1 );
   fillch( 16 - length, '-' );

   printf( "LSB1st full:    " );
   length = putLSB( my_integer, 0 );
   fillch( 16 - length, '-' );

   printf( "MSB1st early:   " );
   length = putMSB( my_integer, 1 );
   fillch( 16 - length, '-' );

   printf( "MSB1st full:    " );
   length = putMSB( my_integer, 0 );
   fillch( 16 - length, '-' );
   putchar( '\n' );

   return EXIT_SUCCESS;
}



[JMR201704281355:

This can be downloaded at
https://osdn.net/users/reiisi/pastebin/5027
A previous version at
https://osdn.net/users/reiisi/pastebin/5026
will eventually be taken off line.

]

Compile it with the usual
cc -Wall -o lsbmsb lsbmsb.c
and run it with something like
  • ./lsbmsb | hexdump -C
  • ./lsbmsb 1234567890 | hexdump -C
  • ./lsbmsb 0x12345 | hexdump -C
  • ./lsbmsb 0x12345 | hexdump # look at it two bytes at a time.
  • ./lsbmsb $(( 123456 * 256 )) | hexdump -C
  • # etc.
Be sure to leave the -C off a few times, to see what happens when it tries to interpret memory as a series of sixteen bit words instead of a series of eight bit bytes.

Hmm.




me@mycomputer:~/work/mathgames/eco101$ ./lsbmsb | hexdump -C
00000000  44 61 74 61 20 69 6e 20  6d 65 6d 6f 72 79 3a 20  |Data in memory: |
00000010  31 32 33 34 35 36 00 00  00 00 00 00 00 00 00 00  |123456..........|
00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000030  31 45 32 34 30 00 00 00  00 00 00 00 00 00 00 00  |1E240...........|
00000040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000050  4c 53 42 31 73 74 20 65  61 72 6c 79 3a 20 20 20  |LSB1st early:   |
00000060  40 e2 01 2d 2d 2d 2d 2d  2d 2d 2d 2d 2d 2d 2d 2d  |@..-------------|
00000070  4c 53 42 31 73 74 20 66  75 6c 6c 3a 20 20 20 20  |LSB1st full:    |
00000080  40 e2 01 00 00 00 00 00  2d 2d 2d 2d 2d 2d 2d 2d  |@.......--------|
00000090  4d 53 42 31 73 74 20 65  61 72 6c 79 3a 20 20 20  |MSB1st early:   |
000000a0  00 00 00 00 00 01 e2 40  2d 2d 2d 2d 2d 2d 2d 2d  |.......@--------|
000000b0  4d 53 42 31 73 74 20 66  75 6c 6c 3a 20 20 20 20  |MSB1st full:    |
000000c0  00 00 00 00 00 01 e2 40  2d 2d 2d 2d 2d 2d 2d 2d  |.......@--------|
000000d0  0a                                                |.|
000000d1
me@mycomputer:~/work/mathgames/eco101$ ./lsbmsb | hexdump
0000000 6144 6174 6920 206e 656d 6f6d 7972 203a
0000010 3231 3433 3635 0000 0000 0000 0000 0000
0000020 0000 0000 0000 0000 0000 0000 0000 0000
0000030 4531 3432 0030 0000 0000 0000 0000 0000
0000040 0000 0000 0000 0000 0000 0000 0000 0000
0000050 534c 3142 7473 6520 7261 796c 203a 2020
0000060 e240 2d01 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d
0000070 534c 3142 7473 6620 6c75 3a6c 2020 2020
0000080 e240 0001 0000 0000 2d2d 2d2d 2d2d 2d2d
0000090 534d 3142 7473 6520 7261 796c 203a 2020
00000a0 0000 0000 0100 40e2 2d2d 2d2d 2d2d 2d2d
00000b0 534d 3142 7473 6620 6c75 3a6c 2020 2020
00000c0 0000 0000 0100 40e2 2d2d 2d2d 2d2d 2d2d
00000d0 000a                                  
00000d1
me@mycomputer:~/work/mathgames/eco101$ ./lsbmsb 0x1E24000 | hexdump -C
00000000  44 61 74 61 20 69 6e 20  6d 65 6d 6f 72 79 3a 20  |Data in memory: |
00000010  33 31 36 30 34 37 33 36  00 00 00 00 00 00 00 00  |31604736........|
00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000030  31 45 32 34 30 30 30 00  00 00 00 00 00 00 00 00  |1E24000.........|
00000040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000050  4c 53 42 31 73 74 20 65  61 72 6c 79 3a 20 20 20  |LSB1st early:   |
00000060  00 40 e2 01 2d 2d 2d 2d  2d 2d 2d 2d 2d 2d 2d 2d  |.@..------------|
00000070  4c 53 42 31 73 74 20 66  75 6c 6c 3a 20 20 20 20  |LSB1st full:    |
00000080  00 40 e2 01 00 00 00 00  2d 2d 2d 2d 2d 2d 2d 2d  |.@......--------|
00000090  4d 53 42 31 73 74 20 65  61 72 6c 79 3a 20 20 20  |MSB1st early:   |
000000a0  00 00 00 00 01 e2 40 2d  2d 2d 2d 2d 2d 2d 2d 2d  |......@---------|
000000b0  4d 53 42 31 73 74 20 66  75 6c 6c 3a 20 20 20 20  |MSB1st full:    |
000000c0  00 00 00 00 01 e2 40 00  2d 2d 2d 2d 2d 2d 2d 2d  |......@.--------|
000000d0  0a                                                |.|
000000d1
me@mycomputer:~/work/mathgames/eco101$ ./lsbmsb 0x1E24000 | hexdump
0000000 6144 6174 6920 206e 656d 6f6d 7972 203a
0000010 3133 3036 3734 3633 0000 0000 0000 0000
0000020 0000 0000 0000 0000 0000 0000 0000 0000
0000030 4531 3432 3030 0030 0000 0000 0000 0000
0000040 0000 0000 0000 0000 0000 0000 0000 0000
0000050 534c 3142 7473 6520 7261 796c 203a 2020
0000060 4000 01e2 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d
0000070 534c 3142 7473 6620 6c75 3a6c 2020 2020
0000080 4000 01e2 0000 0000 2d2d 2d2d 2d2d 2d2d
0000090 534d 3142 7473 6520 7261 796c 203a 2020
00000a0 0000 0000 e201 2d40 2d2d 2d2d 2d2d 2d2d
00000b0 534d 3142 7473 6620 6c75 3a6c 2020 2020
00000c0 0000 0000 e201 0040 2d2d 2d2d 2d2d 2d2d
00000d0 000a                                  
00000d1



Now you may be saying you'd rather not be looking at any of that, but if you really had to, if you had no choice but to look at one or the other, which would you rather look at? LSB1st or MSB1st? Remember, the numbers you are looking for will usually be mixed with text, and the text will likely help you find what you are looking for. If the text gets byte-swapped on you, it's going to be just that much harder.

The salesman says he has tools to let you look at the data, so you don't have to worry. That's all well and good, but it makes you dependent on the vendor, even when the vendor has time and budget to help you.

When the vendor doesn't have time or budget, wouldn't you rather be able to use simple tools, at any rate? -- as a start, before you set to making your own tools?

Somebody usually pipes up with, "Well, if you guys would all join us Little-Endians, if everybody did it all the same, there'd be no problems!"

So. From now on, everyone does Little-Endian. Blogs? News aggregators? Textbooks? Novels? Are we going to go back and reprint all the classics with Little-Endian numbers?
  • 71 September 7871?
No, of course not. Much easier to just become dependent on our vendor. I mean, we trust them, right? And they deserve a guaranteed revenue stream, too.

Somebody pipes up about now saying everything I'm talking about is human stuff, not technical at all.

The Unicode Consortium determined that they did not want to be caught up in the argument. So they decided that Unicode characters could be encoded either direction. They even figured out how to put a flag called the Byte Order Mark at the beginning of a stream of Unicode text, to warn the consumer of the stream what order to expect the characters in.
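
Sniffing that byte order mark takes no more than a peek at the first two bytes of the stream. A minimal sketch (mine, not the Consortium's; the BOM is U+FEFF, and the byte-swapped U+FFFE is deliberately not a character):

#include <stdio.h>

int main( void )
{  int b0 = getchar();
   int b1 = getchar();
   if ( b0 == 0xFE && b1 == 0xFF )
   {  puts( "UTF-16, MSB first (big-endian)" );
   }
   else if ( b0 == 0xFF && b1 == 0xFE )
   {  puts( "UTF-16, LSB first (little-endian)" );
   }
   else
   {  puts( "no UTF-16 byte order mark" );
   }
   return 0;
}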

Characters, you see, are not integers after all, contrary to the opinions of many a respected computer scientist. Little-Endian byte order enforces this factoid.

Well, the folks who defined the Internet decided they did not want to be reading data streams and crossing their eyes to read the IP addresses and other numeric data buried in the stream. So network byte order is the one that is easy to read, most significant first. If one hundred twenty-three thousand four hundred fifty-six is in the data stream, it shows up as 123456, not 654321.
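
On POSIX-ish systems, getting into network byte order is the job of htonl() and its siblings (host-to-network, long). A small sketch showing what it guarantees -- the bytes come out MSB first whatever the host order:

#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>   /* htonl() */

int main( void )
{  uint32_t wire = htonl( 123456 );   /* 0x0001E240 */
   unsigned char * bytes = (unsigned char *) &wire;
   printf( "on the wire: %02X %02X %02X %02X\n",
           bytes[ 0 ], bytes[ 1 ], bytes[ 2 ], bytes[ 3 ] );
   /* Prints 00 01 E2 40 on both LSB first and MSB first hosts. */
   return 0;
}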

In setting up the demonstrations of byte order differences, I went to some pains to demonstrate one big difference between the byte orders. If you look carefully at the dashes, you may see how least significant first allows you to optimize math. If you can track the presence of carries, you can stop adding small numbers to big ones as soon as the carries disappear.

Looks interesting, right?

Tracking the carries takes more processor time and resources than simply finishing the addition out. This is one of those false early optimizations that have historically killed a lot of software projects.

Worse, the programmer can look at one of these and think a particular case will never generate carries. This is almost always self-deception. The assumptions required to keep the carries from happening fail in the end user's context just often enough to cause hidden bugs of the integer overflow variety.

Isn't that strongly enough stated?

When we humans look at numbers, we perceive them as text. That allows us to do many things without consciously thinking about them, like moving to the end of a number or aligning two numbers. CPUs have to do the same things with numerical text, as we can intuit by looking back at example 2.
When CPUs work with numbers, they have to figure out all sorts of things about the number which we subconsciously read from the text --
Is there a sign? 
Is there a decimal point?
How big is the number?
If there is no text, they have no clue ... unless the programmer has already told them.

Here is perhaps the strongest argument against least significant first: It induces bugs into software.

Some people think it's a good thing to induce bugs into software. They think it guarantees their after-market revenue stream.

I think there will always be enough bugs without practicing dodgy false optimizations, but what do I know? I've wasted two days I didn't have tilting at this, erm, rainbow. (Or chasing this windmill, maybe?)

One of these days I'll get someone to pay me to design a language that combines the best of Forth and C. Then I'll be able to leap wide instruction sets with a single #ifdef, run faster than a speeding infinite loop with a #define, and stop all integer size bugs with a bare cast. And the first processor it will target will be a 32/64 bit version of the 6809 which will not be least significant bit first.

Friday, April 21, 2017

The Problem with Full Unicode Domain Names -- apple.com vs. appІe.com

Well, this one is taking longer to boil over than I expected. I've been watching for the storm for over fifteen years, and convoluted fixes on fixes have dodged the bullet this long.

[JMR201704221101: addendum]

I should note that the primary danger comes from clicking links given you by untrusted sources. The best solution here is not to do that. Abstain. Don't click on the links.

Copy them out, look at them in a text editor using a technical font that shows differences between I, 1, and l, and between 0 and O, etc.

Plug the URL into the search field of a web search engine -- not into the URL bar of your browser; that takes you straight there. Let the search engine tell you what it knows about the site before you go there.

Then type in the domain name part by hand. If you have the URL

http://shop.apple.com/login/username=pinkfloy&mode=longstringofstuff
the domain name part is
shop.apple.com 
(There's more that can be said, but I don't want to confuse you about controlling domains, so just type the whole domain name.)

If that's too much trouble, maybe you didn't want to go there anyway. But at least click on something the search engine shows you instead of the link in the e-mail.

[JMR201704221101: end addendum]

The problem:


Depending on your default fonts, you may be able to see a difference between the following two domain names:

apple.com vs. appІe.com
It's similar to the problem with
apple.com vs. appIe.com
but with a twist. The first one uses a Cyrillic (as in Russian) character to potentially cause confusion, where the second one keeps the trickiness all in the Latin (as in English) alphabet.

Let's look at both of those again, and I'll try to specify a font where there will be problems. First, we'll try the Arial font (if it's on your computer):
apple.com vs. appІe.com
(Latin little "l" -- Cyrillic capital "І")
and next the Courier font (if it's on your computer):
apple.com vs. appІe.com
(Latin little "l" -- Cyrillic capital "І")
And we'll look at the Latin-only domain names, first in Arial:
apple.com vs. appIe.com
(Latin little "l" -- Latin capital "I")
and then in Courier:
apple.com vs. appIe.com
(Latin little "l" -- Latin capital "I")

Do you see what's happening?

Someone could grab the domain with the visual spoof and trick you into giving them your Apple login and password and maybe even your credit card number.

When domain names were all lower case Latin, we had fewer problems. In other words,
appIe.com 
was properly spelled
appie.com
and the browser would display it in the latter form.


Well, there was still the problem with
app1e.com
substituting the number "1" for the little "l". But the registrars tended to try to help by refusing to register confusing domain names. And browsers were careful to use fonts that would show the differences in the URL bar.

Some time ago, pretty much all Unicode language scripts became allowed in domain names. This was strongly pushed by China, where they did not want {sarcasm-alert} to have all their loyal subjects surfing the Internet in Latin. That would let everyone see how superior English is, and that would never do.{end sarcasm-alert.}

(I shouldn't be sarcastic. They do need Chinese URLs. Otherwise, there would be too many companies competing for bai.com and ma.com.)

Apparently, non-Latin scripts even seem to be allowed to use capitals. Or, at least, unscrupulous or careless registrars seem to be allowing them in some cases. I'm not sure why.

(Here's the RFC. What am I missing?)

If the Cyrillic visual spoof I am using as an example were coerced to lower case in the URL bar, here's what it would look like in the Arial font:
apple.com vs. appіe.com
(Latin little "l" -- Cyrillic lower case "і"
That would solve a lot of problems.

If you are worried about this, one thing that can help, if you are using Firefox, is to type
about:config
in the URL bar. (That's where URLs like
https://www.lds.org
show up, and you can type them in by hand to go there.)

You'll get a warning that tells you that the Mozilla Foundation is not going to take the blame if you use non-default settings. (They won't anyway, but don't check the box that says you don't want to be warned. And remember that you have done this.)

Use the search bar to search for
punycode
and you'll find
network.IDN_show_punycode;false
Double-click the "false" and it will turn to "true". And then URLs like
www.appІe.com
will be displayed in the status bar as URLs like
www.xn--80ak6aa92e.com/
Now, that's ugly, don't you think? Anyway, you won't be mistaking it for
www.apple.com
(This is called punycode. Hmm. Actually, the Japanese page on punycode shows what's happening a little better than the English page.)

Then again, you will be wondering what that URL means. So I don't really know if I want to recommend it.

If I were a Mozilla developer involved with this, I would take a clue from what I've done above and do it like this:
www.apple.com (all Latin)
but
www.appІe.com (Cyrillic "І")
In other words, all the characters in URLs from languages other than the browser's default language would be displayed with colored backgrounds to make them stand out. And I might even add a warning bubble or something that said,
Warning! Mixed language URL contains Cyrillic "І"!
floating over the URL. (There's a rough sketch of such a check in code after the list below.) This approach would mitigate a lot, including
  • Іds.org (Cyrillic)
  • аррӏе.com (Cyrillic)
  • ｐｅｒｌ.org (zenkaku, or full-width)
and so forth.
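
For what it's worth, the bare bones of such a warning are not hard. Here is a crude sketch of mine -- a real browser would consult Unicode script tables, where this only flags any byte outside plain ASCII in a UTF-8 hostname, and the helper name worthWarning is just for illustration:

#include <stdio.h>

/* Return non-zero if the UTF-8 hostname contains any non-ASCII byte.
// Crude: a real check would classify by Unicode script, not just ASCII.
*/
int worthWarning( const char * host )
{  const unsigned char * p = (const unsigned char *) host;
   while ( *p != '\0' )
   {  if ( *p > 0x7F )
      {  return 1;
      }
      ++p;
   }
   return 0;
}

int main( void )
{  const char * plain = "www.apple.com";
   const char * spoof = "www.app\xD0\x86" "e.com";   /* Cyrillic І in UTF-8 */
   printf( "%s: %s\n", plain, worthWarning( plain ) ? "warn" : "ok" );
   printf( "spoof:         %s\n", worthWarning( spoof ) ? "warn" : "ok" );
   return 0;
}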

(I thought this was in the RFCs, but I'm not seeing it. Maybe I'm remembering my own thoughts on how to mitigate this particular semantic attack.)

I have advocated improving Unicode by reconstructing the encoding and including an international character set where such visual doublings could be eliminated. And separating Chinese and Japanese language encodings, and the three different Chinese encodings from each other, as well.

Nobody seems to like the idea.

It's a lot of work.

I'd be willing to do it relatively cheap! (Relatively.)