defining computers: LSB 1st vs. MSB 1st -- Ignore Gulliver

I was working on a programming problem for a novel I'm trying to write, and got this bug under my skin again.

Something like fifty years ago, an argument raged among computer engineers over the order in which numbers should be stored in computer memory.

During the arguments, some (mostly among the least-significant-first camp, IIRC) pointed out the Lilliputian argument between the "Little-Endians" and "Big-Endians" in Gulliver's Travels. The least-significant-first camp claimed the position of Little-Endian-ness, which left the most-significant-first camp associated with High Church once the allegory was commonly accepted.

In Gulliver's Travels, the arguments between Lilliput and Blefuscu, including the endianness argument, are depicted by Swift as mere matters of fashion.

Most of the least-significant-first camp took the same approach: In memory, endianness doesn't matter.

This was a bit of duplicitous implicit poisoning of the well, similar to the habit Intel salescrew had a decade or two later of claiming that Turing complete CPUs were all equivalent, therefore we should all buy their self-proclaimed most popular CPU -- false analogy and a lot of logical steps being skipped among other things.

To summarize the question of byte order, we need to take a general look at data in computer memory. Computer memory is organized as an array of sequentially addressable elements, which implies an ordering to the contents of memory:

Example 1, no number.
00	01	02	03	04	05	06	07	08	09	10	11	12	13	14	15	16	17	18	19	20	21	22
D	a	t	a		i	n		m	e	m	o	r	y	.	0	0	0	0	0	0	0	0

Let's put the number 123456 (one hundred twenty three thousand four hundred fifty-six) encoded as text after the description:

Example 2, number 123456 as text.
00	01	02	03	04	05	06	07	08	09	10	11	12	13	14	15	16	17	18	19	20	21	22
D	a	t	a		i	n		m	e	m	o	r	y	:		1	2	3	4	5	6	0

Note that text is naturally recorded most significant digit first in English.

(Thousands group separators just get in the way in computers, so I just left the comma out.)

If we wrote 123456 textually least significant digit first, it would look like this:

Example 3, number 123456
as least significant digit first text.
00	01	02	03	04	05	06	07	08	09	10	11	12	13	14	15	16	17	18	19	20	21	22
D	a	t	a		i	n		m	e	m	o	r	y	:		6	5	4	3	2	1	0

You may be wondering why someone would write numbers backwards, but there are actually language/numeric contexts in which least significant digit first is the common order. (They may be useful to refer to later.) Even in English, we have contexts like dates where the order is not most significant first:

September 17, 1787 (mixed order) and
17 September 1787 (least significant first)

So we know that it is not completely unthinkable to do such a thing.

Now, text is actually encoded in computer memory as strings of numeric codes. Let's look at the data in the second example, reading it as hexadecimal numbers that represent the characters of the text instead of interpreting it as text:

Example 2, view 2, (ASCII/Unicode UTF-8),
raw contents displayed in hexadecimal.
00	01	02	03	04	05	06	07	08	09	10	11	12	13	14	15	16	17	18	19	20	21	22
44	61	74	61	20	69	6E	20	6D	65	6D	6F	72	79	3A	20	31	32	33	34	35	36	00

That's interesting, isn't it?

No?

Okay, let's leave everything but the number interpreted as text:

Example 2, view 3,
numeric text "123456" uninterpreted and shown in hexadecimal.
00	01	02	03	04	05	06	07	08	09	10	11	12	13	14	15	16	17	18	19	20	21	22
D	a	t	a		i	n		m	e	m	o	r	y	:		31	32	33	34	35	36	00

Now, we haven't actually been changing what is in memory in Example 2. We're just changing how we look at it. We are trying to get a sense of what is actually stored in memory. (If you have a decent OS, you have command line tools like hexdump that allow you to look at files this way. You should try it some time.)
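If you don't have hexdump handy, a few lines of C can give you roughly the same view. This is only a sketch: hex_view is a name I made up for illustration, and it assumes 8 bit bytes and an ASCII-compatible character set.

```c
#include <assert.h>
#include <stddef.h>

/* hex_view is a hypothetical helper, not part of any library.
// Writes each byte of src as two uppercase hex digits, separated by spaces,
// into out, and returns the number of characters written (not counting NUL).
// Assumes 8 bit bytes.
*/
size_t hex_view( const unsigned char * src, size_t len, char * out, size_t outsize )
{  static const char digits[] = "0123456789ABCDEF";
   size_t i;
   size_t o = 0;
   assert( src != NULL && out != NULL );
   assert( outsize >= len * 3 + 1 );   /* room for "XX " per byte plus NUL */
   for ( i = 0; i < len; ++i )
   {  if ( i > 0 )
      {  out[ o++ ] = ' ';
      }
      out[ o++ ] = digits[ ( src[ i ] >> 4 ) & 0x0F ];   /* high four bits */
      out[ o++ ] = digits[ src[ i ] & 0x0F ];            /* low four bits */
   }
   out[ o ] = '\0';
   return o;
}
```

Handing it the seven bytes of "123456" (the six digits plus the terminating zero) should give the same 31 32 33 34 35 36 00 shown at addresses 16 through 22 in view 2.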

So, now let's try changing the form of the number. Instead of putting it in as text, let's put it in as a number -- an integer. (It's convenient that the address where it will go is 16, for something we call alignment, but we won't really talk about that just yet.)

First, we need to rewrite 123456 (one hundred twenty-three thousand four hundred fifty-six) as a hexadecimal number:

123456 ÷ 16⁴ = 1 rem 57920
57920 ÷ 16³ = 14 (E₁₆) rem 576
576 ÷ 16² = 2 rem 64
64 ÷ 16¹ = 4 rem 0

So,

123456 == 1E240₁₆
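The same long division can be sketched in C. The function name to_hex_by_powers is hypothetical, and this is just one way to do it: find the largest power of 16 that fits, then peel off one digit per power, exactly as in the four steps above.

```c
#include <assert.h>
#include <stddef.h>

/* to_hex_by_powers is a hypothetical name for illustration.
// Converts value to hexadecimal text by dividing by descending powers of 16,
// most significant digit first. Returns the number of digits written.
// If out is too small, the low digits are silently dropped.
*/
size_t to_hex_by_powers( unsigned long value, char * out, size_t outsize )
{  static const char digits[] = "0123456789ABCDEF";
   unsigned long power = 1;
   size_t o = 0;
   assert( out != NULL && outsize >= 2 );
   /* Find the largest power of 16 that does not exceed value. */
   while ( value / power >= 16 )
   {  power *= 16;
   }
   /* The quotient is the digit; the remainder goes on to the next power. */
   while ( power > 0 && o + 1 < outsize )
   {  out[ o++ ] = digits[ value / power ];
      value %= power;
      power /= 16;
   }
   out[ o ] = '\0';
   return o;
}
```

For 123456 this walks through 16⁴, 16³, 16², 16¹, and 16⁰, producing the digits 1, E, 2, 4, and 0 in that order.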

Two hexadecimal digits take one byte in memory on a computer with an 8 bit byte.

(Numbers up to 4294967295 (FFFFFFFF₁₆) can be stored in four bytes on computers with an 8 bit byte.)

Let's look at 123456 (1E240₁₆) stored at location 16, most significant byte first:

Example 4 MSB,
123456 (1E240₁₆) directly in memory, MSB first.
00	01	02	03	04	05	06	07	08	09	10	11	12	13	14	15	16	17	18	19	20	21	22
D	a	t	a		i	n		m	e	m	o	r	y	:		00	01	E2	40	0	0	0

Now let's look at the same number, stored least significant byte first:

Example 4 LSB,
123456 (1E240₁₆) directly in memory, LSB first.
00	01	02	03	04	05	06	07	08	09	10	11	12	13	14	15	16	17	18	19	20	21	22
D	a	t	a		i	n		m	e	m	o	r	y	:		40	E2	01	00	0	0	0
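In C, the two layouts of Example 4 can be sketched with shifts and masks, which behave the same no matter which order the CPU itself prefers. The names store32_msb and store32_lsb are my own, and the sketch assumes 8 bit bytes and a four-byte integer.

```c
#include <assert.h>
#include <stddef.h>

/* store32_msb and store32_lsb are hypothetical names for illustration.
// Each stores the low 32 bits of value into four consecutive bytes.
*/
void store32_msb( unsigned char * mem, unsigned long value )
{  assert( mem != NULL );
   mem[ 0 ] = (unsigned char) ( ( value >> 24 ) & 0xFF );   /* most significant */
   mem[ 1 ] = (unsigned char) ( ( value >> 16 ) & 0xFF );
   mem[ 2 ] = (unsigned char) ( ( value >> 8 ) & 0xFF );
   mem[ 3 ] = (unsigned char) ( value & 0xFF );             /* least significant */
}

void store32_lsb( unsigned char * mem, unsigned long value )
{  assert( mem != NULL );
   mem[ 0 ] = (unsigned char) ( value & 0xFF );             /* least significant */
   mem[ 1 ] = (unsigned char) ( ( value >> 8 ) & 0xFF );
   mem[ 2 ] = (unsigned char) ( ( value >> 16 ) & 0xFF );
   mem[ 3 ] = (unsigned char) ( ( value >> 24 ) & 0xFF );   /* most significant */
}
```

Storing 123456 with store32_msb should leave 00 01 E2 40 in the four bytes, and store32_lsb should leave 40 E2 01 00, matching the two tables.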

For a CPU that is MSB first, it will always store and read MSB first (as in Example 4 MSB), so there's no problem.

And an LSB first CPU will always store and read LSB first, so, again, no problem.

The CPU is built to do it one way or the other, and it will always do it the way it's built, so there's no problem here.

That's the essence of the argument.

It's no longer true, and it really was never true. All bus masters have to agree on how they store numbers in a particular chunk of data or the numbers get turned upside down. (Or in the case of mixed mode integers, inside out and upside down, etc.)

Back then, however, CPUs did not usually have the ability to switch byte order without a bit of work. And alternate bus masters were not as common as now, and usually tended to be built specifically for the CPU.

These days, with intelligent I/O devices, alternate bus masters are rather common. (Graphics co-processors, network interfaces, disk drive interfaces, etc.) If one is going to be a bad boy and do byte order backwards from the rest, unless you isolate the bad boy somehow, things tend to fall apart.

But even the ability to switch byte order does not materially change the arguments.

On a CPU that can switch byte order natively, byte order becomes just another property of the integer stored in memory, which the programmer must keep track of, along with the address, size, presence of sign, etc. As long as the software and hardware respect the properties of the integer in memory, there is no problem.
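Here is a rough sketch of what tracking that property looks like in software. The enum and the function load32 are hypothetical names, and the sketch assumes 8 bit bytes and a four-byte integer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names, for illustration only. */
enum byte_order { MSB_FIRST, LSB_FIRST };

/* Read a 32 bit unsigned integer from four bytes stored in the stated order. */
unsigned long load32( const unsigned char * mem, enum byte_order order )
{  unsigned long value = 0;
   size_t i;
   assert( mem != NULL );
   for ( i = 0; i < 4; ++i )
   {  /* Visit the bytes from most significant to least, wherever they sit. */
      size_t at = ( order == MSB_FIRST ) ? i : 3 - i;
      value = ( value << 8 ) | mem[ at ];
   }
   return value;
}
```

Read the bytes 00 01 E2 40 as MSB_FIRST and you get 1E240₁₆ back. Read the same bytes as LSB_FIRST and you get 40E20100₁₆, which is what happens when two bus masters disagree about the property.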

Well, no problem in isolation.

But there is one virtual bus master that tends, in most of the world, to be most significant first when looking at numbers -- the human who might debug the program by looking at the raw contents of memory without access to the detail of the compiled program.

No number exists in isolation.

There it is, the fatal assumption of the argument:

... in isolation ...

Nothing in this world exists in isolation.

Why am I going around in circles on this subject?

In modern hardware, we have multiple CPUs and other devices on the computer bus.

Even in the past, the programmer often had to look at what was in memory in order to tell what the program was doing. He was, in effect, another CPU on the bus, as I said above.

Before we take a look at the reasons not to use least significant first byte order, let's look at the primary argument in favor: It theoretically speeds up some hardware processes and made the 8080 and 6502 (among other CPUs) cheaper to produce.

To see why, remember that when you perform math on numbers, you start at the least significant end. Let's do a subtraction of two moderately large numbers:

 123456
- 98765
-------
  24691

You started with the column on the right,

6 - 5 = 1

right?

CPUs have to point at what they work on, and the idea is that, if they are pointing at the number already, it's handy to be pointing at the first byte to do the math on.
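A sketch of that idea in C, doing the same subtraction a byte at a time on numbers stored least significant byte first. The function name sub_lsb_first is made up for this example, and it assumes 8 bit bytes.

```c
#include <assert.h>
#include <stddef.h>

/* sub_lsb_first is a hypothetical name for illustration.
// Subtracts b from a, both stored least significant byte first in len bytes,
// and writes the difference to diff. Returns the final borrow (1 if b > a).
*/
int sub_lsb_first( const unsigned char * a, const unsigned char * b,
                   unsigned char * diff, size_t len )
{  int borrow = 0;
   size_t i;
   assert( a != NULL && b != NULL && diff != NULL );
   for ( i = 0; i < len; ++i )   /* index 0 is where the math starts */
   {  int d = (int) a[ i ] - (int) b[ i ] - borrow;
      borrow = ( d < 0 );
      if ( borrow )
      {  d += 256;   /* borrow from the next more significant byte */
      }
      diff[ i ] = (unsigned char) d;
   }
   return borrow;
}
```

With 123456 stored as 40 E2 01 00 and 98765 (181CD₁₆) stored as CD 81 01 00, the result is 73 60 00 00, which is 6073₁₆ or 24691. An MSB first version would do the same work, but would have to start the loop at index len - 1 and count down.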

It sounds reasonable, now that you think of it, right?

There are some other issues, like aligning the number before you start, which also appear to have some advantage when the small end is what you point at.

Sounds like maybe the Little-Endian engineers are onto something?

Oh, dear. Maybe the Big-Endians should just shut up.

Well, let's put those arguments aside for a moment and talk about what the human who is trying to debug a program is going to see when he or she looks at a number stored least significant byte first. I'm pretty sure I can show you some problems with the Little-Endian attitude here.

Simple tools are the ones that are usually available. We'll make use of hexdump. If you are working on a Microsoft Windows workstation, you can install Cygwin to get Unix tools, and Cygwin can give you access to hexdump and the GNU C compiler, gcc, and gforth (and lots of other good stuff like bc).

We'll also make use of a little programming in C:

/* Program to demonstrate the effect of LSB1st vs. MSB1st integers
// by Joel Matthew Rees, Amagasaki, Japan
// Copyright 2017 Joel Matthew Rees
// All rights reserved.
// Permission granted to use for personal purposes.
// See http://defining-computers.blogspot.com/2017/04/lsb-1st-vs-msb-1st-ignore-gulliver.html
// Can be downloaded here:
// https://osdn.net/users/reiisi/pastebin/5027
*/

#include <stdio.h>
#include <stdlib.h>

/* #define NO__DEPENDENCY_ON_LIMITS_H */

#if !defined NO__DEPENDENCY_ON_LIMITS_H
#include <limits.h>
# define byteWidth ( (size_t) CHAR_BIT )
# define byteMask ( (unsigned long) (unsigned char) ( (unsigned long) -1 ) )
# define ulBytes ( sizeof (unsigned long) ) /* a compile-time constant */
#else
unsigned byteWidth = 8; /* Not depending on limits.h . */
unsigned long byteMask = 0xFF;
unsigned ulBytes = 4; /* Sane throw-away initial values. */

void setULbytes( void )
{ int i = 0;
   unsigned char chroller = 1;
   unsigned char chMask = 1;
   unsigned long ulroller = 1;
   while ( chroller != 0 )
   { chroller <<= 1;
      chMask = ( chMask << 1 ) | 1;
      ++i;
   }
   byteMask = chMask;
   byteWidth = i;
   i = 0;
   while ( ulroller != 0 )
   { ulroller <<= 1;
      ++i;
   }
   ulBytes = i / byteWidth;
}
#endif

int putLSB( unsigned long ivalue, int early )
{ int i = 0;
   do
   { putchar( ivalue & byteMask );
      ++i;
      ivalue >>= byteWidth; /* match putMSB instead of hard-coding 8 */
   } while ( ( i < ulBytes ) && !( early && ( ivalue == 0 ) ) );
   return i;
}

int putMSB( unsigned long ivalue, int early )
{ int i = 0;
   do
   { putchar( ( ivalue >> ( ( ulBytes - 1 ) * byteWidth ) ) & byteMask );
      ++i;
      ivalue <<= byteWidth;
   } while ( ( i < ulBytes ) && !( early && ( ivalue == 0 ) ) );
   return i;
}

void fillch( int count, char ch )
{ while ( count-- > 0 )
   { putchar( ch );
   }
}

int printInteger( unsigned long ivalue, unsigned base )
{ char buffer[ 65 ];
   char * cpt = buffer + 65;
   * --cpt = '\0';
   if ( ( base < 2 ) || ( base > 36 ) ) /* base 0 or 1 would divide by zero or never end */
   { base = 10;
   }
   do
   { int ch = ivalue % base;
      ivalue /= base;
      ch += '0';
      if ( ch > '9' )
      { ch += 'A' - '9' - 1;
      }
      * --cpt = ch;
   } while ( ivalue > 0 );
   fputs( cpt, stdout );
   return 64 - ( cpt - buffer );
}

int main( int argc, char *argv[] )
{
   unsigned long my_integer = 123456;
   int index = 1;
   int length;

#if defined NO__DEPENDENCY_ON_LIMITS_H
   setULbytes();
#endif
   if ( argc > 1 )
   { char * endpt = argv[ 1 ];
      my_integer = strtoul( argv[ 1 ], &endpt, 0 );
      if ( endpt > argv[ 1 ] )
      { ++index;
      }
      else
      { my_integer = 123456;
      }
   }

   printf( "Data in memory: " );
   length = printInteger( my_integer, 10 );
   fillch( 32 - length, '\0' );
   length = printInteger( my_integer, 16 );
   fillch( 32 - length, '\0' );

   printf( "LSB1st early:   " );
   length = putLSB( my_integer, 1 );
   fillch( 16 - length, '-' );

   printf( "LSB1st full:    " );
   length = putLSB( my_integer, 0 );
   fillch( 16 - length, '-' );

   printf( "MSB1st early:   " );
   length = putMSB( my_integer, 1 );
   fillch( 16 - length, '-' );

   printf( "MSB1st full:    " );
   length = putMSB( my_integer, 0 );
   fillch( 16 - length, '-' );
   putchar( '\n' );

   return EXIT_SUCCESS;
}

[JMR201704281355:

This can be downloaded at

https://osdn.net/users/reiisi/pastebin/5027

A previous version at

~~https://osdn.net/users/reiisi/pastebin/5026~~

will eventually be taken off line.

]

Compile it with the usual

cc -Wall -o lsbmsb lsbmsb.c

and run it with something like

./lsbmsb | hexdump -C
./lsbmsb 1234567890 | hexdump -C
./lsbmsb 0x12345 | hexdump -C
./lsbmsb 0x12345 | hexdump # look at it two-byte.
./lsbmsb $(( 123456 * 256 )) | hexdump -C
# etc.

Be sure to leave the -C off a few times, to see what happens when it tries to interpret memory as a series of sixteen bit words instead of a series of eight bit bytes.
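What happens in the sixteen bit view can be sketched like this. The function name pair_as_lsb_words is hypothetical, and the sketch assumes the pairing is done the way an LSB first host would do it, with the first byte of each pair as the low half of the word.

```c
#include <assert.h>
#include <stddef.h>

/* pair_as_lsb_words is a hypothetical name for illustration.
// Combines each pair of bytes into a 16 bit word, taking the first byte of
// the pair as the low half. Returns the number of words produced.
// words must have room for ( len + 1 ) / 2 entries.
*/
size_t pair_as_lsb_words( const unsigned char * bytes, size_t len, unsigned * words )
{  size_t i;
   size_t n = 0;
   assert( bytes != NULL && words != NULL );
   for ( i = 0; i + 1 < len; i += 2 )
   {  words[ n++ ] = (unsigned) bytes[ i ] | ( (unsigned) bytes[ i + 1 ] << 8 );
   }
   if ( i < len )   /* odd byte left over: the high half reads as zero */
   {  words[ n++ ] = bytes[ i ];
   }
   return n;
}
```

The LSB first bytes 40 e2 01 00 come out as the words e240 0001, and the MSB first bytes 00 01 e2 40 come out as 0100 40e2. Neither reads as 0001e240.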

Hmm.

me@mycomputer:~/work/mathgames/eco101$ ./lsbmsb | hexdump -C
00000000 44 61 74 61 20 69 6e 20 6d 65 6d 6f 72 79 3a 20 |Data in memory: |
00000010 31 32 33 34 35 36 00 00 00 00 00 00 00 00 00 00 |123456..........|
00000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000030 31 45 32 34 30 00 00 00 00 00 00 00 00 00 00 00 |1E240...........|
00000040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000050 4c 53 42 31 73 74 20 65 61 72 6c 79 3a 20 20 20 |LSB1st early:   |
00000060 40 e2 01 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d |@..-------------|
00000070 4c 53 42 31 73 74 20 66 75 6c 6c 3a 20 20 20 20 |LSB1st full:    |
00000080 40 e2 01 00 00 00 00 00 2d 2d 2d 2d 2d 2d 2d 2d |@.......--------|
00000090 4d 53 42 31 73 74 20 65 61 72 6c 79 3a 20 20 20 |MSB1st early:   |
000000a0 00 00 00 00 00 01 e2 40 2d 2d 2d 2d 2d 2d 2d 2d |.......@--------|
000000b0 4d 53 42 31 73 74 20 66 75 6c 6c 3a 20 20 20 20 |MSB1st full:    |
000000c0 00 00 00 00 00 01 e2 40 2d 2d 2d 2d 2d 2d 2d 2d |.......@--------|
000000d0 0a                                                |.|
000000d1
me@mycomputer:~/work/mathgames/eco101$ ./lsbmsb | hexdump
0000000 6144 6174 6920 206e 656d 6f6d 7972 203a
0000010 3231 3433 3635 0000 0000 0000 0000 0000
0000020 0000 0000 0000 0000 0000 0000 0000 0000
0000030 4531 3432 0030 0000 0000 0000 0000 0000
0000040 0000 0000 0000 0000 0000 0000 0000 0000
0000050 534c 3142 7473 6520 7261 796c 203a 2020
0000060 e240 2d01 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d
0000070 534c 3142 7473 6620 6c75 3a6c 2020 2020
0000080 e240 0001 0000 0000 2d2d 2d2d 2d2d 2d2d
0000090 534d 3142 7473 6520 7261 796c 203a 2020
00000a0 0000 0000 0100 40e2 2d2d 2d2d 2d2d 2d2d
00000b0 534d 3142 7473 6620 6c75 3a6c 2020 2020
00000c0 0000 0000 0100 40e2 2d2d 2d2d 2d2d 2d2d
00000d0 000a
00000d1
me@mycomputer:~/work/mathgames/eco101$ ./lsbmsb 0x1E24000 | hexdump -C
00000000 44 61 74 61 20 69 6e 20 6d 65 6d 6f 72 79 3a 20 |Data in memory: |
00000010 33 31 36 30 34 37 33 36 00 00 00 00 00 00 00 00 |31604736........|
00000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000030 31 45 32 34 30 30 30 00 00 00 00 00 00 00 00 00 |1E24000.........|
00000040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000050 4c 53 42 31 73 74 20 65 61 72 6c 79 3a 20 20 20 |LSB1st early:   |
00000060 00 40 e2 01 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d |.@..------------|
00000070 4c 53 42 31 73 74 20 66 75 6c 6c 3a 20 20 20 20 |LSB1st full:    |
00000080 00 40 e2 01 00 00 00 00 2d 2d 2d 2d 2d 2d 2d 2d |.@......--------|
00000090 4d 53 42 31 73 74 20 65 61 72 6c 79 3a 20 20 20 |MSB1st early:   |
000000a0 00 00 00 00 01 e2 40 2d 2d 2d 2d 2d 2d 2d 2d 2d |......@---------|
000000b0 4d 53 42 31 73 74 20 66 75 6c 6c 3a 20 20 20 20 |MSB1st full:    |
000000c0 00 00 00 00 01 e2 40 00 2d 2d 2d 2d 2d 2d 2d 2d |......@.--------|
000000d0 0a                                                |.|
000000d1
me@mycomputer:~/work/mathgames/eco101$ ./lsbmsb 0x1E24000 | hexdump
0000000 6144 6174 6920 206e 656d 6f6d 7972 203a
0000010 3133 3036 3734 3633 0000 0000 0000 0000
0000020 0000 0000 0000 0000 0000 0000 0000 0000
0000030 4531 3432 3030 0030 0000 0000 0000 0000
0000040 0000 0000 0000 0000 0000 0000 0000 0000
0000050 534c 3142 7473 6520 7261 796c 203a 2020
0000060 4000 01e2 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d
0000070 534c 3142 7473 6620 6c75 3a6c 2020 2020
0000080 4000 01e2 0000 0000 2d2d 2d2d 2d2d 2d2d
0000090 534d 3142 7473 6520 7261 796c 203a 2020
00000a0 0000 0000 e201 2d40 2d2d 2d2d 2d2d 2d2d
00000b0 534d 3142 7473 6620 6c75 3a6c 2020 2020
00000c0 0000 0000 e201 0040 2d2d 2d2d 2d2d 2d2d
00000d0 000a
00000d1

Now you may be saying you'd rather not be looking at any of that, but if you really had to, if you had no choice but to look at one or the other, which would you rather look at? LSB1st or MSB1st? Remember, the numbers you are looking for will usually be mixed with text, and the text will likely help you find what you are looking for. If the text gets byte-swapped on you, it's going to be just that much harder.

The salesman says he has tools to let you look at the data, so you don't have to worry. That's all well and good, but it makes you dependent on the vendor, even when the vendor has time and budget to help you.

When the vendor doesn't have time or budget, wouldn't you rather be able to use simple tools, at any rate? -- as a start before you set to making your own tools?

Somebody usually pipes up with, "Well, if you guys would all join us Little-Endians, if everybody did it all the same, there'd be no problems!"

So. From now on, everyone does Little-Endian. Blogs? News aggregators? Textbooks? Novels? Are we going to go back and reprint all the classics with Little-Endian numbers?

71 September 7871?

No, of course not. Much easier to just become dependent on our vendor. I mean, we trust them, right? And they deserve a guaranteed revenue stream, too.

Somebody pipes up about now saying everything I'm talking about is human stuff, not technical at all.

The Unicode Consortium determined that they did not want to be caught up in the argument. So they decided that Unicode characters could be encoded either direction. They even figured out how to put a flag called the Byte Order Mark at the beginning of a stream of Unicode text, to warn the consumer of the stream what order to expect the characters in.
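For the sixteen bit encoding (UTF-16), the Byte Order Mark is the character U+FEFF, so the stream begins with the bytes FE FF when it is most significant first and FF FE when it is least significant first. A minimal sketch of checking for it, with names of my own invention and assuming the stream is known to be UTF-16:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names, for illustration only. */
enum utf16_order { UTF16_NO_BOM, UTF16_MSB_FIRST, UTF16_LSB_FIRST };

/* Look at the first two bytes of a UTF-16 stream for the Byte Order Mark. */
enum utf16_order detect_utf16_bom( const unsigned char * stream, size_t len )
{  assert( stream != NULL || len == 0 );
   if ( len >= 2 )
   {  if ( ( stream[ 0 ] == 0xFE ) && ( stream[ 1 ] == 0xFF ) )
      {  return UTF16_MSB_FIRST;
      }
      if ( ( stream[ 0 ] == 0xFF ) && ( stream[ 1 ] == 0xFE ) )
      {  return UTF16_LSB_FIRST;
      }
   }
   return UTF16_NO_BOM;   /* the consumer has to find out some other way */
}
```

Note that this adds one more property of the data that every consumer has to check before it can read the first character.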

Characters, you see, are not integers after all, contrary to the opinions of many a respected computer scientist. Little-Endian byte order enforces this factoid.

Well, the folks who defined the Internet decided they did not want to be reading data streams and crossing their eyes to read the IP addresses and other numeric data buried in the stream. So network byte order is the one that is easy to read, most significant first. If one hundred twenty-three thousand four hundred fifty-six is in the data stream, it shows up as 123456, not 654321.
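On POSIX systems, the conversion to network byte order is done with htonl() from <arpa/inet.h>. The wrapper below, put_network_u32, is a hypothetical name for a sketch of how a 32 bit value might be placed in a packet buffer.

```c
#include <arpa/inet.h>   /* htonl(), POSIX */
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* put_network_u32 is a hypothetical helper for illustration.
// Writes value into four bytes of packet in network byte order,
// most significant byte first, whatever the host CPU's own order is.
*/
void put_network_u32( unsigned char * packet, uint32_t value )
{  uint32_t wire = htonl( value );   /* host order to network order */
   assert( packet != NULL );
   memcpy( packet, &wire, sizeof wire );
}
```

Passing 123456 should leave 00 01 E2 40 in the buffer on either kind of host, which is the same layout as Example 4 MSB.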

In setting up the demonstrations of byte order differences, I went to some pain to demonstrate one big difference between the byte orders. If you are looking carefully at the dashes, you may see how least significant first allows you to optimize math. If you can track the presence of carries, you can stop adding small numbers to big ones as soon as the carries disappear.
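A sketch of that early-exit trick, for adding a one-byte value into a longer number stored least significant byte first. The name add_small_lsb_first is hypothetical, and the sketch assumes 8 bit bytes.

```c
#include <assert.h>
#include <stddef.h>

/* add_small_lsb_first is a hypothetical name for illustration.
// Adds small into the len-byte number at big (least significant byte first),
// stopping as soon as no carry remains. Returns the number of bytes touched.
// A carry out of the last byte is silently lost.
*/
size_t add_small_lsb_first( unsigned char * big, size_t len, unsigned char small )
{  unsigned carry = small;
   size_t i = 0;
   assert( big != NULL );
   while ( ( i < len ) && ( carry != 0 ) )
   {  unsigned sum = big[ i ] + carry;
      big[ i ] = (unsigned char) ( sum & 0xFF );
      carry = sum >> 8;   /* 0 or 1 from here on */
      ++i;
   }
   return i;
}
```

Adding 1 to 40 E2 01 00 touches only the first byte. Adding 1 to FF FF FF 00 touches all four, because the carry keeps going, and the loop has to test for the carry on every byte either way.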

Looks interesting, right?

Tracking the carries takes more processor time and resources than simply finishing the addition out. This is one of those false early optimizations that has historically killed a lot of software projects.

Worse, the programmer can look at one of these and think a particular case will never generate carries. This is almost always self-deception. The assumptions required to keep the carries from happening are almost always not valid in the end-user's context just often enough to cause hidden bugs of the integer overflow variety.

Isn't that strongly enough stated?

When we humans look at numbers, we perceive them as text. That allows us to do many things without consciously thinking of them, like move to the end or align them. CPUs have to do the same things with numerical text, as we can intuit by looking back at example 2.

When CPUs work with numbers, they have to figure out all sorts of things about the number which we subconsciously read from the text --

Is there a sign?
Is there a decimal point?
How big is the number?

If there is no text, they have no clue ... unless the programmer has already told them.

Here is perhaps the strongest argument against least significant first: It induces bugs into software.

Some people think it's a good thing to induce bugs into software. They think it guarantees their after-market revenue stream.

I think there will always be enough bugs without practicing dodgy false optimizations, but what do I know? I've wasted two days I didn't have tilting at this, erm, rainbow. (Or chasing this windmill, maybe?)

One of these days I'll get someone to pay me to design a language that combines the best of Forth and C. Then I'll be able to leap wide instruction sets with a single #ifdef, run faster than a speeding infinite loop with a #define, and stop all integer size bugs with a bare cast. And the first processor it will target will be a 32/64 bit version of the 6809 which will not be least significant byte first.

Friday, April 28, 2017
