Misunderstanding Computers

Why do we insist on seeing the computer as a magic box for controlling other people?
Why do we want so much to control others when we won't control ourselves?

Computer memory is just fancy paper, CPUs are just fancy pens with fancy erasers, and the network is just a fancy backyard fence.

(original post -- defining computers site)

Saturday, October 7, 2017

Languages in a Common Character Code for Information Interchange

Having said a bit about why I want to re-invent Unicode (so to speak), I want to rant a little about the overall structure, relative to languages, that I propose for this Common Code for Information Interchange, as I am calling it.

I've talked a little about the goals, and the structure, in the past. Much of what I said there I still consider valid, but here I want to take a different approach and look from the outside in a bit.

First, I plan the encoding to be organized in an open-ended way, the primary reason being that language is always changing.

Second, there will be a small subset devoted primarily to the technical needs of encoding and parsing, which I will describe in more detail in a separate rant.

Third, there will be an international or interlocality context or subset, which will be relatively small, and will attempt to include enough of each current language for international business and trade. This will appear to be a subset of Unicode, but will not be a proper subset. I have not defined much of this, but I will describe what I can separately.

Parsing rules for this international subset will be as simple as possible, which means that they will depart, to some extent at least, from the rules of any particular local context.

Third, part two, there will be spans allocated for each locality within which context-local parsing and construction rules will operate.

Fourth, there will be room in each span for expansion, and rules to enable the expansion. Composition will be one such set of rules, and there will be room for dynamically allocating single code points for composed characters used in a document.

The methods of permanently allocating common composed characters should reflect the methods of temporary allocation.
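To make the temporary side of that concrete, here is a minimal sketch in C of a per-document allocation table for composed characters. The expansion base, the table sizes, and the names are placeholders of my own invention, not anything that has been specified:

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical base of a document-local expansion span (placeholder only). */
    #define EXPANSION_BASE   0x0F0000uL
    #define MAX_DOC_COMPOSED 256
    #define MAX_PARTS        8

    /* One dynamically allocated composed character: the sequence of component
       code points it was built from, and the code point assigned to it. */
    typedef struct {
        uint32_t parts[MAX_PARTS];
        int      part_count;
        uint32_t assigned;
    } composed_entry;

    typedef struct {
        composed_entry entries[MAX_DOC_COMPOSED];
        int            count;
    } doc_composition_table;

    /* Return the document-local code point for a composition sequence,
       allocating a new one from the expansion span if it has not been seen. */
    uint32_t allocate_composed(doc_composition_table *t,
                               const uint32_t *parts, int part_count)
    {
        for (int i = 0; i < t->count; i++) {
            composed_entry *e = &t->entries[i];
            if (e->part_count == part_count &&
                memcmp(e->parts, parts, part_count * sizeof parts[0]) == 0)
                return e->assigned;   /* already allocated in this document */
        }
        if (t->count >= MAX_DOC_COMPOSED || part_count > MAX_PARTS)
            return 0;                 /* no room; caller keeps the raw sequence */
        composed_entry *e = &t->entries[t->count];
        memcpy(e->parts, parts, part_count * sizeof parts[0]);
        e->part_count = part_count;
        e->assigned   = EXPANSION_BASE + (uint32_t)t->count;
        t->count++;
        return e->assigned;
    }

Permanent allocation would then amount to promoting compositions that show up often enough, out of tables like this one, into registered code points, with the same composition record serving as the definition.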

Fifth, as much as possible, existing encodings will be included by offset. For instance, the JIS encoding will exist as a span starting at some multiple of 65536, which I have not yet determined, and the other "traditional" encodings will also have spans at offsets that are multiples of 65536. The rules for parsing will change for each local span.
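Mechanically, inclusion by offset can be as simple as adding and subtracting a span base. A rough sketch in C, with a placeholder base since the actual multiple of 65536 is, as noted, not yet determined:

    #include <stdint.h>

    /* Hypothetical span bases, each a multiple of 65536; placeholders only. */
    #define SPAN_SIZE     0x10000uL
    #define JIS_SPAN_BASE (7uL * SPAN_SIZE)

    /* Map a 16-bit code from the legacy encoding into the unified code space. */
    static inline uint32_t from_jis(uint16_t jis_code)
    {
        return JIS_SPAN_BASE + jis_code;
    }

    /* Map back, if the unified code point falls inside the JIS span. */
    static inline int to_jis(uint32_t code, uint16_t *jis_code)
    {
        if (code < JIS_SPAN_BASE || code >= JIS_SPAN_BASE + SPAN_SIZE)
            return 0;                         /* not in the JIS span */
        *jis_code = (uint16_t)(code - JIS_SPAN_BASE);
        return 1;
    }

The point is that text already encoded in the legacy set round-trips without loss; what changes from span to span is the set of parsing rules attached to it.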

I've thought about giving Unicode a span, but am not currently convinced it is possible.

Of course, this means the encoding is assumed to require more code points than will fit comfortably in four bytes after UTF-8 compression.
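It's worth remembering that the original UTF-8 design, before RFC 3629 cut it back to four bytes, already defined five- and six-byte sequences covering up to 31 bits, so a larger code space does not by itself require abandoning the UTF-8 transformation. A sketch of that longer form (the function name is mine):

    #include <stdint.h>
    #include <stddef.h>

    /* Encode a code point of up to 31 bits using the original (pre-RFC 3629)
       UTF-8 scheme, which allowed sequences of up to six bytes.
       Returns the number of bytes written, or 0 if the value does not fit. */
    size_t utf8_encode_31bit(uint32_t cp, uint8_t out[6])
    {
        if (cp < 0x80) {                      /* 1 byte, 7 bits */
            out[0] = (uint8_t)cp;
            return 1;
        } else if (cp < 0x800) {              /* 2 bytes, 11 bits */
            out[0] = 0xC0 | (uint8_t)(cp >> 6);
            out[1] = 0x80 | (uint8_t)(cp & 0x3F);
            return 2;
        } else if (cp < 0x10000) {            /* 3 bytes, 16 bits */
            out[0] = 0xE0 | (uint8_t)(cp >> 12);
            out[1] = 0x80 | (uint8_t)((cp >> 6) & 0x3F);
            out[2] = 0x80 | (uint8_t)(cp & 0x3F);
            return 3;
        } else if (cp < 0x200000) {           /* 4 bytes, 21 bits */
            out[0] = 0xF0 | (uint8_t)(cp >> 18);
            out[1] = 0x80 | (uint8_t)((cp >> 12) & 0x3F);
            out[2] = 0x80 | (uint8_t)((cp >> 6) & 0x3F);
            out[3] = 0x80 | (uint8_t)(cp & 0x3F);
            return 4;
        } else if (cp < 0x4000000) {          /* 5 bytes, 26 bits */
            out[0] = 0xF8 | (uint8_t)(cp >> 24);
            out[1] = 0x80 | (uint8_t)((cp >> 18) & 0x3F);
            out[2] = 0x80 | (uint8_t)((cp >> 12) & 0x3F);
            out[3] = 0x80 | (uint8_t)((cp >> 6) & 0x3F);
            out[4] = 0x80 | (uint8_t)(cp & 0x3F);
            return 5;
        } else if (cp < 0x80000000uL) {       /* 6 bytes, 31 bits */
            out[0] = 0xFC | (uint8_t)(cp >> 30);
            out[1] = 0x80 | (uint8_t)((cp >> 24) & 0x3F);
            out[2] = 0x80 | (uint8_t)((cp >> 18) & 0x3F);
            out[3] = 0x80 | (uint8_t)((cp >> 12) & 0x3F);
            out[4] = 0x80 | (uint8_t)((cp >> 6) & 0x3F);
            out[5] = 0x80 | (uint8_t)(cp & 0x3F);
            return 6;
        }
        return 0;                             /* needs more than 31 bits */
    }

Whether that older, longer form (or something like it) is the right transfer format here is a separate question; the sketch is only to show that the four-byte limit is a policy decision, not a structural one.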

And thinking of UTF-8 brings me to the next rant.

A Common Code for Information Interchange

I've been thinking about this topic since I first heard of plans for what became Unicode, back in the mid-1980s.

At the time, there were many in the industry who still thought that 64K of RAM should be enough for general personal computing, and there were many people who thought 65,536 should be enough characters to cover at least all modern languages. I tried to tell as many people as I could that Japanese alone had more than 20,000 characters and that Chinese had an estimated count in the range of a hundred thousand, and no one believed me. I also tried to tell people that we shouldn't conflate Japanese and Chinese characters, but I had a hard time convincing even myself of that.

I also tried to tell people that the Kanji radicals should be encoded first, but the Japanese standards organization wasn't doing that, so why should anyone believe it?

As I noted recently, some of the problems of the current approach to unifying the world's character sets are becoming obvious.

Each natural human language is its own context. And each language has its own set of sub-contexts which we call dialects. But neither the contexts nor the sub-contexts nor the sets of sub-contexts are well-defined, mathematically speaking, which means that mathematical methods cannot be used to perfectly parse any natural language.

Therefore, programs which work with human languages are necessarily buggy. That is, we know that, no matter how carefully they are constructed, they will always contain errors.

When we combine language contexts, we combine error rates with error rates, and the result is at best multiplicative. It is not simply additive. So we really should not want to do that. But that's what Unicode tries to do -- combine character codes for all relevant languages into one over-arching set.

Actually, for all my pessimism, Unicode works about as well as we should expect it to. I just want something better, but it's hard to describe exactly what that something is. This rant is an attempt to do so.

With just US English, it's fairly easy to construct a text editor. Parsing the entered text requires around ten simple functions, and visual formatting less than ten more. Word processors are fairly straightforward, as well.
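To give a feel for how small that set is, here is the flavor of the classification and scanning helpers an ASCII-only editor needs. The names are mine and the list is not complete, but nothing in it gets more complicated than this:

    #include <stddef.h>

    /* Character classification for plain US-ASCII text: each test is a
       comparison or two, with no tables and no multi-byte state. */
    static int is_line_break(char c) { return c == '\n' || c == '\r'; }
    static int is_blank(char c)      { return c == ' '  || c == '\t'; }
    static int is_word_char(char c)
    {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
               (c >= '0' && c <= '9');
    }

    /* Advance past the current word; one byte is always one character,
       so cursor movement is plain pointer arithmetic. */
    static size_t next_word_end(const char *text, size_t pos, size_t len)
    {
        while (pos < len && is_word_char(text[pos]))
            pos++;
        return pos;
    }

    /* Find the start of the next line. */
    static size_t next_line_start(const char *text, size_t pos, size_t len)
    {
        while (pos < len && !is_line_break(text[pos]))
            pos++;
        while (pos < len && is_line_break(text[pos]))
            pos++;
        return pos;
    }

Every one of these assumes one byte per character, one character per column, and left-to-right layout. Those are exactly the assumptions Unicode takes away.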

With Unicode, a simple text editor requires more like a hundred functions, interacting in ways that are anything but obvious.

And if you need to rely on what you read in the text, as I noted in the rant linked above, you find that displaying the text reliably adds significantly more complexity.
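A small taste of why: the same visible "é" can be stored either as one precomposed code point or as "e" followed by a combining accent, and a plain byte comparison says the two differ. Assuming a UTF-8 terminal:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* Both strings display as "é" on a UTF-8 terminal. */
        const char *precomposed = "\xC3\xA9";   /* U+00E9 LATIN SMALL LETTER E WITH ACUTE */
        const char *decomposed  = "e\xCC\x81";  /* U+0065 followed by U+0301 COMBINING ACUTE ACCENT */

        printf("%s vs %s: %s\n", precomposed, decomposed,
               strcmp(precomposed, decomposed) == 0 ? "equal" : "not equal");
        /* Prints "not equal"; a correct editor has to normalize before comparing. */
        return 0;
    }

So even answering "is this the word I typed?" requires a normalization pass, and that is before getting into characters that merely look alike.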

Actually, US English is almost unnaturally simple to parse (relatively speaking). That's why it has been adopted over French, Spanish, Russian, and German, why you don't hear much about Japanese plans to make Japanese the international language, and why the Chinese Communist Party's dreams of making Chinese the international language will just never fly, no matter how significant a fraction of the world's population ostensibly speaks Chinese as a native or second language.

Memorizing 9000+ characters for basic literacy requires starting at the age of two, I hear.

The Chinese may claim a full third of the world's population, but the other two thirds are not going to happily and willingly accept being forced to propagandize their children (or themselves) with that many characters just to be literate. That alone is oppressive enough to prevent a productive peace.

Even the Japanese subset of about two thousand characters for school literacy basically requires all twelve years of primary and secondary school to complete.

If we could reduce that burden by teaching the radicals first (we Westerners call the sub-parts of Kanji "radicals"), we might have some hope of addressing the difficulty, but the radicals themselves are an added layer of parsing. That's multiplicative complexity, which is one of the reasons that approach has not been successful as a general approach. (It is taught, I understand, in some schools of Japanese calligraphy, but that is not a large fraction of the population.)

And the rules for assembling and parsing the radicals are anything but simple.

Now, you may be wondering why I think the radicals should be prioritized in the encoding, but the dirty secret of Kanji is that they are not a closed set, any more than English vocabulary is a closed set. Every now and then someone invents a new one.

Methods to address new coinage must be part of the basic encoding.
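One way an encoding could leave that door open is to let a character that has no assigned code point travel as an explicit composition of its radicals. Everything in the following sketch -- the marker, the layout tags, the calling convention -- is my own invention for illustration, not part of any existing standard:

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical control code point meaning "the next few code points
       compose one character" (placeholder value). */
    #define COMPOSE_MARKER 0x0E0000uL

    /* Hypothetical layout operators describing how the parts are arranged. */
    enum layout { LAYOUT_LEFT_RIGHT = 1, LAYOUT_TOP_BOTTOM = 2, LAYOUT_ENCLOSE = 3 };

    /* Emit a newly coined character as a composition sequence:
       marker, part count, layout tag, then the radical code points. */
    size_t emit_coined_char(uint32_t *out, size_t cap, enum layout how,
                            const uint32_t *radicals, size_t n)
    {
        size_t need = 3 + n;
        if (cap < need)
            return 0;                 /* not enough room in the output buffer */
        out[0] = COMPOSE_MARKER;
        out[1] = (uint32_t)n;
        out[2] = (uint32_t)how;
        for (size_t i = 0; i < n; i++)
            out[3 + i] = radicals[i];
        return need;
    }

A reader that has never seen the coined character can still draw an approximation from the parts, and a later revision of the code could promote a common coinage to a single code point, with the composition record serving as its definition.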

This is getting long, so I think I'll wrap up this rant on my motivations for considering something to supersede Unicode here.

I wrote up a summary list of overall goals about three years back, here.

As I've said elsewhere, Unicode has served a purpose until now, and will continue to do so for a few more years, but we need something better.

It needs to provide better separation for the contexts of languages.