Character encodings and subtle distinctions

Working effectively with character encodings as a software developer requires keeping in mind some subtle distinctions. Unfortunately, most human brains seem to be rather poor at working with subtle distinctions.

To deal with character encodings correctly, you need to bear in mind the difference between a character set, which assigns a number (a code point) to each character, and a character encoding, which defines how those numbers are represented as bytes. In theory this should be quite a straightforward distinction to keep in mind, but my experience of seeing other developers frequently introduce bugs in their code when dealing with character encodings tells me that it isn’t all that easy.
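To make the distinction concrete, here is a minimal sketch in Python (my choice of language for illustration; the variable names are made up for this example). It takes one character, looks up its number in the Unicode character set, and then shows that the same number turns into different bytes under different encodings.

```python
# One character, one number in the character set, several byte representations.
ch = "é"

# The character set (Unicode) assigns this character a single number.
code_point = ord(ch)                 # 233, written U+00E9

# An encoding decides how that number is laid out in bytes.
as_utf8 = ch.encode("utf-8")         # two bytes: 0xC3 0xA9
as_latin1 = ch.encode("latin-1")     # one byte: 0xE9
as_utf16 = ch.encode("utf-16-be")    # two bytes: 0x00 0xE9

print(f"U+{code_point:04X}")         # U+00E9
print(as_utf8.hex(), as_latin1.hex(), as_utf16.hex())  # c3a9 e9 00e9
```

The number never changes; only the bytes do. Decoding any of the three byte strings with the matching encoding gives back the same character.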

For Joel Spolsky, pointers seem to trigger a similar sort of problem:

I’ve come to realize that understanding pointers in C is not a skill, it’s an aptitude. […] For some reason most people seem to be born without the part of the brain that understands pointers. Pointers require a complex form of doubly-indirected thinking that some people just can’t do, and it’s pretty crucial to good programming.

A part of this is generational. For Spolsky’s generation, pointers were the problem that exposed this inability. People my age don’t do much programming in C or C++, so for us it was character encodings. Now that UTF-8 is assumed pretty much everywhere, there will probably be some other class of problem that trips people up in this way.

So there’s a question with regard to character encodings: did those developers not understand the distinction between character sets and encodings, or did they fail to keep it in mind? So far I’ve been assuming it’s the latter; that they would know in the abstract, but then fail to apply that knowledge. But I could be wrong.

My assumption is that when you’re dealing with a character, the ordinal value of that character and the byte representation of that ordinal value are so close in concept-space that the two end up smushed together in your mind. There’s a sort of compression that takes place quite naturally between the two, with the result that you fail to keep track of them as separate things.
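One plausible reason the two get smushed together is that for plain ASCII text they really are numerically identical, so nothing ever forces you to tell them apart. The Python sketch below is only an illustration of that idea; the helper name looks_identical is hypothetical and not from any library.

```python
def looks_identical(ch: str) -> bool:
    """Return True when the UTF-8 form of ch is a single byte equal to its ordinal."""
    encoded = ch.encode("utf-8")
    return len(encoded) == 1 and encoded[0] == ord(ch)

# For ASCII the ordinal and the byte coincide, which hides the distinction.
print(looks_identical("A"))    # True
# Outside ASCII they diverge: ordinal 233, but two bytes in UTF-8.
print(looks_identical("é"))    # False

# Counting characters and counting bytes stop agreeing too.
word = "naïve"
print(len(word), len(word.encode("utf-8")))   # 5 6

# The classic bug: bytes written as UTF-8, then read back as Latin-1.
garbled = "é".encode("utf-8").decode("latin-1")
print(garbled)                 # Ã©
```

Code written and tested only against ASCII input will pass while treating ordinals and bytes as the same thing, and only breaks once a character like "é" turns up.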