
Computer Comments

When it comes to computers, like many people, I have my "druthers". Of course, perhaps many of them are just matters of habit, as I started learning how to program on an IBM System/360 Model 67 at the University of Alberta, running the Michigan Terminal System.

"Big-Endian" Representation of Numbers

One of the things I liked about it was that it was big-endian; at the time, I took that for granted as the right way to design a computer.

Computers store data of many kinds in many formats. Here is what an area of memory containing data in character and packed decimal formats might look like on a big-endian computer:

1E00 | J | o | h | n |   | S | m | i |
      --- --- --- --- --- --- --- ---
       4A  6F  68  6E  20  53  6D  69

1E08 | t | h |   |   |   |   |   |   |
      --- --- --- --- --- --- --- ---
       74  68  20  20  20  20  20  20

1F00 |NUL|NUL|NUL|NUL|NUL|NAK|BEL| C |
      --- --- --- --- --- --- --- ---
       00  00  00  00  00  15  07  43

This area of memory might contain an employee record, showing that "John Smith" earns $1,507.43 per month.
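As a rough sketch of how such a dump is read, the following Python decodes the bytes shown above. The helper name `unpack_decimal` is hypothetical, and the code assumes, as the illustration does, that every half-byte of the salary field is a decimal digit and that the last two digits are cents.

```python
# Bytes copied from the dump above, in address order.
name_field = bytes.fromhex("4A6F686E20536D697468202020202020")
pay_field = bytes.fromhex("0000000000150743")

def unpack_decimal(field: bytes) -> int:
    """Read each half-byte as one decimal digit, most significant first."""
    value = 0
    for byte in field:
        high, low = byte >> 4, byte & 0x0F
        value = (value * 10 + high) * 10 + low
    return value

name = name_field.decode("ascii").rstrip()   # drop the trailing blanks
cents = unpack_decimal(pay_field)            # assumed: two implied decimals
print(name, f"${cents // 100:,}.{cents % 100:02d}")   # John Smith $1,507.43
```

Nothing has to be reordered before decoding: the digits come out of memory in the same order in which they would be written down.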

Binary data would just be "packed hexadecimal", in the sense that the two hexadecimal digits representing the contents of each byte would, when concatenated, give the stored number in hexadecimal rather than decimal notation.
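A minimal illustration of that point, using the value 1E3F42 as an arbitrary example: on a big-endian machine, gluing together the hexadecimal digits of successive bytes yields the same value as interpreting the bytes as a binary integer.

```python
stored = bytes.fromhex("001E3F42")   # four bytes, in address order

# Concatenate the two hex digits of each byte, exactly as a dump shows them.
digits = "".join(f"{b:02X}" for b in stored)
as_written = int(digits, 16)

# Interpret the same bytes as one big-endian binary integer.
as_number = int.from_bytes(stored, "big")

print(digits, as_written == as_number)   # 001E3F42 True
```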

While a core dump is never easy to read at the best of times, on a big-endian computer, data is written in its memory in a format at least similar to that in which it is written on a piece of paper.

On some computers, the Honeywell 316 being an example, when the ability to work with integers twice the computer's word length was added, the additional word containing the most significant part of the number was placed after the word containing the least significant part.

In this way, the computer could fetch the first words of two numbers to be added, add them, save the carry, and then proceed to the second words. This saved a few circuits.

Thus, hexadecimal number 1E3F42 would be stored as 3F42 001E instead of 001E 3F42.
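The word-at-a-time addition described above can be sketched as follows. This is only an illustration in Python of the general idea, not a model of any particular machine's circuitry; the function names are made up for the example.

```python
MASK = 0xFFFF  # one 16-bit word

def to_words_low_first(n: int) -> list:
    """Split a 32-bit value into two 16-bit words, least significant first."""
    return [n & MASK, (n >> 16) & MASK]

def add_double(a: list, b: list) -> list:
    """Add two low-word-first numbers one word at a time, saving the carry."""
    result, carry = [], 0
    for word_a, word_b in zip(a, b):   # visits the low words first
        total = word_a + word_b + carry
        result.append(total & MASK)
        carry = total >> 16
    return result

x = to_words_low_first(0x1E3F42)
print(" ".join(f"{w:04X}" for w in x))   # 3F42 001E

y = add_double(x, to_words_low_first(0x00C0BE))
print(" ".join(f"{w:04X}" for w in y))   # 0000 001F
```

Because the words are visited in address order, the carry out of the first addition is already available when the second pair of words is fetched.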

Then, the Digital Equipment Corporation came out with the PDP-11 computer. It was a 16-bit computer, and they included instructions for adding 32-bit numbers with it. They also placed the most significant word of the number second.

However, storing the letters of a word in the order 1234 and the bytes of a number in the order 3412 seemed ugly and inelegant.

So they decided that, when storing character data, they would store the first of two characters in the least-significant portion of a word.

Thus, the four letters of a word were stored in what I would view as the order 21 43, and the four bytes of a number were still stored in the order 34 12.

But suppose that, instead of viewing memory the way someone who is used to a big-endian computer, and who knew the PDP-11 had a 16-bit word length, would view it, one simply viewed 32-bit numbers as numbers whose bytes are always in the order 1234. Then those bytes would contain characters in the order 4321.

Or, if one looks at the characters as always having the order 1234, as they would be written out on a printer, on a screen, or to a tape drive, then the bytes of a number would always have the order 4321.
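The different orderings can be reproduced with a small experiment. The sketch below assumes a consistently little-endian memory, as described in the last two paragraphs, and dumps it in two ways: byte by byte, and as 16-bit words printed with the most significant byte first, which is how a big-endian habit would display them. The helper names are hypothetical.

```python
import struct

text = b"ABCD"                           # characters 1, 2, 3, 4
number = struct.pack("<I", 0x01020304)   # a number written 01 02 03 04 on paper

def byte_view(mem: bytes) -> str:
    """Dump one byte at a time, in address order."""
    return " ".join(f"{b:02X}" for b in mem)

def word_view(mem: bytes) -> str:
    """Dump as 16-bit words, each printed most significant byte first."""
    words = struct.unpack(f"<{len(mem) // 2}H", mem)
    return " ".join(f"{w:04X}" for w in words)

print(word_view(text))     # 4241 4443   characters appear as 21 43
print(word_view(number))   # 0304 0102   number bytes appear as 34 12
print(byte_view(text))     # 41 42 43 44 characters appear as 1234
print(byte_view(number))   # 04 03 02 01 number bytes appear as 4321
```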

This was less bad than 3412. Because DEC made less expensive computers than IBM, because it was identified with the independent-thinking "little guy", and because UNIX was first developed on the PDP-11, a large number of people got used to the little-endian way of doing things, and thus this method of representing numbers was found on the 8080 (and later the 8086 and 80386) from Intel and the 6502 from MOS Technology. The 16032 from National Semiconductor was a particularly pure little-endian design. The 6800 (and later the 68000) from Motorola and the 9900 from Texas Instruments, on the other hand, were big-endian.

Of course, many of these computers fetched data a byte at a time, and even on those that fetch data in larger chunks, the characters really are in the order 1234 and the bytes of a number in the order 4321.

It should be noted, though, that while I think that big-endian is better, this is true only for cultural reasons, not absolute reasons. Arithmetic is slightly simplified with a little-endian representation; on the other hand, with a big-endian representation, string comparisons and comparisons of unsigned integers are similar.
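The remark about comparisons can be checked directly. In this small sketch, comparing the stored bytes one at a time from the lowest address, as a string comparison does, agrees with the numeric comparison only for the big-endian form; the values 255 and 256 are arbitrary.

```python
import struct

a, b = 255, 256

# Python compares bytes objects lexicographically, like a string compare.
big_agrees = struct.pack(">I", a) < struct.pack(">I", b)
little_agrees = struct.pack("<I", a) < struct.pack("<I", b)

print(big_agrees)      # True:  00 00 00 FF sorts before 00 00 01 00
print(little_agrees)   # False: FF 00 00 00 sorts after  00 01 00 00
```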

In Arabic, while text is written from right to left, the most significant digit of a number is still on the left side. Hence, if we assume that a core dump will also run from right to left, with a header line like

F E D C B A 9 8 7 6 5 4 3 2 1 0

then both text and numeric fields will have their conventional appearance for a little-endian architecture.
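As a small sketch of that idea, the following prints one row of a dump with address 0 at the right-hand edge. The function name and the eight-byte row length are assumptions made for the example.

```python
import struct

# A little-endian 32-bit number at address 0, followed by four zero bytes.
memory = struct.pack("<I", 0x1E3F42) + bytes(4)

def dump_right_to_left(mem: bytes) -> str:
    """Lay out one row of a dump with the lowest address on the right."""
    return " ".join(f"{b:02X}" for b in reversed(mem))

print(" ".join(f"{i:2X}" for i in reversed(range(len(memory)))))  # header row
print(dump_right_to_left(memory))   # 00 00 00 00 00 1E 3F 42
```

Read in the usual way for numbers, the row shows 1E3F42 with its most significant digit on the left, even though the least significant byte sits at the lowest address.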

Of course, text in European languages would look funny, but the same would be true of Arabic-language text in a core dump with memory locations running from left to right from a Western computer, whether little-endian or big-endian. But when considering the appropriateness of a computer architecture for a particular society, I am considering its own language, not foreign languages.

Naturally, I am assuming that when bytes representing printable characters are transmitted to the printer, they correspond to characters being printed from right to left, and that subroutines for converting numbers to printable form will store the least significant digit in the part of the output string with the lowest address.

I am well aware that it is uncommon to find a computer every aspect of which, from the hardware and the operating system to the programming languages used, is defined around the conventions of a language other than English, but I find that to be a regrettable market failure.

Pascal Strings and Sequential Files

Even more common than little-endian numeric representation on today's computers is another difference between them and the System/360 model 67 I knew and loved.

On an IBM PC, a line of text might be stored like this:

This is a record.<CR><LF>

On a Macintosh, that same line of text would be stored like this:

This is a record.<CR>

And under UNIX, that same line of text would be stored this way:

This is a record.<LF>

Although these three formats are all different, and to me the one from the Macintosh is the appropriate one of this class of formats, they all have something in common of which I disapprove.

The end of a line of text is shown by actual ASCII control characters embedded in the file.

This means that I can't just save binary data in a file by using a statement like:

PRINT #2, MKI$(A%);NAME$

since if the number A% happens to be 13, then the record would start with <NUL><CR>, and the CR might be interpreted as the end of the record. This is why my preference is for the other method of handling text files which I describe below.
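The failure can be imitated in a few lines of Python. This is only a sketch: it assumes the 16-bit integer is stored most significant byte first (matching the <NUL><CR> above) and that each record ends with a single CR, and `make_record` is a hypothetical stand-in for the BASIC statement.

```python
import struct

def make_record(a: int, name: str) -> bytes:
    """Stand-in for PRINT #2, MKI$(A%);NAME$ (byte order is an assumption)."""
    return struct.pack(">h", a) + name.encode("ascii") + b"\r"

stream = make_record(13, "SMITH") + make_record(14, "JONES")

# A reader that trusts CR as the end-of-record mark.
records = stream.split(b"\r")[:-1]

print(len(records))   # 3, although only two records were written
print(records)        # [b'\x00', b'SMITH', b'\x00\x0eJONES']
```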

On the System/360 model 67 under MTS, text files could be of two types, SEQUENTIAL and LINE. In a SEQUENTIAL file, every line began with two bytes giving the length of the line. Records were limited to 32767 bytes in length, but they could contain any characters, because the length of the record was indicated outside of the text of the record, in two bytes which preceded the record, but which were handled internally by the operating system. Only the text of the record itself was presented to user programs.
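A length-prefixed file of this general kind might be sketched as below. This is not a description of the actual MTS on-disk layout, whose details are not given here; the big-endian two-byte prefix and the function names are assumptions for illustration.

```python
import io
import struct

MAX_RECORD = 32767

def write_record(f, data: bytes) -> None:
    """Write a two-byte length, then the record itself."""
    if len(data) > MAX_RECORD:
        raise ValueError("record too long")
    f.write(struct.pack(">H", len(data)) + data)

def read_records(f):
    """Yield only the text of each record; the lengths stay hidden."""
    while True:
        header = f.read(2)
        if len(header) < 2:
            return
        (length,) = struct.unpack(">H", header)
        yield f.read(length)

buf = io.BytesIO()
write_record(buf, b"\x00\rSMITH")          # contains a CR, which is harmless
write_record(buf, b"This is a record.")
buf.seek(0)
records = list(read_records(buf))
print(records)   # [b'\x00\rSMITH', b'This is a record.']
```

Since the record boundary is never looked for inside the data, any byte value, including CR, LF, or NUL, can appear in a record.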

A LINE file was a special case of an ISAM file. And an ISAM file was a keyed random access file.

In other words, an ISAM file was really a database table, with two fields.

One field was the line number. That was a 32-bit integer, scaled down by a factor of 1,000. Thus, its value could vary from -2,147,483.648 to 2,147,483.647. (Microsoft Access has a similar data type, the currency data type, which is a 64-bit integer, scaled down by a factor of 10,000.) The file was indexed on that field.
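The range quoted above follows from the scaling, as this short sketch shows. The function name is hypothetical, and the comma grouping is only there to match the figures in the text.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
SCALE = 1000   # the stored integer is the line number times 1,000

def format_line_number(raw: int) -> str:
    """Show a stored 32-bit integer as a line number with three decimals."""
    sign = "-" if raw < 0 else ""
    whole, thousandths = divmod(abs(raw), SCALE)
    return f"{sign}{whole:,}.{thousandths:03d}"

print(format_line_number(INT32_MIN))   # -2,147,483.648
print(format_line_number(INT32_MAX))   # 2,147,483.647
print(format_line_number(1500))        # 1.500
```

Keeping the value as an integer means line numbers such as 1.5 can be compared and indexed exactly, with no floating-point rounding.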

The other field was the text of the record. This was a memo field. Originally, this field could contain from 0 to 255 characters.

Thus, the operating system actually contained a standard database engine, available to applications. Except for LINE files, which were the default text file format, one would need to program in COBOL or in PL/I to request an ISAM file and specify its fields.

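Strings inside programs show the same two approaches. In C, the end of a string is marked within the data itself, much as a control character marks the end of a line in the PC, Macintosh, and UNIX files: a byte containing the numeric value zero in its bits, or a NUL character, follows the last character belonging to the text of the string.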

In Pascal (at least in dialects that provide a string type, such as UCSD Pascal and Turbo Pascal), byte zero of a string contains its length.

In both C and Pascal, strings are arrays of characters; the third character of the string S is S[2] in C and S[3] in Pascal. Strings are declared as having a maximum length, and storage is reserved for the individual string for its full maximum length.
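The two layouts can be pictured as byte arrays. The sketch below builds both in Python; the function names are hypothetical, and filling the unused storage with zero bytes is an assumption made so the example is deterministic.

```python
def c_string(text: str, capacity: int) -> bytes:
    """Characters, then a NUL terminator, in a fixed amount of storage."""
    data = text.encode("ascii")
    if len(data) >= capacity:
        raise ValueError("no room for the terminating NUL")
    return data + b"\x00" * (capacity - len(data))

def pascal_string(text: str, max_len: int) -> bytes:
    """Length in byte zero, then the characters, in fixed storage."""
    data = text.encode("ascii")
    if not len(data) <= max_len <= 255:
        raise ValueError("string does not fit")
    return bytes([len(data)]) + data + b"\x00" * (max_len - len(data))

c = c_string("John", 8)
p = pascal_string("John", 8)

print(chr(c[2]), chr(p[3]))   # h h   the third character in each layout
print(c.index(0), p[0])       # 4 4   length found by scanning vs. stored
```

In the C layout the length has to be found by searching for the terminator, and the text itself can never contain a zero byte; in the Pascal layout the length is read directly, but cannot exceed 255.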

In BASIC (as extended to include strings by Microsoft), the length of a string is also stored outside a string, but invisibly. Strings belong to a dynamically allocated memory pool, and to access the third character of a string S$ one uses the function MID$( S$, 3, 1 ).

In FORTRAN (here meaning WATFIV, not FORTRAN IV with IBM extensions, as I mistakenly thought), variables of type CHARACTER*n always contained n characters, but one could store a shorter word or name in such a variable by padding it with trailing blanks (that is, printable space characters). Such a variable was treated as a simple variable, not an array; if one wished to access the third character of a CHARACTER*16 variable NAME, one would just declare both it and a 16-element array of CHARACTER*1 values, called LETTER, like this:

C EXAMPLE OF CHARACTER DECLARATION IN FORTRAN
      CHARACTER*16  NAME
      CHARACTER*1   LETTER(16)
      EQUIVALENCE   (NAME, LETTER)

and then one could refer to the third character in NAME by referring to LETTER(3).
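For readers unfamiliar with EQUIVALENCE, a loose analogy in Python is two names that refer to the same storage. The sketch below is only an analogy, not a translation of the FORTRAN; the sample name SMITH is arbitrary.

```python
# CHARACTER*16 NAME, holding a shorter name padded with trailing blanks.
name = bytearray(b"SMITH".ljust(16))

# A second view of the same sixteen bytes, playing the part of LETTER.
letter = memoryview(name)

third = chr(letter[2])     # LETTER(3); Python counts from 0, FORTRAN from 1
print(third)               # I

letter[0] = ord("B")       # storing through one name changes the other
print(name.decode("ascii").rstrip())   # BMITH
```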

