Sergey Babkin on CEP and stuff: BCPL, B, and C

Have you ever wondered, what were the earlier BCPL and B languages? As it turns out, Bell Labs still hosts Dennis Ritchie's historic web pages (http://cm.bell-labs.co/who/dmr/) that include the descriptions of BCPL (http://cm.bell-labs.co/who/dmr/bcpl.pdf) and B (http://cm.bell-labs.co/who/dmr/kbman.pdf), and also the source code of a couple of early versions of the C compiler (http://cm.bell-labs.co/who/dmr/primevalC.html).

So for what I can tell, B and C aren't exactly the breaking points in the language itself, instead the language had gradually involved from BCPL (just as according to the stories there weren't any versions of Unix inside Bell Labs research, it was just snapshotted and packaged as versions for external consumption). BCPL is a very old language, from the times when the computers had widely differing character sets (and mostly in 6-bit encodings), so the language document defined the logical lexemes which got mapped to specific characters in each implementation. So you could see how they'd say "let's use curly braces for the block start and end", "let's skip the statements that are duplications to save memory" (BCPL had separate keywords for positive and negative conditions in the loops, and also the version of "if" statement with "else" was named "test"), "some more operators would be convenient", and get a Bell Labs dialect of BCPL that became B. And then B is almost the same as an early version of C.

It looks like the breaking points for "a new language" were in fact determined by the compiler back-end, the code generation. B went from direct assembly to "threaded code" (having nothing to do with the our modern idea of threads), which was apparently a popular concept in the 1970s (and still most widely known from the Forth language). The idea there is that the program is built of a sequence of subroutine addresses, each of them essentially an instruction of a virtual machine. But these instructions don't need parsing, all you need to do is to jump to the address and you get to the machine-code subroutine that implements this virtual instruction. Except that at the end of the subroutine it doesn't do a return, instead it does the jump to the address in the next virtual machine instruction. On PDP-11 this jump was done as

jmp *(r3)+

with r3 containing the program counter of the virtual machine. That's where the "threaded" in the name comes from, r3 is the "thread" that sews together the code fragments. The advantage is that the generated code is more compact than the direct machine code (a big deal for the very small memories of the day), and also as the interpreters go, it's a very fast one, losing to the direct machine code only by a factor of 3-5 or so. The idea of threaded code can still be useful for the asynchronous programming, as described in https://babkin-cep.blogspot.com/2025/02/asynchronous-programming-9-composition.html. And then C went back from the threaded code to the direct machine code generation.

Some interesting tidbits about BCPL and B:

* The lvalue and rvalue terms come all the way from the CPL (of which BCPL is a reduced version). There are no words "pointer" nor "reference" nor even "address" in the BCPL document, only "lvalue". It actually would be very difficult to understand without the prior knowledge of lvalue and rvalue terms, but my guess is that it builds on the CPL definition with a better explanation.

* BCPL is a word-based language where the addresses are of the words, and so there is no difference between integer and pointer arithmetic. So when implemented on PDP-11, a byte-addressed 16-bit machine, B still has lvalues as word addresses, shifted right by one bit from the machine addresses. So adding 1 to the address actually does add 1 to the number, and the opposite shift left by one bit is done only when dereferencing the pointer.

* The strings had both '\0' at the end and the 1-byte string length at the front.

* The arrays (naturally, of words, as this was the only supported type) used one more word before the array to store the address (lvalue) of the first word of the array, rather than using a constant for that. And that lvalue was also modifiable!

* The multi-character character constants that still exist in C originate in BCPL, where you could specify up to one word's worth of characters (which might be six).

* The "break" statement (but not "continue") existed in BCPL, disappeared in B, and then reappeared in C.

* BCPL had both symbolic constants and a preprocessor, both of which disappeared in B.

Sergey Babkin on CEP and stuff

Saturday, April 19, 2025

BCPL, B, and C

No comments:

Post a Comment

Links

About Me

Labels

Blog Archive