C is "portable assembler". If I do something like if (foo = bar) { } I know that...

Xurinos · on March 25, 2012

This just isn't true. C is not portable assembler. It was never intended to be. I hear this claimed often, and it is wrong every time: C is not a low-level language close to assembler. You can make some roughly reasonable assumptions about what comes out of the compiler, but often it is not what you think it is.

Let's challenge this specific claim, that when you do "if (foo == bar)" -- I corrected the syntax error, which is a symptom of C's high-level syntax and not of the underlying assembly code -- you compare one value to another and then jump. For this challenge, I will write some trivial code that we should be able to make easy assumptions about, and I will compile it with debugging enabled so that I can dump the results with gdb.

  $ gcc -g example.c

  #include <stdio.h>

  int main() {
     int foo = 10;
     int bar = 20;
     if (foo == bar) {
        printf("Fun\n");
     }
     return 0;
  }

  Dump of assembler code for function main:
  0x0000000100000ef8 <main+0>:    push   rbp
  0x0000000100000ef9 <main+1>:    mov    rbp,rsp
  0x0000000100000efc <main+4>:    sub    rsp,0x10
  0x0000000100000f00 <main+8>:    mov    DWORD PTR [rbp-0x4],0xa
  0x0000000100000f07 <main+15>:   mov    DWORD PTR [rbp-0x8],0x14
  0x0000000100000f0e <main+22>:   mov    eax,DWORD PTR [rbp-0x4]
  0x0000000100000f11 <main+25>:   cmp    eax,DWORD PTR [rbp-0x8]
  0x0000000100000f14 <main+28>:   jne    0x100000f22 <main+42>
  0x0000000100000f16 <main+30>:   lea    rdi,[rip+0x19]        # 0x100000f36
  0x0000000100000f1d <main+37>:   call   0x100000f30 <dyld_stub_puts>
  0x0000000100000f22 <main+42>:   mov    eax,0x0
  0x0000000100000f27 <main+47>:   leave  
  0x0000000100000f28 <main+48>:   ret

We see that in the very basic version of this code, with absolutely no optimizations and doing the silliest things that we can, we store our two values into some memory locations, perform a comparison (cmp), and jump if not equal (jne). The jump target, main+42, lies just past the puts() call, so the call is skipped whenever the values differ. Note that even here the compiler has already replaced our printf with puts.

Now, let's get smarter. The variables foo and bar do not change value, and we only work with two variables in the routine. Therefore, we could optimize by storing those values in temporary registers instead of using expensive memory transfers. Further, since we are comparing two constants and the comparison always evaluates to false, we actually have a section of code -- the printf -- that is dead code and can be completely removed from the final compilation. Well, that's simple, and everyone who uses C in production at least turns on some minor optimization:

  $ gcc -g -O1 example.c  # the only difference is the -O1

  Dump of assembler code for function main:
  0x0000000100000f34 <main+0>:    push   rbp
  0x0000000100000f35 <main+1>:    mov    rbp,rsp
  0x0000000100000f38 <main+4>:    mov    eax,0x0
  0x0000000100000f3d <main+9>:    leave  
  0x0000000100000f3e <main+10>:   ret

This does not look like our C code at all! And thankfully so! What a waste of space and CPU time it would have been had we treated C like an interpreted language! C is a high-level language with numerous compiler implementations that can intelligently convert the human-readable code into the binary code that represents the real situation behind the code.

The point here is that you are not properly guessing the assembler code that will be produced. The compiler is doing a better job of that; that is the compiler's job. As a programmer, you can just focus on the algorithm. C is not an assembler macro language. For that, you would use things like "gas".
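To see the comparison survive optimization, the operands have to be unknown at compile time. Here is a minimal sketch, assuming a hypothetical helper named values_equal that is not part of the example above. It copies its arguments into volatile locals so that the compiler must load them from memory and cannot fold the comparison away. What instructions actually come out still depends on the compiler, the target and the flags.

```c
#include <stdio.h>

/* Hypothetical helper, not from the original example.
   The volatile qualifiers force real loads of foo and bar, so the
   compiler cannot treat the comparison as a compile-time constant. */
int values_equal(int foo_in, int bar_in) {
    volatile int foo = foo_in;
    volatile int bar = bar_in;

    if (foo == bar) {
        printf("Fun\n");
        return 1;   /* values matched, message printed */
    }
    return 0;       /* values differed, nothing printed */
}
```

Calling values_equal(10, 20) from main and rebuilding with -O1 should leave a comparison and a conditional branch in the disassembly, because the dead-code argument no longer applies.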

kevinnk · on March 25, 2012

C is not assembly and hasn't been for a very long time. But I think when people use the phrase "portable assembler" they really mean that in C you control the memory layout of data types very finely and that code maps very directly to an equivalent assembly construct. True, optimizers frequently change the actual executed code from what we expect, but C gives a very intuitive feel for what the "upper bound" assembly output is.

For example in C "array[0] = (x + y);" will never be more than a couple assembly instructions long. In many languages, including Haskell (and in the case of operator overloading, C++), the equivalent construct might map to hundreds if not thousands of instructions. Or it might map to the same one or two that C would emit. It's impossible to know and there is no reasonable upper bound on what could happen.
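As a rough illustration, the statement can be wrapped in a function and inspected with the same gdb approach used earlier in the thread. The function name store_sum is hypothetical and chosen only for this sketch. On common compilers and targets it tends to compile to an add and a store, though nothing in the language guarantees an exact instruction count.

```c
/* Hypothetical wrapper around the statement discussed above.
   With plain int operands there is no overloading and no hidden call,
   so the work is one addition and one store through the pointer. */
void store_sum(int *array, int x, int y) {
    array[0] = (x + y);
}
```

Compiling this with "gcc -g -O1" and disassembling store_sum is a quick way to check the claim on a particular platform.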

stcredzero · on March 25, 2012

> might map to hundreds if not thousands of instructions. Or it might map to the same one or two that C would emit. It's impossible to know and there is no reasonable upper bound on what could happen.

Over every possible piece of code that could be compiled anywhere, this might well be true. But for a properly informed programmer for a given piece of code, not so much.

kevinnk · on March 25, 2012

>But for a properly informed programmer for a given piece of code, not so much.

There are a couple of reasons that this is still important even for "informed" programmers:

1) For most dynamic languages, even simple operations can take a highly variable amount of time to execute. How many instructions does an array access take in JavaScript? The answer depends on everything from the state of the JIT to the types involved, both of which are usually impossible to know beforehand. In C we can answer this pretty easily.

2) The modern trend is towards writing more and more generic code. Even for statically compiled languages like C++ and Haskell, the actual underlying operations are purposely* abstracted away from you. Unless you know every possible context in which your code could be used, it is impossible to know how long any operation will take.

And all this is assuming that the programmer knows everything about their compiler, assembler, standard library, imported libraries, etc., which isn't true for all but the most expert programmers.

*Admittedly, the actual length of time it takes is dependent on the state of the processor which can be very difficult to predict, but we will have a lot more information than we would have had otherwise.

stcredzero · on March 25, 2012

You need to take both the "informed" and "given." Not all pieces of code are "cross platform" and even within that, there's different levels.

In other words, you're talking about one end of the spectrum. You are right, though, that things are moving in that direction.

derleth · on March 25, 2012

In general, I agree with you. I just feel the need to expand on a few points.

> in C you both control the memory layout of data types very finely

True to an extent.
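A small sketch of what "to an extent" can look like in practice: you choose the member order, but the compiler chooses the padding. The struct and function names here are hypothetical, and the actual numbers are implementation-defined, so none are stated.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical struct for illustration only. */
struct sample {
    char tag;    /* first member, always at offset 0 */
    int  value;  /* usually preceded by alignment padding */
};

/* Bytes in the struct that belong to no member. */
size_t sample_padding(void) {
    return sizeof(struct sample) - (sizeof(char) + sizeof(int));
}

/* Print the layout the current compiler and ABI actually chose. */
void print_sample_layout(void) {
    printf("offset of value: %zu\n", offsetof(struct sample, value));
    printf("sizeof(struct sample): %zu\n", sizeof(struct sample));
    printf("padding bytes: %zu\n", sample_padding());
}
```

Running print_sample_layout on two different platforms, or with a packing extension enabled, shows how much of the layout is decided by the implementation rather than by the source.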

> code maps very directly to an equivalent assembly construct

True but less and less relevant.

Here's where C disconnects you from the processor in the ways that matter most:

1. malloc()/free() are too high-level: You can't control where the allocation subsystem gets your next chunk from, you can't see whether your malloc arena is getting full, you can't see whether you're about to double-free something, and you have no way to recover from a failure to allocate (if that's even possible on your OS).

2. C has no concept of cache; admittedly, assembly usually tries to hide it from you to an extent as well, but assembly language at least has hooks into the cache hardware in the form of memory barriers. C doesn't even have that much.

3. C completely hides the processor status word from you. A minor concern, usually, except in precisely the kind of tight loops people most advocate C for.

4. C has no concept of out-of-order execution or opcode pairing or pipelining in general. Just hope your compiler does.

So, added up, that means C is farther and farther from the hardware all the time. It was reasonably close on the PDP-7 where it was born, was fortuitously even closer to the PDP-11 where it was later implemented, and remained fairly good for a while after, but once you get to dual-core superscalar designs with cache hierarchies and SIMD hardware, you have to rely on the compiler to turn your C into good assembly. Which is, really, a lot like what you do when you write Haskell.

Danieru · on March 25, 2012

"I know that the computer is comparing one value to another"

Actually it isn't.