Translating C to C: Expressions

Expression evaluation is the process of translating expressions, such as the following, into machine instructions.

sum += A[i]++ ;
eolist = ( p == NULL || *p == '\0' ) ;

The classic, though rare, method

In the second programming course, expression evaluation is often used as an example of programming with stacks. For example, an expression such as a*x*x + b*x + c will be parsed into the equivalent postfix, or Reverse Polish, notation, in this case a x x * * b x * + c +. In many computer organization textbooks, this is the first step toward generating stack-based assembly code similar to the following:

    PUSH   a
    PUSH   x
    PUSH   x
    MULT
    MULT
    PUSH   b
    PUSH   x
    MULT
    PLUS
    PUSH   c
    PLUS

In a stack-based computer like the Burroughs B5000 from the 1960s, an instruction like MULT would remove the top two elements of the stack and replace them with their product. Stack-based scientific calculators, such as the early HP-35, operate similarly. The popular PDF file format also grew out of PostScript, a stack-based programming language.
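To make the behavior of PUSH, MULT, and PLUS concrete, here is a minimal sketch of a toy stack machine in C that runs the eleven-instruction listing above. The opcode names, the struct instr layout, and the functions run and quadratic are made up for this illustration and do not correspond to the B5000 or any real instruction set. The stack has a fixed depth and no overflow checks.

```c
#include <stddef.h>

/* Hypothetical opcodes for a toy stack machine (not a real instruction set). */
enum op { OP_PUSH, OP_MULT, OP_PLUS };

struct instr {
    enum op op;
    double  operand;    /* used only by OP_PUSH */
};

/* Run n instructions and return the value left on top of the stack. */
double run(const struct instr *code, size_t n)
{
    double stack[32];   /* fixed depth, no overflow check in this sketch */
    size_t sp = 0;      /* number of values currently on the stack */

    for (size_t i = 0; i < n; i++) {
        switch (code[i].op) {
        case OP_PUSH:
            stack[sp++] = code[i].operand;
            break;
        case OP_MULT:   /* remove the top two values, push their product */
            sp--;
            stack[sp - 1] = stack[sp - 1] * stack[sp];
            break;
        case OP_PLUS:   /* remove the top two values, push their sum */
            sp--;
            stack[sp - 1] = stack[sp - 1] + stack[sp];
            break;
        }
    }
    return stack[sp - 1];
}

/* The program from the listing: a x x * * b x * + c + */
double quadratic(double a, double b, double c, double x)
{
    const struct instr code[] = {
        { OP_PUSH, a }, { OP_PUSH, x }, { OP_PUSH, x },
        { OP_MULT, 0 }, { OP_MULT, 0 },
        { OP_PUSH, b }, { OP_PUSH, x }, { OP_MULT, 0 },
        { OP_PLUS, 0 },
        { OP_PUSH, c }, { OP_PLUS, 0 },
    };
    return run(code, sizeof code / sizeof code[0]);
}
```

Calling quadratic(2, 3, 4, 5) walks the stack through 2, 5, 5, then 2, 25, then 50, and so on, ending with the single value 2*5*5 + 3*5 + 4.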

Expression evaluation with stacks can be used on the PIC, which has PUSH and POP instructions, but more efficient code can be generated by using registers to store intermediate values. Also, C and Java have some expressions that are difficult to perform on a stack. For example, in evaluating A[i]++, you can’t just put the value of A[i] on the stack and then perform the ++ because ++ needs the address of A[i], not the value of A[i].

Similarly for something like p == NULL || *p == '\0', you can’t put p == NULL and *p == '\0' on the stack and then call the stack operator OR, because you shouldn’t even attempt the evaluation of *p == '\0' when p == NULL is true.
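The two difficulties just described can be sketched in plain C. This is only an illustration of what the expressions require, not the translation scheme developed below. The function names add_and_bump and at_end_of_list are invented for the example.

```c
#include <stddef.h>

/* sum += A[i]++ with the address of A[i] made explicit.
   Returns the new value of sum. */
int add_and_bump(int sum, int A[], int i)
{
    int *addr = &A[i];   /* ++ needs the location of A[i] ...          */
    int  old  = *addr;   /* ... while the expression yields the old value */
    *addr = old + 1;     /* store the incremented value back            */
    return sum + old;
}

/* eolist = (p == NULL || *p == '\0') without ever reading *p
   when p is NULL. */
int at_end_of_list(const char *p)
{
    if (p == NULL)
        return 1;        /* second operand is never evaluated */
    return *p == '\0';   /* safe: p is known to be non-NULL here */
}
```

A pure stack evaluator that pushed both p == NULL and *p == '\0' before applying OR would dereference a null pointer in the first case. The early return is what avoids that.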

The C-to-C solution

Instead of searching for an automatic solution to expression evaluation, we’ll try an ad hoc approach where you translate a complex expression into a sequence of simple assignments where only one operator appears on the right hand side. We’ll need to use made-up variable names to do this, just like those τ variables used in the discussion of translating C control structures.

For a while, we’re going to ignore most of those complex C expressions that involve lvalues (locations). This means you are not going to see pointers, the & operator, or structures here. That will come later.

Parsing

You will need to parse your C code. This means you must pay attention to C’s rules of precedence to know the order in which operators are applied. In a real compiler, this part of the task is usually done with code generated by a compiler-compiler such as yacc or bison.

The simple operators

The simple operators are the arithmetic, bit-wise logical, and relational operators. We’ll also discuss function calls, although the function stack will be presented a bit later.

For example, a statement such as “x = z*sin(f*d) + k” would be translated to a sequence of C statements similar to the following:

τ1 = f*d ;
τ2 = sin(τ1) ;
τ3 = z*τ2 ;
x = τ3 + k;

Notice that there is only one operator on the right hand side of each statement.
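As a quick check that this kind of rewriting preserves the result, here is a small sketch that places the original statement and its translation side by side. Two liberties are taken, both assumptions of this example: the temporaries are spelled t1, t2, t3 in place of τ1, τ2, τ3, and a made-up function twice stands in for sin so the sketch does not depend on the math library.

```c
/* Hypothetical stand-in for sin(); any one-argument function works here. */
double twice(double v)
{
    return 2.0 * v;
}

/* The statement as written: x = z*twice(f*d) + k */
double direct(double z, double f, double d, double k)
{
    return z * twice(f * d) + k;
}

/* The same statement as a sequence of one-operator assignments. */
double lowered(double z, double f, double d, double k)
{
    double t1 = f * d;       /* innermost operand first      */
    double t2 = twice(t1);   /* then the function call       */
    double t3 = z * t2;      /* * binds tighter than +       */
    return t3 + k;           /* the final x = t3 + k         */
}
```

The order of the temporaries falls out of the parse: the argument of the call is evaluated before the call, and the multiplication before the addition.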

Very simple statements, such as “x = τ3 + k”, can be implemented with a couple of instructions from your target machine’s instruction set. Some operators, such as multiplication or division, may need to be implemented with calls to specialized functions written for your machine architecture. For example, f*d may need to be replaced with something like _MultiplyDouble(f, d) if our computer, like the PIC, does not support a floating point multiply operation. Other operators will need to be translated into short sequences of instructions. For instance, a 32-bit addition might be performed as two 16-bit additions, with the carry from the low half passed into the high half.
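The idea of building a 32-bit addition out of 16-bit additions can be sketched in C as follows. The function name add32_by_halves is invented for this example, and the carry is computed with a comparison because C has no direct access to a carry flag. A real 16-bit machine would typically use an add followed by an add-with-carry.

```c
#include <stdint.h>

/* Add two 32-bit values using only 16-bit additions. */
uint32_t add32_by_halves(uint32_t a, uint32_t b)
{
    uint16_t a_lo = (uint16_t)(a & 0xFFFFu);
    uint16_t a_hi = (uint16_t)(a >> 16);
    uint16_t b_lo = (uint16_t)(b & 0xFFFFu);
    uint16_t b_hi = (uint16_t)(b >> 16);

    /* Low halves first; the cast discards any bit above bit 15. */
    uint16_t sum_lo = (uint16_t)(a_lo + b_lo);

    /* If the low add wrapped around, the result is smaller than a_lo. */
    uint16_t carry = (uint16_t)(sum_lo < a_lo);

    /* High halves plus the carry out of the low add. */
    uint16_t sum_hi = (uint16_t)(a_hi + b_hi + carry);

    return ((uint32_t)sum_hi << 16) | sum_lo;
}
```

The result wraps modulo 2^32, just as unsigned 32-bit addition does in C.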

When you implement the relational operators, such as > and ==, you must make sure that these operators return either 0, for false, or 1, for true, as required by the C standard. For example, here is a faithful PIC implementation of the C statement “r = x > y ;”.

         CLR         r                     ;; r    <- 0
         MOV         x,WREG
         SUBR        y,WREG                ;; WREG <- x - y
         BRA         LE,1f                 ;; go to the next 1:
         INC         r                     ;; ++r only if x > y
1:
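For readers who do not know PIC assembly, the same clear, test, and conditionally increment pattern can be written in C with a goto. This is a sketch that mirrors the listing line by line. The function name greater_than and the label skip are made up, and int16_t is assumed to match the 16-bit operands of the PIC code.

```c
#include <stdint.h>

/* r = x > y, producing exactly 0 or 1. */
int16_t greater_than(int16_t x, int16_t y)
{
    int16_t r = 0;       /* CLR r: assume false            */
    if (x <= y)          /* BRA LE,1f: skip the increment  */
        goto skip;
    r++;                 /* INC r: reached only if x > y   */
skip:
    return r;
}
```

Whatever the operands are, r can only end up as 0 or 1, which is the property the C standard requires of relational operators.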

The more complex operators

There are three C short-circuit operators which may not evaluate all of their operands before yielding a result. These operators are &&, ||, and the ? : ternary operator. These can be implemented using C’s if construct, which can then be translated using the control structure rules.

The following table shows the translation rules for these operators.

τ = exp1 && exp2 ;

if (! (exp1))
   τ = 0 ;
else if (! (exp2))
   τ = 0 ;
else
   τ = 1 ;

τ = exp1 || exp2 ;

if (exp1)
   τ = 1 ;
else if (exp2)
   τ = 1 ;
else
   τ = 0 ;

τ = exp1 ? exp2 : exp3 ;

if (exp1)
   τ = exp2 ;
else
   τ = exp3 ;
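One way to convince yourself that these rules really skip operands is to give each operand a visible side effect. In the sketch below, the functions operand1, operand2, and operand3 are hypothetical stand-ins for exp1, exp2, and exp3. Each one bumps a counter before yielding its value. The variable t plays the role of τ.

```c
/* Evaluation counters, one per operand (for illustration only). */
int evals[3];

/* Hypothetical operands: record that they ran, then yield v. */
int operand1(int v) { evals[0]++; return v; }
int operand2(int v) { evals[1]++; return v; }
int operand3(int v) { evals[2]++; return v; }

/* t = exp1 && exp2 */
int and_rule(int a, int b)
{
    int t;
    if (!operand1(a))
        t = 0;
    else if (!operand2(b))
        t = 0;
    else
        t = 1;
    return t;
}

/* t = exp1 || exp2 */
int or_rule(int a, int b)
{
    int t;
    if (operand1(a))
        t = 1;
    else if (operand2(b))
        t = 1;
    else
        t = 0;
    return t;
}

/* t = exp1 ? exp2 : exp3 */
int cond_rule(int a, int b, int c)
{
    int t;
    if (operand1(a))
        t = operand2(b);
    else
        t = operand3(c);
    return t;
}
```

After and_rule(0, 7), the counter for the second operand is unchanged, because the first branch of the if already decided the result. Note also that && and || always produce 0 or 1, even when the operands are other nonzero values.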

An example

Keep in mind that the expressions exp1, exp2, and exp3 in the rules above must not be evaluated before their time.

As an example, let’s look at some C code to set a 16-bit integer m to the larger of the 16-bit integers x, y, and z.

if (x >= y && x >= z) {
    m = x ;
} else if (y >= z) {
    m = y ;
} else {
    m = z ;
}

If you use the rules of the control structures section, which ignores expression evaluation, you’ll get something like the following:

      int τ1 = x >= y && x >= z ;
      if (τ1 == 0) goto λ1 ;
      m = x ;
      goto λ3 ;
λ1:   int τ2 = y >= z ;
      if (τ2 == 0) goto λ2 ;
      m = y ;
      goto λ3 ;
λ2:   m = z ;
λ3:

Now we have to worry about evaluating the expression with the &&. This is going to get very messy. First, we get the following:

      int τ1 ;
      if (! (x >= y)) {
          τ1 = 0 ;
      } else if (! (x >= z)) {
          τ1 = 0 ;
      } else {
          τ1 = 1 ;
      }
      if (τ1 == 0) goto λ1 ;
      m = x ;
      goto λ3 ;
λ1:   int τ2 = y >= z ;
      if (τ2 == 0) goto λ2 ;
      m = y ;
      goto λ3 ;
λ2:   m = z ;
λ3:

Then we go back and apply the rules of the if else, and it gets even worse.

      int τ1 ;
      int τ3 = ! (x >= y) ;
      if (τ3 == 0) goto λ4 ;
      τ1 = 0 ;
      goto λ6 ;
λ4:   int τ4 = ! (x >= z) ;
      if (τ4 == 0) goto λ5 ;      
      τ1 = 0 ;
      goto λ6 ;
λ5:   τ1 = 1 ;
λ6:   if (τ1 == 0) goto λ1 ;
      m = x ;
      goto λ3 ;
λ1:   int τ2 = y >= z ;
      if (τ2 == 0) goto λ2 ;
      m = y ;
      goto λ3 ;
λ2:   m = z ;
λ3:

Huh?

Yep. That’s pretty silly. C compilers really aren’t that simple-minded. Most of them optimize code. First, let’s simplify those comparisons.

      int τ1 ;
      int τ3 = (x < y) ;
      if (τ3 == 0) goto λ4 ;
      τ1 = 0 ;
      goto λ6 ;
λ4:   int τ4 = (x < z) ;
      if (τ4 == 0) goto λ5 ;      
      τ1 = 0 ;
      goto λ6 ;
λ5:   τ1 = 1 ;
λ6:   if (τ1 == 0) goto λ1 ;
      m = x ;
      goto λ3 ;
λ1:   int τ2 = y >= z ;
      if (τ2 == 0) goto λ2 ;
      m = y ;
      goto λ3 ;
λ2:   m = z ;
λ3:

There are some branching inefficiencies here. For example, in one place τ1 is set to 0 before a transfer to λ6 where τ1 is immediately compared to 0 before transferring to λ1. Let’s speed that up a little. We can even eliminate τ1.

      int τ3 = (x < y) ;
      if (τ3 == 0) goto λ4 ;
      goto λ1 ;
λ4:   int τ4 = (x < z) ;
      if (τ4 == 0) goto λ5 ;      
      goto λ1 ;
λ5:   m = x ;
      goto λ3 ;
λ1:   int τ2 = y >= z ;
      if (τ2 == 0) goto λ2 ;
      m = y ;
      goto λ3 ;
λ2:   m = z ;
λ3:

Next we can remove a few of those unconditional goto’s by optimizing the tests.

      int τ3 = (x < y) ;
      if (τ3 != 0) goto λ1 ;
      int τ4 = (x < z) ;
      if (τ4 != 0) goto λ1 ;
      m = x ;
      goto λ3 ;
λ1:   int τ2 = y >= z ;
      if (τ2 == 0) goto λ2 ;
      m = y ;
      goto λ3 ;
λ2:   m = z ;
λ3:

You might find this a little easier to read if we just got rid of τ2, τ3, and τ4.

      if (x < y) goto λ1 ;
      if (x < z) goto λ1 ;
      m = x ;
      goto λ3 ;
λ1:   if (y < z) goto λ2 ;
      m = y ;
      goto λ3 ;
λ2:   m = z ;
λ3:

That’s what any self-respecting C compiler would do.
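To check that nothing was lost along the way, the final goto version can be wrapped in a function and compared with the original if/else code. This is a sketch with a few assumptions: the labels are spelled l1, l2, l3 in place of λ1, λ2, λ3, the function names are invented, and int16_t is used for the 16-bit integers mentioned earlier.

```c
#include <stdint.h>

/* The original structured code. */
int16_t max3_structured(int16_t x, int16_t y, int16_t z)
{
    int16_t m;
    if (x >= y && x >= z) {
        m = x;
    } else if (y >= z) {
        m = y;
    } else {
        m = z;
    }
    return m;
}

/* The final translation: one comparison per conditional goto. */
int16_t max3_goto(int16_t x, int16_t y, int16_t z)
{
    int16_t m;
    if (x < y) goto l1;
    if (x < z) goto l1;
    m = x;
    goto l3;
l1: if (y < z) goto l2;
    m = y;
    goto l3;
l2: m = z;
l3: return m;
}

/* Returns 1 if both versions agree on every triple drawn from -2..2. */
int versions_agree(void)
{
    for (int x = -2; x <= 2; x++)
        for (int y = -2; y <= 2; y++)
            for (int z = -2; z <= 2; z++)
                if (max3_goto((int16_t)x, (int16_t)y, (int16_t)z) !=
                    max3_structured((int16_t)x, (int16_t)y, (int16_t)z))
                    return 0;
    return 1;
}
```

The small range is enough to cover every ordering of the three values, including ties.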

What’s left?