A Floating GCC Optimization

A couple of months ago, while debugging the build of the VM, I found a problem with the optimization of GCC 8.3 (I am not sure what other versions may show it). Basically, this GCC version does not compile one of our virtual machines functions well, because it unexpectedly removes what looks like valid code. The affected code in the VM was the following:

Spur64BitMemoryManager >> fetchLong32: fieldIndex ofFloatObject: oop
    "index by word size, and return a pointer as long as the word size"
    | bits |

    (self isImmediateFloat: oop) ifFalse:
        [^self fetchLong32: fieldIndex ofObject: oop].
    bits := self smallFloatBitsOf: oop.
    ^self
        cCode: [(self cCoerceSimple: (self addressOf: bits) to: #'int *') at: fieldIndex]
        inSmalltalk: [
            self flag: #endian.
            fieldIndex = 0
                ifTrue: [bits bitAnd: 16rFFFFFFFF]
                ifFalse: [bits >> 32]]

This method is used to extract the two 32 bits words that are stored in a Float. It handles the cases when the Float number is an immediate value and when it is not. The problem is in the part that handles the immediate value. In the Smalltalk simulator, for extracting the first Int32 it performs and a #bitAnd: and a bit shift (#>>) for the second one.

As you can see, in the last part of the method, there is a conditional that handles a special case differently in the Pharo simulated code and in the translated C code. The C code was being generated form this then:

(self cCoerceSimple: (self addressOf: bits) to: #'int *')      at: fieldIndex

And the generated C code was the following:

 ((int *)(&bits))[fieldIndex];

This is accessing the memory of the bits variable using indirect memory access. Somehow, this pattern of code was marking all usages of the variable bits as dead code, and so all dead code was being removed. To detect the problem it was not easy, basically, I have used the Pharo Tests to detect the failing function and then to understand the generation of code in GCC, we need to debug its generation. After some digging, I understood this was a problem related to some undefined behavior and aliasing of variables in the stack, which you can read more about in here and here. However, in the middle I learned how to debug GCC optimizations, so the post is about that, I hope somebody finds it interesting and useful.

Starting from a simple reproducible case

I managed to create first a simpler case showing the problem:

long __attribute__ ((noinline)) myFunc(long i, int index){
  long v;
  long x = i >> 3;
  v = x;
  return ((int*)(&v))[index];
}

#include <stdio.h>

int main(){
   long i;
   int x;

   scanf("%ld", &i);
   scanf("%d", &x);

   printf("%ld",myFunc(i,x));
}</pre>

This small function is just taking a long value (64bit integer), then it shifts the value and stores it in a temporary variable (a variable that is in the stack of the function). Finally, it performs the same indirect access that we have seen in the VM code showing the error.

So, to continue the investigation let’s compile the function with different optimization levels, as bigger the number of the level bigger the number of optimizations the compiler will do. 

Understanding the Level 1 generated code

When using the optimization level 1 (-01) it generates the following code:

myFunc:
    sarq $3, %rdi
    movq %rdi, -8(%rsp)
    movslq %esi, %rsi
    movslq -8(%rsp,%rsi,4), %rax
    ret

Before continuing let’s analyze this piece of code. We see only the code of the function myFunc.  To understand it, we need to have some little background on how the parameters are passed in the calling convention in System V x86_64 (in this case a Linux 64bits).  Basically, the register rdi is used for the first parameter, the register rsi is used for the second parameter and the register rax is used for holding the return value. The other important register here is the rsp, that holds the address to the top of the execution stack.

However, using the optimization level 2 (-02) generates a much compact version:

myFunc:
    movslq %esi, %rsi
    movslq -8(%rsp,%rsi,4), %rax
    ret

As you can see, the more optimized version is losing the bit shift and
the first mov. It keeps only the final access to the memory and the return. 

Generating the snippets

The first snippet is generated using:

$ gcc -O2 prb.c -S -o prb.s -Wall

and the second using:

$ gcc -O1 prb.c -S -o prb.s -Wall

Debugging GCC steps

The GCC compiler is implemented as a pipeline of transformations. Each transformation takes the previous result and applies one of the steps of the process. When you configure different optimization levels and compilation options different pipelines are executed. 

The GCC pipeline has different steps, so I centered myself in the
tree optimizations. To allow debugging of GCC, using the correct options it dumps each of the intermediate states of the compilation. The intermediate transformations work with C like trees. For example to get the optimized version of the program, before generating the Assembler we have to do:

$ gcc -O2 prb.c -S -o prb.s -Wall -fdump-tree-optimized=/dev/stdout

This will generate:

__attribute__((noinline))
myFunc (long int i, int index)
{
 long int v;
 long unsigned int _1;
 long unsigned int _2;
 int * _3;
 int _4;
 long int _8;

<bb 2> [local count: 1073741825]:
 _1 = (long unsigned int) index_7(D);
 _2 = _1 * 4;
 _3 = &v + _2;
 _4 = *_3;
 _8 = (long int) _4;
 v ={v} {CLOBBER};
 return _8;
}

This is the optimized SSA (static single assign) version of the function. As you can see in this version the code is already optimized, and our operations not correctly generated. I expected to see something like:

__attribute__((noinline))
myFunc (long int i, int index)
{
 long int x;
 long int v;
 long unsigned int _1;
 long unsigned int _2;
 int * _3;
 int _4;
 long int _10;

<bb 2> [local count: 1073741825]:
 x_6 = i_5(D) >> 3;
 v = x_6;
 _1 = (long unsigned int) index_9(D);
 _2 = _1 * 4;
 _3 = &v + _2;
 _4 = *_3;
 _10 = (long int) _4;
 v ={v} {CLOBBER};
 return _10;

}

In the first SSA version, we are lacking the shift operation:

x_6 = i_5(D) >> 3;
v = x_6;

So, we need to see which of the optimizations and transformations is losing our code. To see all the steps we should execute:

$ gcc -O2 prb.c -S -o prb.s -Wall -fdump-tree-all

This will generate a lot, really a lot, of files in the same directory with the name prb.c.xxx.yyyy. Where xxx is the number of the step, and yyyy is the name of the step. Each of the files contains the result of applying the changes, so what I have done is looking in a binary search where my code was lost.

I have found that the problem was in the first dead code elimination
step (prb.c.041t.cddce1). GCC does not like that we are copying a stack variable and then accessing directly as memory:

v = x;
return ((int*)(&v))[index];

So, basically, it was assuming that we were only using v and not x, so all the code modifying x is discarded (this optimization is called “dead store code deletion”). Basically, it tries to remove all the code that is affecting stack (temporary) variables that are never used.

Solving the problem

I have fixed the problem by writing the code in a different way. Actually, it was only necessary to remove the special C case in the original code, and let the Slang C code translator do his thing.

Spur64BitMemoryManager >> fetchLong32: fieldIndex ofFloatObject: oop
    "index by word size, and return a pointer as long as the word size"
    | bits |

    (self isImmediateFloat: oop) ifFalse:
        [^self fetchLong32: fieldIndex ofObject: oop].
    bits := self smallFloatBitsOf: oop.
    ^ fieldIndex = 0
          ifTrue: [bits bitAnd: 16rFFFFFFFF]
          ifFalse: [bits >> 32]]

Conclusion

I had published a smaller version of this post in the Pharo mailing list when commenting on the bug.

I wanted to store this knowledge and propagate it before I eventually flush it from my mind. I like these things, but I am not debugging the GCC optimization pipelines every day.

All in all, be careful about undefined behaviors!

Published by Pablo Tesone

I am one of the Pharo Consortium Engineers. I am an enthusiast object-oriented programming and environment!

2 thoughts on “A Floating GCC Optimization

  1. Here’s a comment by a reddit user u/boredcircuits in response to my posting this article to r/C_programming. The post has been since deleted but I could save the comment.

    Something isn’t right about his explanation at the end. He says it has something to do with the temporary variable, but I can reproduce it without any temporary at all:

    i >>= 3;
    return ((int*)&i)[index];

    In other words, x and v aren’t the problematic temporary variables. Instead the problem is the temporary pointer created by &i. A more clear picture emerges with this code:

    i >>= 3;
    int* t = (int*)&i;
    return t[index];

    The compiler is allowed to assume that different types don’t alias. t and i are completely independent, and changes to one don’t affect the other. Thus, to the compiler the shift of i doesn’t have any observable side effects and can be removed.

    There’s four solutions:

    1. ⁠Compile with -fno-strict-aliasing so GCC doesn’t make this assumption. example
    2. ⁠Use a union to perform the type punning. example
    3. ⁠Use memcpy to perform the type punning and rely on the compiler to optimize it out. example
    4. ⁠Don’t use type punning. example (Note: may provide different results depending on endianness.)

    In the end, the problem here is (and the author briefly touches on this), reliance on Undefined Behavior. Though this gets into a very controversial subject with some people.

    Like

Leave a comment