
This weekend I thought it would be a good idea to run the Eressea server both with and without optimizations enabled and compare the output. In theory, optimization should not change the results, so any difference would hint at bugs like uninitialized variables or illegal memory accesses.
Needless to say, the output wasn't the same. It was slightly different, with what looked like a small error snowballing towards the end. I'll spare you the tale of a day spent trying to narrow down the exact location and cut right to the chase:
There are more optimization options than you can shake a gnu at, and it takes time to find out which one is breaking your code. Like Visual C++ (with /fp), GCC has optimization options that change the behavior of floating-point operations. And since they change the results of your program and its compliance with the IEEE and ANSI standards, they are disabled unless explicitly specified. For example:
-funsafe-math-optimizations
Allow optimizations for floating-point arithmetic that (a) assume that arguments and results are valid and (b) may violate IEEE or ANSI standards. When used at link-time, it may include libraries or startup files that change the default FPU control word or other similar optimizations. This option should never be turned on by any -O option since it can result in incorrect output for programs which depend on an exact implementation of IEEE or ISO rules/specifications for math functions. The default is -fno-unsafe-math-optimizations.
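To make "may violate IEEE or ANSI standards" concrete: one of the things this option permits is reassociating additions, and floating-point addition is not associative. A minimal sketch (the values are only illustrative):

#include <stdio.h>

int main(void) {
    /* Floating-point addition is not associative: 1e16 + 1.0 rounds
     * back to 1e16, so the 1.0 is lost in one grouping but survives
     * in the other. */
    double a = 1.0, b = 1e16, c = -1e16;
    printf("(a + b) + c = %g\n", (a + b) + c);  /* prints 0 */
    printf("a + (b + c) = %g\n", a + (b + c));  /* prints 1 */
    return 0;
}

By default the compiler has to evaluate these exactly as grouped; with -funsafe-math-optimizations it is allowed to treat the two forms as interchangeable, and your results change with it.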
Now, we all know floating-point math is a dark art. Processors have different-sized registers, some numbers (like 0.7) cannot be represented exactly, and rounding is a science in itself. This is why there are IEEE standards for what exactly should happen. Of course, following those standards clashes with optimization. So after a lot of poking around in the wrong places, I began to get suspicious of floating-point math, and I finally narrowed my bug down to this little test program:
#include <stdio.h>

int main(int argc, char **argv) {
    float f = 0.7F;
    printf("%d\n", (int)(f * 100));
    return 0;
}
I know, I know. Casting is evil. But knowing that 0.7 is one of those numbers we can't represent exactly in a float, I wasn't surprised to see that this prints 69. I was, however, surprised that with -Os optimizations, it prints 70. And here's why:
main:
        pushl   %ebp
        movl    %esp, %ebp
        pushl   $70        ; WTF?
        pushl   $.LC2
        call    printf
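The compiler has folded the whole expression into the constant 70 at compile time. To see where each answer comes from, here's a quick sketch (the behavior of the second line depends on whether the multiply happens in the x87's 80-bit registers or in plain single precision):

#include <stdio.h>

int main(void) {
    float f = 0.7F;
    /* The nearest single-precision value to 0.7 is slightly below it. */
    printf("%.12f\n", (double)f);      /* 0.699999988079 */
    /* In the x87's 80-bit registers, f * 100 stays at 69.9999988...,
     * which truncates to 69 when cast to int. Rounded to single
     * precision instead, the product is exactly 70.0. */
    printf("%d\n", (int)(f * 100));
    return 0;
}

So 69 is what the x87 computes at run time, and 70 is presumably what the constant folder computed at single precision; neither answer is wrong, they just disagree.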
WTF is the point of disabling all the other math optimizations by default, if you're going to do this kind of thing and make the whole reproducibility of results moot? It turns out the culprit is a behavior GCC does enable by default: keeping floating-point values in the FPU's registers, at more precision than their type calls for. There is an option to turn exactly that off:
-ffloat-store
Do not store floating point variables in registers, and inhibit other options that might change whether a floating point value is taken from a register or memory. This option prevents undesirable excess precision on machines such as the 68000 where the floating registers (of the 68881) keep more precision than a double is supposed to have. Similarly for the x86 architecture. For most programs, the excess precision does only good, but a few programs rely on the precise definition of IEEE floating point. Use -ffloat-store for such programs, after modifying them to store all pertinent intermediate computations into variables.
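The "precise definition of IEEE floating point" that a few programs rely on shows up in comparisons like the one in this sketch (whether you actually see a difference depends on the target, the optimization level, and these flags):

#include <stdio.h>

int main(int argc, char **argv) {
    double x = argc;      /* a runtime value, so nothing gets constant-folded */
    double z = x / 3.0;
    /* On an x87 target with optimization enabled, z is rounded to 64
     * bits when it is stored, while the x / 3.0 below may still hold
     * all 80 bits of the register it was computed in, so this can
     * print "not equal". -ffloat-store forces the rounding for
     * variables like z; as the docs above say, intermediates like the
     * right-hand side must also be put into variables of their own. */
    if (z == x / 3.0)
        printf("equal\n");
    else
        printf("not equal\n");
    return 0;
}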
And when I turned that behavior off by compiling with -ffloat-store, my output was the same, as it should be. Lesson learned. I have to say, though, that things are a lot easier in Visual C++: there are three options for math, /fp:{precise,strict,fast}, and they are a lot more intuitive.
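One last trick for chasing this class of bug: C99 exposes the evaluation precision as FLT_EVAL_METHOD in <float.h>, so a program can report how its compiler evaluates floating-point expressions:

#include <float.h>
#include <stdio.h>

int main(void) {
    /*  0: evaluate in the declared type (typical for SSE math)
     *  1: evaluate float and double as double
     *  2: evaluate everything as long double (typical for x87)
     * -1: indeterminable */
    printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
    return 0;
}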