Optimizing code via compiler flags

Posted by Steve on Mon 12 Sep 2011 at 19:12


When you're developing performance-critical code, performance increases usually come in one of two ways: by selecting a more appropriate algorithm, or at the expense of additional hardware. Here we'll look at an interesting alternative - optimising via compiler flags.

Genetic algorithms have fascinated me for years, and one interesting project I've long been vaguely curious about is the research-based acovea tool. Given a small benchmark program which represents a current bottleneck in your code, this tool will recompile it over and over again with different sets of compiler flags.

After compiling the code with a given set of compiler flags and timing the result, the best-performing flag sets are kept and "bred" together. In short, if you have a suitable program you can expect to return (many hours later) to a faster program, with no human involvement.

To demonstrate this I've put together a trivial prime-number finding program. This simple program does nothing fancy, and could be optimised algorithmically with ease - however it is a suitable demonstration. (I suspect it'd be almost a thousand times faster without the I/O caused by printf. Ignore that!)

The sample code is as follows:


#include <stdio.h>

int
main (int argc, char *argv[])
{
  int i = 1, num;
  int j, k;

  for (num = 2; num < 32768; num++)
    {
      while (i <= num)
	{
	  k = 0;
	  if (num % i == 0)
	    {
	      j = 1;
	      while (j <= i)
		{
		  if (i % j == 0)
		    k++;
		  j++;
		}
	      if (k == 2)
		printf ("%d is prime\n", i);
	    }
	  i++;
	}
    }
  return 0;
}

Once you've saved this code to ~/bench.c you can compile it and time its execution with:

skx@birthday:~$ gcc -o bench bench.c 
skx@birthday:~$ time ./bench 
...
...
real	0m9.487s
user	0m9.273s
sys	0m0.008s

The execution time on your system may be different, but for me a simple execution took 9.5 seconds. Now let's make it faster. We'll start by installing the appropriate package:

root@birthday:# apt-get install acovea
Reading package lists... Done
Building dependency tree 
Reading state information... Done
...
The following extra packages will be installed:
  libacovea-5.1-5 libcoyotl-3.1-4 libevocosm-3.1-1
Recommended packages:
  acovea-results
The following NEW packages will be installed:
  acovea libacovea-5.1-5 libcoyotl-3.1-4 libevocosm-3.1-1
..
Setting up libevocosm-3.1-1 (3.1.0-3.1) ...
Setting up acovea (5.1.1-2) ...

Once installed we can then start working. The first thing we need to do is load a "profile". A profile is a simple XML file which describes the compiler flags which may be tested - so you could update the tool to work with Intel's icc compiler, for example.

Several default profiles are installed beneath /usr/share/libacovea/config, so if you do wish to tweak things that is where you should start. As I'm running on an Intel host I started the process by executing:

skx@birthday:~$ runacovea -config gcc34_intel.acovea -input bench.c

Note: With no tweaks to the population size, number of generations, or similar, this compilation and benchmarking effort took most of a day, and the end result was:

generation 20 complete, average fitness: 461.928

Acovea completed its analysis at 2011 Sep 06 19:06:37

Optimistic options:

                       -ftree-copyrename  (1.509)
                       -fmerge-constants  (2.071)
                          -fcrossjumping  (1.951)
                      -fcse-follow-jumps  (1.911)
                             -fno-inline  (1.951)
                                -ftracer  (1.79)

Pessimistic options:

               -momit-leaf-frame-pointer  (-1.545)
                        -floop-optimize2  (-1.786)
                             -fforce-mem  (-1.786)
                            -fsched-spec  (-1.505)
                      -funroll-all-loops  (-1.626)
          -fbranch-target-load-optimize2  (-1.706)
                            -mfpmath=387  (-1.545)
                            -mfpmath=sse  (-1.746)

Acovea's Best-of-the-Best:
gcc -lrt -lm -std=gnu99 -O1 -march=opteron -fno-thread-jumps \
    -fno-if-conversion2 -fno-delayed-branch -fno-loop-optimize \
    -ftree-dce -ftree-sra -ftree-copyrename -ftree-fre -ftree-ch  \
    -fmerge-constants -fcrossjumping -fcse-follow-jumps -fcse-skip-blocks \
    -fexpensive-optimizations -fstrength-reduce -frerun-cse-after-loop  \
    -frerun-loop-opt -fforce-addr -fpeephole2 -fschedule-insns2 -fregmove \
    -freorder-blocks -fsched-interblock -funit-at-a-time \
    -falign-functions -falign-jumps -falign-loops -falign-labels \
    -finline-functions -fno-inline -ftracer -fmodulo-sched -fgcse-sm \
    -freschedule-modulo-scheduled-loops -ftree-loop-im -ftree-loop-ivcanon \
    -maccumulate-outgoing-args -mno-align-stringops -D__NO_MATH_INLINES \
    -funsafe-math-optimizations -fno-signaling-nans \
    -o /tmp/ACOVEAF92D8CC5 bench.c

As you can see it has managed to generate a horrifically large compile command - with many options I've never seen before. However the important question is: how well did it do?

Recompiling the code manually using the suggested flags yielded an execution time of 7.5 seconds. No code changes, yet the program became two seconds faster. That is a pretty astonishing result, albeit one we could probably have replicated with a better algorithm in the first place.

For this tool to be useful I expect you'll need to be writing code which is CPU-bound in some fashion, but which is still simple enough to condense into a useful benchmark. Given that the benchmark must be compiled and executed many times, the search itself will be slow, which is a potential downside. (i.e. If you have to "simplify" your real code to make it worth trying acovea on it, you might find the suggested results just don't apply to the real thing.)

Posted by Anonymous (82.8.xx.xx) on Tue 13 Sep 2011 at 21:17
You should have compared the acovea optimum with the "usual" optimization flags (-O2) and not with the unoptimized compilation. Do you have the numbers for that?


Posted by Steve (90.193.xx.xx) on Tue 13 Sep 2011 at 21:25

That's a good point. Numbers below.

$ for i in -O0 -O1 -O2 -O3 ; do gcc $i -o ./x-$i bench.c ; done

$ time ./x--O0
real	0m8.269s
user	0m9.677s
sys	0m0.260s

$ time ./x--O1
real	0m8.170s
user	0m8.268s
sys	0m0.000s

$ time ./x--O2
real	0m8.172s
user	0m8.276s
sys	0m0.000s

$ time ./x--O3
real	0m7.775s
user	0m7.776s
sys	0m0.000s

Hrm. The "-O3" version is almost as good as the acovea result. That kinda ruins my happy happy joy joy moment!

Steve


Posted by jiteo (98.124.xx.xx) on Sat 1 Oct 2011 at 18:54
It's still a nifty idea - I enjoyed reading this. Although I think the article is mis-titled. It's not so much about optimizing code with compiler flags (using -O3 is optimizing code with compiler flags), it's about using genetic algorithms to figure out what compiler flags to use.

