n-bodies: a parallel TBB solution: serial body hot spots

In my last venture I got the n-bodies program to compile and ran a test series with the serial algorithm, showing the n-squared nature of the basic problem. I mean to write a parallel version of this (heh, heh, heh) but first I need to know what is taking up the time. By the dictates of Amdahl’s Law, I want to apply the most processors where the program spends most of its time, its hot spots, to do the most good. The most common way to find them is to interrupt the processor regularly and record where it is in the program, accumulating these locations to build a picture of where the HW thread (or threads) spends its time. This technique is one of several used in Intel’s most recent performance analysis tool, Intel® Parallel Amplifier.
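As a quick illustration of why Amdahl’s Law points at the hot spots, here is a minimal sketch of the speedup formula. The function name amdahlSpeedup and its parameters are my own for illustration, not part of the n-bodies program.

```cpp
#include <cassert>

// Amdahl's Law: if a fraction p of the run time can be spread perfectly
// over n workers and the remaining (1 - p) stays serial, the overall
// speedup is 1 / ((1 - p) + p / n).
// amdahlSpeedup is a hypothetical helper, not part of the n-bodies code.
double amdahlSpeedup(double parallelFraction, int workers) {
    assert(parallelFraction >= 0.0 && parallelFraction <= 1.0);
    assert(workers >= 1);
    return 1.0 / ((1.0 - parallelFraction) + parallelFraction / workers);
}
```

With eight workers, parallelizing code that accounts for 90% of the run time gives a far larger speedup than parallelizing code that accounts for 50%, which is why it pays to find out where the time actually goes first.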



Parallel Amplifier installs right in Visual Studio as shown above. In order to collect hot spots on the serial algorithm, I switch the debug command to single 256 serial.



I’ve also turned on symbols in my Release configuration (C/C++ >> General >> Debug Information Format set to and Linker >> Debugging >> Generate Debug Info set to on my latest build). Then I just click on the Profile button, and voilà!



Huhhhhhh?! I see two seconds plus a quarter spent in main, but where are my functions? Do I get the same result if I try the Debug configuration?



Oh, there are my functions, runSerialBodies and addAcc, but the run takes over 5 seconds. I don’t want to spend time making Debug code run faster, so I want to tune the optimized Release code. However, something about that Release configuration is causing the functions to disappear. Experimenting a little with the configuration settings reveals that the Intel compiler is automatically inlining the functions into main. Unfortunately, there appears to be no way to represent that inlining in the debug information, so the functions just disappear. By relaxing the optimization a little, I can restore the function hierarchy for analysis at the cost of some extra function call instructions:
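The project setting that was relaxed is not spelled out in the text. A different, source-level way to get a similar effect, and not what this post does, is to mark the functions of interest as non-inlinable so they keep their own entries in the profile. The sketch below assumes the Microsoft-style __declspec(noinline) is accepted by the compiler in use on Windows; the macro PROFILE_NOINLINE and both demo functions are hypothetical stand-ins, not the real n-bodies code.

```cpp
// Pick a compiler-specific "do not inline" marker.
// Assumption: the Intel compiler on Windows defines _MSC_VER and accepts
// the Microsoft spelling; GCC and Clang use the attribute form.
#if defined(_MSC_VER)
#  define PROFILE_NOINLINE __declspec(noinline)
#elif defined(__GNUC__)
#  define PROFILE_NOINLINE __attribute__((noinline))
#else
#  define PROFILE_NOINLINE  // unknown compiler: leave inlining decisions alone
#endif

// Hypothetical stand-in for addAcc: accumulate one contribution.
PROFILE_NOINLINE double addAccDemo(double acc, double mass, double invDist3) {
    return acc + mass * invDist3;
}

// Hypothetical stand-in for runSerialBodies: calls addAccDemo in a loop.
// Because both functions are marked noinline, a sampling profiler can
// attribute time to each of them instead of folding everything into main.
PROFILE_NOINLINE double runSerialDemo(int steps) {
    double acc = 0.0;
    for (int i = 0; i < steps; ++i) {
        acc = addAccDemo(acc, 1.0, 0.5);
    }
    return acc;
}
```

The trade-off is the same one described above: the call overhead comes back, so the timings are slightly pessimistic compared with the fully inlined build.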



Now my hot spot analysis on the Release configuration looks much better:



Most of the time is being spent in the addAcc function, which is being called by runSerialBodies as can be seen in the function call hierarchy graph. Looks like addAcc will be one of my candidates for parallelization.
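To see why addAcc would dominate the profile, here is a minimal sketch of what an n-squared serial pass might look like. Only the names runSerialBodies and addAcc come from the program; the Body struct, the signatures, the unit gravitational constant, the softening term, and the returned pair count are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed layout: position, accumulated acceleration, and mass.
struct Body {
    double pos[3];
    double acc[3];
    double mass;
};

// Assumed form of addAcc: add the mutual gravitational acceleration of
// bodies i and j to both of them (G taken as 1 for simplicity).
void addAcc(std::vector<Body>& bodies, std::size_t i, std::size_t j) {
    double d[3];
    double distSq = 1e-12;  // tiny softening term to avoid dividing by zero
    for (int k = 0; k < 3; ++k) {
        d[k] = bodies[j].pos[k] - bodies[i].pos[k];
        distSq += d[k] * d[k];
    }
    const double invDist3 = 1.0 / (distSq * std::sqrt(distSq));
    for (int k = 0; k < 3; ++k) {
        bodies[i].acc[k] += bodies[j].mass * d[k] * invDist3;  // pull toward j
        bodies[j].acc[k] -= bodies[i].mass * d[k] * invDist3;  // pull toward i
    }
}

// Assumed form of runSerialBodies: visit every unordered pair once.
// Returns the number of addAcc calls, which is n * (n - 1) / 2.
std::size_t runSerialBodies(std::vector<Body>& bodies) {
    std::size_t calls = 0;
    for (std::size_t i = 0; i < bodies.size(); ++i) {
        for (std::size_t j = i + 1; j < bodies.size(); ++j) {
            addAcc(bodies, i, j);
            ++calls;
        }
    }
    return calls;
}
```

Under these assumptions, a 256-body run calls addAcc 32,640 times per pass while runSerialBodies itself does little more than loop bookkeeping, which is consistent with the profile showing most of the time inside addAcc.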

Next time: serial body drill-down
