n-bodies: a parallel TBB solution: serial body hot spots

In my last venture I got the n-bodies program to compile and ran a test series with the serial algorithm, showing the n-squared nature of the basic problem. I mean to write a parallel version of this (heh, heh, heh) but first I need to know what is taking up the time. By the dictates of Amdahl’s Law, I want to apply the most processors where the program spends most of its time, its hot spots, to do the most good. The most common way to find them is to interrupt the processor at regular intervals and record where it is in the program, accumulating these locations to build a picture of where the HW thread (or threads) spends its time. This sampling technique is one of several used in Intel’s most recent performance analysis tool, Intel® Parallel Amplifier.
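To put a number on why the hot spots matter, here is a minimal sketch of Amdahl's Law as a helper function. The name amdahlSpeedup and its parameters are my own for illustration and do not come from the n-bodies code; the formula is the standard one, speedup = 1 / ((1 - p) + p / n), where p is the fraction of run time that can be spread over n processors.

```cpp
// Amdahl's Law: the best-case speedup when a fraction of the run time
// (parallelFraction, between 0 and 1) is divided evenly across
// `processors` cores and the rest of the program stays serial.
// Hypothetical helper for illustration; not part of the n-bodies source.
double amdahlSpeedup(double parallelFraction, int processors) {
    const double serialFraction = 1.0 - parallelFraction;
    return 1.0 / (serialFraction + parallelFraction / processors);
}
```

Calling amdahlSpeedup(0.5, 2) gives about 1.33, and even with unlimited cores a half-serial program can never do better than 2x. That is the argument for finding the function that owns most of the run time before adding any threads.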



It installs right in Visual Studio as shown above. In order to collect hot spots on the serial algorithm, I switch the debug command to single 256 serial:



I’ve also turned on symbols in my Release configuration (C/C++ >> General >> Debug Information Format set to and Linker >> Debugging >> Generate Debug Info set to on my latest build), then just click on the Profile button, and voilà!



Huhhhhhh?! I see two seconds plus a quarter spent in main, but where are my functions? Do I get the same result if I try the Debug configuration?



Oh, there are my functions, runSerialBodies and addAcc, but the run takes over 5 seconds. I don’t want to spend time making Debug code run faster; it's the optimized Release code I want to tune. However, something about that Release configuration is causing the functions to disappear. Experimenting a little with the configuration settings reveals that the Intel compiler is automatically inlining the functions into main. Unfortunately, there is apparently no way to represent that inlining in the debug information, so the functions just disappear. By relaxing the optimization a little, I can restore the function hierarchy for analysis at the cost of some extra function call instructions:



Now my hot spot analysis on the Release configuration looks much better:



Most of the time is being spent in the addAcc function, which is being called by runSerialBodies as can be seen in the function call hierarchy graph. Looks like addAcc will be one of my candidates for parallelization.
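As an aside on the inlining problem: relaxing optimization through the project settings is what was done here, but another common route is to mark only the functions of interest as non-inlinable, so they keep their own entries in the profile while the rest of the build stays fully optimized. The sketch below assumes that approach. The macro name PROFILE_NOINLINE, the function signatures, and the 1-D unit-mass arithmetic are stand-ins of my own, since the real addAcc and runSerialBodies bodies are not shown in this post; only the two function names are taken from it.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Keep a function out of line so a sampling profiler can attribute
// samples to it. PROFILE_NOINLINE is a hypothetical macro name.
#if defined(_MSC_VER)
#define PROFILE_NOINLINE __declspec(noinline)       // MSVC-compatible compilers
#else
#define PROFILE_NOINLINE __attribute__((noinline))  // GCC, Clang
#endif

// Hypothetical 1-D stand-in for addAcc: adds the softened inverse-square
// pull of body j (unit mass, G = 1) to the acceleration of body i.
PROFILE_NOINLINE void addAcc(const std::vector<double>& pos,
                             std::vector<double>& acc,
                             std::size_t i, std::size_t j) {
    const double d = pos[j] - pos[i];
    const double distSq = d * d + 1e-9;  // softening avoids division by zero
    acc[i] += d / (distSq * std::sqrt(distSq));
}

// Hypothetical stand-in for runSerialBodies: visits every ordered pair,
// so addAcc is called n * (n - 1) times per pass. Accumulates into acc,
// which the caller is expected to zero beforehand.
PROFILE_NOINLINE void runSerialBodies(const std::vector<double>& pos,
                                      std::vector<double>& acc) {
    const std::size_t n = pos.size();
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            if (i != j) {
                addAcc(pos, acc, i, j);
            }
        }
    }
}
```

The pairwise double loop is also why a function like addAcc tends to dominate the profile: for 256 bodies it would be called 65,280 times on every pass, while the outer function does little besides looping.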

Next time: serial body drill-down


Comments

Thanks, Jim. I appreciate the comments. As one "propeller head" to another, I'm more concerned with the community members who comment that C++ and TBB are too hard and see those as roadblocks in their efforts to learn about parallelism. I also want to keep the individual posts as short, separate morsels that people can assimilate in the midst of their hectic schedules.

To Sundar I'd say that I do intend to use the same scale on a log-scale graph to compare performance with the serial version, once I get that far. The parallel code is already written (several versions in fact, as posted previously) but my conceit in this blog series is to recapitulate the process I went through in developing and parallelizing the n-bodies program, taking care along the way to fill all the gaps that Jim mentions.

So thank you both for your comments. I hope you'll continue to review my posts and keep me honest. Thanks, guys.


Robert,

I wish to commend you on this blog. Too often we "propeller heads" jump from point A to point Z assuming all the in-between (B-Y) is commonly known. Your steps, including the stumbling blocks (assumed commonly known steps), are refreshing to see. Keep up the good work.

Jim Dempsey

www.quickthreadprogramming.com


Can you draw the performance graph on the same scale as your serial implementation's performance graph? Recently I turned my self-organizing map into a parallel one using Parallel advisor lite and TBB. A simple graph on a log scale helped me visualize the gain from parallel processing as the SOM started converging iteratively.