Comparison of Intel and GNU compilers and incomplete scalability analysis

Hardware
CPUs: 2 x Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz (8 cores per CPU + hyperthreading, 20MB cache)
Intel compiler
version: icc version 14.0.0 (gcc version 4.4.7 compatibility)
command line switches: -openmp -ipo
GNU compiler
version: gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC)
command line switches: -fopenmp -O3 -march=native

Tested tool
The tool does the following:
- Loads a list of puzzles from a file (serial code)
- Converts them to canonical form and removes duplicates (serial code)
- Removes N givens from each puzzle in all possible ways, converts the resulting sub-puzzles to canonical form and removes duplicates (serial code)
- M times adds a clue in all possible ways to all puzzles, checks for minimality, converts resulting puzzles to canonical form, checks for unique solution, and removes duplicates (OMP parallel code; a sketch follows below)
- Stores the puzzles to a file (serial code)
It accepts N and M as parameters, loads from stdin, and stores to stdout.
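For orientation, the parallel stage (the clue-adding step above) could be organized with OpenMP roughly as in the sketch below. All data types, buffer sizes and function names here are assumptions made for illustration; the tool's actual code is not shown in this post.
Code: Select all
/* Hypothetical skeleton of the OMP-parallel clue-adding stage.
   All types and helper functions are illustrative stubs. */
#include <omp.h>
#include <stdio.h>

#define MAX_OUT 1000
typedef struct { char grid[82]; } Puzzle;

/* Stubs standing in for the real solver routines. */
static int  addClueAllWays(const Puzzle *p, Puzzle *out) { (void)p; (void)out; return 0; }
static int  isMinimalWithUniqueSolution(const Puzzle *p) { (void)p; return 1; }
static void canonicalize(Puzzle *p)                      { (void)p; }

/* One of the M iterations: expand every source puzzle by one clue. */
static void expandOnce(const Puzzle *in, int nIn, Puzzle *out, int *nOut) {
    *nOut = 0;
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < nIn; i++) {
        Puzzle local[MAX_OUT];                     /* thread-private buffer */
        int n = addClueAllWays(&in[i], local);     /* all single-clue additions */
        int kept = 0;
        for (int j = 0; j < n; j++) {
            canonicalize(&local[j]);
            if (isMinimalWithUniqueSolution(&local[j]))
                local[kept++] = local[j];
        }
        /* Flush the survivors of one source puzzle in a single locked step. */
        #pragma omp critical
        for (int j = 0; j < kept && *nOut < MAX_OUT; j++)
            out[(*nOut)++] = local[j];
    }
    /* Duplicate removal across 'out' would follow here (serial). */
}

int main(void) {
    Puzzle in[2] = { {{0}}, {{0}} };
    Puzzle out[MAX_OUT];
    int nOut;
    expandOnce(in, 2, out, &nOut);
    printf("%d puzzles produced using up to %d threads\n", nOut, omp_get_max_threads());
    return 0;
}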

The test
The code is compiled with the Intel and GNU compilers into separate binaries.
The number of parallel threads is controlled by environment variable OMP_NUM_THREADS.
The timings of each run are measured once by the "time" command and once by the clock() function.
All runs are done with the same parameters N=4, M=5, and on the same dataset consisting of the two known 40-given minimal sudoku puzzles.
Runs with 1,2,4,8,16,32,64 and 33 parallel threads are done for both binaries.
All runs are executed from a shell script with a 5 second pause between consecutive runs.
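A side note on the measurement: on Linux, clock() returns the process CPU time summed over all threads, which explains why, in the results below, the clock() values track the user time rather than the real time. A tiny sanity check of this setup might look like the following sketch (illustrative only, not part of the tested tool).
Code: Select all
/* Verify that OMP_NUM_THREADS controls the team size and that clock()
   accumulates CPU time over all threads. Build: icc -openmp / gcc -fopenmp */
#include <omp.h>
#include <stdio.h>
#include <time.h>

int main(void) {
    clock_t start = clock();
    double sum = 0.0;
    #pragma omp parallel reduction(+ : sum)
    {
        #pragma omp single
        printf("threads in team: %d\n", omp_get_num_threads());
        for (long i = 0; i < 100000000L; i++)   /* CPU-bound busy work per thread */
            sum += (double)i;
    }
    printf("clock(): %.2f s (CPU time, all threads), sum=%g\n",
           (double)(clock() - start) / CLOCKS_PER_SEC, sum);
    return 0;
}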
The runs with up to 16 threads are expected to be linearly scalable since the hardware has 16 physical cores.
The runs with 32 threads are expected to give ~20% better results than those with 16, since the hyperthreaded logical cores share the ALUs.
The runs with 33 and 64 threads are expected to run a little slower than, or at the same rate as, 32 threads, since there are no IO operations and the hardware provides only 32 logical CPUs.
The full test was performed 3 times. The timings are similar, but instead of averages, the results from a single test (the first one) are analysed. About 10 partial tests have been executed in the 8-64 parallel thread range.
During the tests, no other CPU-consuming processes were run.

Results
Raw results are in the table below: the real, user and sys rows are the times reported by the OS time utility, and the clock() rows are the times reported by the binary itself.
The overhead rows are calculated on the assumption that the serial code takes 5 seconds and that there are sufficient HW resources for each parallel thread.
(The serial code timing has not been measured explicitly, but is estimated to be between 1.9 and 6.5 seconds.)
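For reference, the overhead figures appear to be consistent with the following simple model (a reconstruction under the 5-second assumption, not necessarily the exact formula used): the ideal real time for n threads is 5 s plus the single-thread parallel part divided by n, and the overhead is how much the measured parallel part exceeds that ideal. The quick check below reproduces the 71% reported for the Intel binary at 32 threads.
Code: Select all
/* Assumed overhead model, checked against the Intel 32-thread result below. */
#include <stdio.h>

int main(void) {
    const double serial = 5.0;       /* assumed serial part, seconds            */
    const double t1     = 915.308;   /* Intel real time with 1 thread  (15m15s) */
    const double t32    = 53.717;    /* Intel real time with 32 threads (0m54s) */
    double ideal    = serial + (t1 - serial) / 32.0;
    double overhead = (t32 - serial) / ((t1 - serial) / 32.0) - 1.0;
    printf("ideal real time @32 threads: %.1f s, overhead: %.0f%%\n",
           ideal, 100.0 * overhead);   /* prints ~33.4 s and 71% */
    return 0;
}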
The real time Intel/GNU ratio is obtained by dividing the time consumed by the Intel-compiled binary by the time consumed by the GNU-compiled binary.
The values in the picture, in textual form:
Code: Select all
OMP_NUM_THREADS                   1            2            4            8 16(physical)  32(logical)           64           33
Intel real               15m15.308s    7m40.323s    3m57.776s     2m6.339s    1m17.919s    0m53.717s    2m22.400s    1m40.729s
      user               15m14.504s   15m15.551s   15m37.508s   16m19.855s   19m43.421s   25m35.112s   32m20.851s    30m4.700s
      sys                  0m0.037s     0m0.056s     0m0.031s     0m0.067s     0m0.074s     0m0.308s   41m14.614s   16m34.348s
      clock(), s                915          916          938          980         1183         1535         4415         2799
      overhead                                0%           2%           7%          28%          71%
GNU   real               17m57.704s    8m59.814s    4m38.910s    2m27.984s    1m31.976s     1m8.790s     1m3.142s     1m5.969s
      user               17m56.760s   17m53.548s   18m19.413s    19m8.075s   23m14.564s   27m43.283s   30m59.307s    29m8.941s
      sys                  0m0.038s     0m0.034s     0m0.023s     0m0.035s     0m0.073s     0m0.118s     0m0.304s     0m0.145s
      clock(), s               1077         1074         1099         1148         1395         1663         1860         1749
      overhead                                0%           2%           7%          30%          90%
real time Intel/GNU             85%          85%          85%          85%          85%          78%         226%         153%
[Attachments: scalability_table.PNG, scalability_pic.gif]

Interpretation
In general, the Intel compiler consistently produces code that is about 15% faster.
This could partly be because heavy optimization work was done years ago on the solver's internal loops, and the Intel compiler was the one used at that time.
The only place in the code with an explicit difference (via a macro) is the unrolling of one 9-iteration loop, where extensive testing showed that the Intel compiler performs better with the loop unrolled, while the GNU compiler performs better without unrolling.
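As an illustration only (the real macro is not shown here), such a compiler-dependent unrolling switch could look like this:
Code: Select all
/* Hypothetical sketch of a compiler-dependent unrolling macro for a
   9-iteration loop; names and structure are illustrative, not the tool's code. */
#include <stdio.h>

#ifdef __INTEL_COMPILER
/* Intel: the fully unrolled body proved faster in testing. */
#define FOR_EACH_DIGIT(body) do { body(0); body(1); body(2); body(3); body(4); \
                                  body(5); body(6); body(7); body(8); } while (0)
#else
/* GNU: the plain loop proved faster, so let the compiler decide. */
#define FOR_EACH_DIGIT(body) do { for (int d_ = 0; d_ < 9; d_++) body(d_); } while (0)
#endif

int main(void) {
    unsigned mask = 0x1F5;          /* example bitmask of candidate digits */
    int count = 0;
#define COUNT_BIT(d) (count += (mask >> (d)) & 1)
    FOR_EACH_DIGIT(COUNT_BIT);      /* unrolled under icc, looped under gcc */
    printf("candidates: %d\n", count);   /* prints 7 for this mask */
    return 0;
}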
At the hardware thread limit (32 threads), the Intel compiler performs 22% faster, but neither binary performs well there.
The overhead is unexpectedly high, at least for 16 threads and above.
Later attempts were made to minimize the time spent in locks. They yielded a few percent better performance even for a single thread, but left the scalability virtually unchanged.
The tested code flushes, in one go, all puzzles suitable for output or for the next iteration that were obtained by adding a clue to a single puzzle. An experiment that flushed the results of 100 source puzzles at once gave worse results. Flushing every new puzzle immediately gave better results, but no measurable change in scalability.
Intel's binary performs very badly when overloaded. Even raising the enforced thread count from 32 to 33 is a disaster, degrading performance by almost a factor of two. The user time increases somewhat, but the real blow-up comes from system time.
The GNU binary actually performs better with more threads: user and system times increase slightly, but the real time decreases. This is presumably explained by reduced idle time.
One explanation of this paradox is that the compilers use different OMP libraries, or at least different default values for the policies used by the library.
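One cheap way to test that hypothesis would be to compile a small probe with both compilers and compare the runtime defaults they report. The sketch below uses only the standard OpenMP API; it does not cover vendor-specific knobs.
Code: Select all
/* Print a few OpenMP runtime defaults; build the same file with icc and gcc
   and diff the output to spot differing library policies. */
#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_sched_t kind;
    int chunk;
    omp_get_schedule(&kind, &chunk);
    printf("max threads : %d\n", omp_get_max_threads());
    printf("dynamic     : %d\n", omp_get_dynamic());
    printf("nested      : %d\n", omp_get_nested());
    printf("schedule    : kind=%d chunk=%d\n", (int)kind, chunk);
    return 0;
}
The wait policy might also be worth checking (the standard OMP_WAIT_POLICY variable, plus KMP_BLOCKTIME for Intel's runtime and GOMP_SPINCOUNT for libgomp), since idle-thread spinning could show up as the kind of user/system time blow-up seen above.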
In only one test (out of more than 10) with 33 threads did Intel's binary behave just like the GNU one, slightly increasing user/sys time but decreasing the real time.
A modern OMP runtime can monitor the CPU load and automatically reduce the load generated by its own threads. This behavior was outside the scope of these tests.

Conclusions
The Intel compiler performs 15% better on a single thread and 22% better at the default number of threads (= the number of logical CPUs).
For the binary compiled with Intel's compiler, if the number of threads is forced to the number of logical CPUs, any other CPU-consuming process can drastically degrade performance.
Leaving the runtime to determine the number of parallel threads could save the process from the observed overloading disaster.
The GNU compiler demonstrated better behavior in the scenarios that simulated a user who doesn't know what he is actually doing.