I've finally had a chance to read carefully the descriptions by Geunter, KJell, and Ed in this thread and on Wikipedia of the Guenter/Kjell method, and I've also taken a brief look and Ed's code. The version I've coded is definitely equivalent to that described by Guenter on July 7. I can't quite tell if it is also equivalent to the specialization of Guenter described by Kjell on Oct 5. I do generate his list of 5 choices for S4,S5,S6, but I'm not sure if I use them for the same purpose and in quite the same way.
My implementation differs from Ed's in various ways, some more and some less significant. I do not use the 0..83 coding for minicolumns at all, and have no equivalent for his colBits and colcompat arrays. In fact I don't represent minicolumns at all as distinct quantities, only boxes. Each box is represented as either a 27-bit triplet of three 9-bit minicolumns, or as a 0..279 code. I convert from code to triplet by lookup, and from triplet to code by a binary search of a sorted list of triplets.
My 56x56 inner loop is almost identical to Ed's, except that I don't use the nsoln array, I put those values directly in my equivalent of his cmatrix. This avoids a level of indirection. Also, my k is a double instead of an int (by coincidence I use the same variable name), and my equivalent of cmatrix (which holds solution counts and not class id's) is a float instead of an int. Doing these things has a huge effect on speed, because for most modern processors one of the most expensive things one can do is fetch data from memory (even if it hits the data cache).
Avoiding an extra level of indirection is a clear advantage, but the use of floats is subtle and is primarily a issue on Intel architecture (IA) machines. The problem is that IA has a severly limited number of general registers (effectively 6 for complied code). Modern compilers are excellent at using them efficiently, but once a loop exceeds 6 the data it needs spills over to the stack, and if the data is read/write (e.g. k) it's going to cost two memory operations per iteration. The transition from needing 6 to needing 7 items of data is very costly. The CPU, however, has floating point registers that are separate from the general registers, and if k is a float, and the data being fetched from cmatrix are floats, everything the loop needs now fits. Furthermore, the floating point unit runs in parallel with the integer unit. Note that the cost of a floating point multiply is way less than the cost of spilling data onto the stack. (For old guys like me who grew up programming PDP/1's and PDP/8's, this is hard to get used to).
Another difference is that my version is written in C++, and so all the main data structures are wrapped in classes instead of being static arrays. This keeps the code that generates the data proximate to the declaration of the data, and provides a class interface that clarifies how the data are used. Some will find the result easier to understand (assuming I've done a reasonable job writing the code), while others prefer just straight C. This is perhaps a matter of taste, and I certainly do not wish to start a C/C++ debate on this thread.
As I said in an earlier post, I'm going to make the code available somehow as soon as I get a chance to clean it up a little and write some comments.
For the record, here is my current timing data:
- Code: Select all
Box cross 2.8 (a table needed to compute the solutions for each gangster)
Permutations 5.7 (permutation tables needed to compute the gangsters)
Band 0 0.5 (a table needed to work with band 1)
Gangsters 7.0 (compute the 44 gangsters)
Gang lists 0.4 (compute lists of compatible boxes for bands 2,3)
Gang counts 10.1 (compute number of solutions for each gangster)
Band counts 4.2 (create lookup table of gang counts)
Grid count 24.0 (count all of the grids)
Overhead 0.1
Total 54.7
Times are in ms using the clock-tick resolution Pentium hardware timer. Note that Windows XP is running in the background, and who knows what effect that has.