## Fast 3x3-counting

Everything about Sudoku that doesn't fit in one of the other sections

### Fast 3x3-counting

Up to now, the currently fastest 3x3-grid count has been done by the band by band method. The fastest implementation did the counting in 36.1 milliseconds (2GHz), and belongs to PatmaxDaddy from December 15 last year:
Code: Select all
`Band Counter   0.4Permutations   4.8Band 1         0.3Gangsters      3.6Gang lists     0.4Gang counts    0.2Band counts    2.2Grid count    24.2Overhead       0.1Total         36.1`

The crucial part of this method is the main counting which is an 3863552-size loop. This number is given by 44 (gangsters) * 56^3 (choices for boxes on second row) / 2 (band 2<->3 swapping).

I have a new counting method that does the job in 30.2 milliseconds, but this is on a 3GHz, so I still think PatmaxDaddy is best. But I'm sure this new method has a potential, particularly if PatmaxDaddy starts picking at my code...

The method is as follows, First choose the configurations for the columns in box 2 and 3, and the rows for box 4 and 7:
Code: Select all
`+---+---+---+|   |col|col|+---+---+---+|row|   |   |+---+---+---+|row|   |   |+---+---+---+`

This means that we specify which numbers to place in which column in box 2, but not in which order.

With the configuration given as above, the completion of box 1,2,3,4,7 is independent from the completion of box 5,6,8,9. Thus the two completions can be calculated independently and multiplied together, which is the strength here. There is also a degree of symmetry in the configuration set, which helps me to reduce them down to 2865 equivalence classes.

Box 1,2,3,4,7 completion: There are 36 choices for the column configuration of box 4, and then either 8 (9 out of 10) or 0 (1 out of 10) choices for a column configuration of box 1 and 7. Then we look up the number of band 1 completions from their column configuration as done in the band by band method. The average number of loops is then 36*8*9/10 = 259.2. With 2865 classes, the excepted number of loops becomes 742608. The actual number is 761680.

Box 5,6,8,9 completion: There are 56 choices for the row configuration of box 5 such that it fits with box 4, then the rows in box 6 are given. It is then 8*9/10 choices for the column configuration of box 5 and 8 (as for 12347-counting). Then, with the rows of box 6,7 and columns of box 3,8, we look up the number of completions. Average number of loops is 56*8*9/10=403.2. With 2865 classes, the expected number of loops is 1155168, actual number is 1145600.

In total, we have 1145600+761680=1907280 loops.

In addition, there are loops for creating the lookup-tables, so the total is about 2200000, much less than 3863552 for band by band counting.
kjellfp

Posts: 140
Joined: 04 October 2005

### Re: Fast 3x3-counting

Well, to call this method new is maybe not entierly true, as this is very much the B12347-method. But there is no brute force grid by grid counting needed here.

I should also put comments on why I use as much as 30.2 ms, when I only have half as many interations as for the band by band method.

As mentioned, there are two loops. The one used for 12347-counting is certainly (as I see it) compiled to use only registries. It looks that I have failed to acheive this for the second loop, though I thought I was doing it right. I think this is the part with the best possibilities for an improvement.
kjellfp

Posts: 140
Joined: 04 October 2005

The methods developed in the B2347-algorithm in this thread, together with Patmaxdaddy's trick of using floating types in the final count, have been applied on the ordinary band by band algorithm. This is now my current record for 3x3 grid counting:
Code: Select all
`Lookup table generation:    2.0 msGangster generation    :    0.5 msMain counting          :   14.2 ms=====Total                  :   16.7 ms`

This is on a 3GHz machine. The B2347-algorithm is now down to 19.4 ms. It can probably be improved further.
kjellfp

Posts: 140
Joined: 04 October 2005

Great work! I haven't had time to study your new 3x3 method, but I have been thinking some about 6xC and RxC in general. I'll post something in the band counting thread eventually.

Posts: 63
Joined: 18 October 2005

I'm down to 16.4 ms on the B2347-algorithm. The main ideas of the algorithm are by the way examined earlier.
kjellfp

Posts: 140
Joined: 04 October 2005

I haven't been following for a while and
I don't quite understand, how you can do this so fast !
When I did it with the 2865 B12347-classes, it took hours.

It seems now, that both methods (B123456,B12347) are of similar speed,
upto a factor of 2.
dukuso

Posts: 479
Joined: 25 June 2005

dukuso wrote:I haven't been following for a while and
I don't quite understand, how you can do this so fast !

Great to see you're back!

There are three jobs that have to be optimized in order to minimize the speed.

1. Equivalence class grouping
- We can always assume B3 to be the "canonical box", i.e. 123/456/789.
- We can then assume B2 to be one of the five "base boxes", as the action of number permutations leaving the configuration of the canonical box fixed, will group the 280 box configurations into 5 equivalence classes.

This has been mentioned by me and PatmaxDaddy, and probably several others have seen this as well. But what I did was to move even one step further:

- For a given base box B, we can let the permutations leaving both the canonical box and B fixed, act on the 280 box configurations to get equivalence classes for the choices for B4. For the different 5 choices of B, this gives in total 131 equivalence classes. That is, we get 131 equivalence classes of tripples of boxes (C1,C2,C3) under the action of permuting the elements within any minirow, permuting the rows in a box, and permuting the 9 symbols. (If we in addition also allow ourselves to permute the three boxes, we get the well-known set of 44 equivalence classes. This is used to get gangster grouping done quickly)

So I only need to work on a 131 x 280 - size set of choices for (B3,B2,B4) and B7. There are few automorphisms for a member in most of these cases, so I don't need very much time finding the size and mark the elements of each of the 2865 classes.

2. B12347 completion
With the columns of B3 and B2 given (happens early in the loops), I can find all coices for the rows of B1, and store the list. Then, when the rows of B4 and B7 are given, this is just a lookup job on ordinary band completion as for the band by band method.

3. B5689 completion
With the columns of B3 and B2, and the rows of B4 given, we can precalculate all choices for rows in B5 and B6, and columns in B5 and B8. Then for each choice of rows for B7, we have a lookup job again on the number of B689 completions in this configuration:
Code: Select all
`+---+---+---+|   |   |col|+---+---+---+|   |   |row|+---+---+---+|row|col|   |+---+---+---+`

Needed lookup-tables are of course precalculated.
kjellfp

Posts: 140
Joined: 04 October 2005

I've left cygwin and tried Visual C++ as a compiler. That gave some further improvement on 3x3-counting with the B2347-method:

Code: Select all
`Lookup table generation:  1.7 msGangster generation    :  0.3 msMain counting first    :  2.6 msLower right count      :  3.0 msMain counting second   :  7.2 ms=====Overall                : 14.8 ms`
kjellfp

Posts: 140
Joined: 04 October 2005

I'm losing track. Can you give side-by-side timings, with the same compilation and runtime environment, for your B2347 method, your/Guenter's earlier band generation method and whatever PatmaxDaddy's offering at the moment. Would be nice to be clear on which is the quickest.
Red Ed

Posts: 633
Joined: 06 June 2005

Well, I only tried the B2347-method with Visual C++. It now turns out that band by band is even faster.

So here you are. My two implementations on 3GHz CPU, with Microsoft Visual C++ Toolkit 2003:

Band by band:
Code: Select all
`Lookup table generation:  1.1 msGangster generation    :  0.3 msMain counting          : 11.9 ms=====Overall                : 13.3 ms`

B2347:
Code: Select all
`Lookup table generation:  1.7 msGangster generation    :  0.3 msMain counting first    :  2.6 msLower right count      :  3.0 msMain counting second   :  7.2 ms=====Overall                : 14.8 ms`

A bit odd, since the two programs were about equally fast on the cygwin/g++ compiler, with B2347 a tiny bit ahead.

PatmaxDaddy's downloadable code from December 7 last year doesn't compile properly in my environment. He has also improved the code since then. So I'm not able to compare at the moment.
kjellfp

Posts: 140
Joined: 04 October 2005

When I wrote the above one year ago, I was wondering why B2347 was slower than band by band, as it had half as many loops. I guessed the answer was hidden in the L2-cache size of the computer. I have now got my first computer with an 2 Mb L2-cache, and that made a big difference on the performace. With 2 CPUs of 1.83 GHz, I get

Code: Select all
`Lookup table generation: average   1.40 msGangster generation    : average   0.31 msMain counting          : average  12.50 ms=====Overall                : average  14.21 ms`

for the band by band alorithm, this was even worse than before beacuse the CPU here is slower. However, B2347 on the same computer went down to

Code: Select all
`Lookup table generation: average   1.87 msGangster generation    : average   0.31 msMain counting first    : average   2.03 msLower right count      : average   1.72 msMain counting second   : average   2.50 ms=====Overall                : average   8.43 ms`

which clearly is the fastest I've seen ever. So I guessed right last year, B2347 is the fastest, if you only have big enough cache in the memory.
kjellfp

Posts: 140
Joined: 04 October 2005