## Unbiased grid generation

Everything about Sudoku that doesn't fit in one of the other sections

### Re: Unbiased grid generation

holdout wrote:Do you think that starting with four digits is a realizable goal?
I don't think it's a good goal to be shooting at. I've nearly finished a much faster method that uses the 2865 B2347 classes instead. This is, I believe, what dukuso was alluding to when he started this thread a year and a half ago, although he wasn't specific about the equivalence relation he had in mind at the time. Will post something intelligible about that in a week or so (had stopped working on it, actually, because I thought interest was low.)
Red Ed

Posts: 633
Joined: 06 June 2005

### Re: Unbiased grid generation

Red Ed wrote:Finally, an application. We'd like to know how many proper (solvable, not necessarily minimal) sudoku puzzles exist. To answer this, we generate grids in an unbiased manner and, for each one, try to solve a random subgrid (that is, clear each cell independently with probability .5 then ask if what's left has a unique solution). Let P be the probability that a puzzle generated this way is proper; then the number of proper puzzles in total is P * 6.67e21 * 2^81. In 178100 trials, I found 65325 proper puzzles, so a 95% confidence interval for the total number of proper puzzles is [ 5.88e45, 5.95e45 ] to 3 s.f.

this relates to the canonical form thread:
would it be possible to design the trials based on an unbiased stream of min-lex canonical grids?
gsf
2014 Supporter

Posts: 7306
Joined: 21 September 2005
Location: NJ USA

I think that getting the unbiased stream would be a pain: you'd have to weight each grid by 1 / (nr automorphisms).

Better just to resample each grid lots of times and adjust the confidence intervals accordingly. With say 1 million unbiased grids (which I happen to have lying around anyway) and say 1000 puzzle-testing trials on each, it should be possible to get much tighter bounds on the overall number of puzzles than I gave previously. It's just a question of grinding through CPU cycles and being a little bit careful with the stats.
Red Ed

Posts: 633
Joined: 06 June 2005

Red Ed wrote:I think that getting the unbiased stream would be a pain: you'd have to weight each grid by 1 / (nr automorphisms).

I meant reformulating the answer around the 5e9 essentially different grids instead of the 6e21 total
5e9 is small enough to sample 1e5 random 0.5 filled grids on each

I'm trying to see how automorphisms come into play
they don't affect the
Code: Select all
`6.67124817229145899e+21 = (5472730538) * (9*8*7*6*5*4*3*2*1) * (6*6*6*6*6*6*6*6) * (2)`

grid counting, but, aha, they do affect valid puzzle (partial grid) counting, because for every
puzzle projected on a solution grid with > 1 automorphisms there are #automorphisms
equivalent puzzles

so keeping the #automorphisms for each of the 5e9 essentially different grids should be
enough to normalize the samples
gsf
2014 Supporter

Posts: 7306
Joined: 21 September 2005
Location: NJ USA

gsf wrote:I'm trying to see how automorphisms come into play
they don't affect the
Code: Select all
`6.67124817229145899e+21 = (5472730538) * (9*8*7*6*5*4*3*2*1) * (6*6*6*6*6*6*6*6) * (2)`
grid counting
Actually yes they do affect the grid count. The formula is:
Code: Select all
`nr grids = 6670903752021072936960 = 5472730538  * 9! * 6^8 * 2 / A`
where A ~= 1.00005163 is the average number of automorphisms among all solution grids.

(... pause for a commercial break ...)
And now back to the main topic. Conscious that gsf will probably have tabulated all of sudoku in a posix-approved way before I finish typing this, I have finally got around to implementing the B2347 method for unbiased solution grid generation. It flies: I can now produce >10000 grids/sec, which is a thousand-fold improvement on the method described at the start of this thread. The slightly tricky part is in converting a B2347 "gangster" (one of the 2865) into a randomly-chosen realisation of four mutually-compatible boxes. I can post details if anyone wants, or you can just wait for gsf's DVD to come out (please don't read that the wrong way!)
Red Ed

Posts: 633
Joined: 06 June 2005

Red Ed wrote:
gsf wrote:I'm trying to see how automorphisms come into play
they don't affect the
Code: Select all
`6.67124817229145899e+21 = (5472730538) * (9*8*7*6*5*4*3*2*1) * (6*6*6*6*6*6*6*6) * (2)`
grid counting
Actually yes they do affect the grid count. The formula is:
Code: Select all
`nr grids = 6670903752021072936960 = 5472730538  * 9! * 6^8 * 2 / A`
where A ~= 1.00005163 is the average number of automorphisms among all solution grids.

I knew I was missing something -- thanks for filling it in
so the number of grids for one grid with a automorphisms is (9! * 6^8 * 2 / a)

(10000 grids/sec is great)
gsf
2014 Supporter

Posts: 7306
Joined: 21 September 2005
Location: NJ USA

Unbiased Grid Generation using Band Signatures

After a period of inactivity, I recently learned of the big system crash and was disappointed to find that all of my posts on the above topic were lost. This disaster, however, gives me an opportunity to present the topic in a more cohesive fashion.

Ed Russell was kind enough to test my method using his "Z-score" set of statistical diagnostics. I sent Ed the source code and a very big pre-computed data file. He generated 10,000,000 grids for analysis. The challenge of this thread was fulfilled when Ed declared the results to be unbiased. Regretably, these results were lost in the crash.

The `C' routines generate about 43,000 grids per second on my home computer (Intel Core 2 Quad CPU Q8200 @ 2.33GHz; 64-bit OS). A large data file (115,934 KB) is also needed.

Band-1 Generation

Band-1 is chosen from a precomputed set of 416 class representatives, where row-1 is fixed to be 1,2,3,4,5,6,7,8,9. Each representative (together with its "spins") covers a non-overlapping portion of the space of all possible bands. The class choices are not equally probable, but are directly proportional to the number of grid completions. See ideas presented in "Enumerating possible Sudoku grids" by Felgenhauer and Jarvis. Also, see "Summary of method and results."

Code: Select all
`  1 2 3  4 5 6  7 8 9   Sample band-1 class representative.  6 7 8  3 9 1  5 2 4  9 5 4  2 8 7  3 6 1`

Row Order Restrictions

To simplify generation, some row order restrictions are enforced. Let R1, ..., R9 represent the first digit in each of the nine rows, respectively. The restrictions: R1 < R2 < R3, R4 < R5 < R6, R7 < R8 < R9, and R1 < R4 < R7.

For the sample band-1, this is means that R4-R5-R6 must be one of the following ten choices: 234, 235, 237, 238, 245, 247, 248, 257, 258, 278.

Band Signature Defined

Band-2 is generated based upon band "signatures" and the number of grid completions for each signature. The "signature" of a band is a 9-long array of digits which appear in each column. For convenience, the three digits in each column are concatenated in ascending order and listed as a single quantity. The following is one of the many band-2 choices consistent with the above band-1 representative:
Code: Select all
`  2 6 9  5 4 3  8 1 7   Sample band-2.  3 4 7  8 1 2  6 9 5     8 1 5  6 7 9  4 3 2  Band-1 signature = { 169, 257, 348, 234, 589, 167, 357, 268, 149 }  Band-2 signature = { 238, 146, 579, 568, 147, 239, 468, 139, 257 }  Band-3 signature = { 457, 389, 126, 179, 236, 458, 129, 457, 368 }  (implied)`

Band Signatures and Grid Completions

For every band-1 choice, there are exactly 87808 band-2 signatures which follow the row order restrictions. File "Signature.bin" holds all 36,528,128 (= 416 * 87808) signatures, stored in compressed form with 26-bits per signature, for a total of 118,716,416 bytes.

Let's count the number of grid completions for all band-1/signature-2 combinations. For a particular combination, we know the implied band-3 signature and must count the number ways the puzzle can be finished which satisfy the row ordering restrictions. Some of you may already know that the number of completions is limited to sixteen values: 16, 20, 24, 26, 28, 30, 32, 36, 38, 44, 46, 48, 86, 96, 144 and 288.

Before storage, the signatures are sorted by class number (1-416) and number of completions, making subsequent probability calculations and look-up much simpler.

Band-2 and Band-3 Generation

Given our band-1 representative, according to a weighted distribution (grid completions), choose one of the 87808 band-2 signatures. The number of permutation triples for this signature is already known, having been previously computed and stored. Suppose you randomly choose the i'th one; now, begin constructing fills and stop when the i'th fill is reached. This is the most time-consuming routine.

Compute the band-3 signature and follow the same procedure. (Actually, band-2 and band-3 construction could be done in parallel -- if available.)

Spin and Re-label

The final step is to randomly interchange rows within bands, columns within chutes, row bands and column chutes. Digits are also randomly re-labeled.

This is the short-version of how it works. There were a number of interesting discoveries as the algortihm developed. More later.
holdout

Posts: 35
Joined: 30 August 2005
Location: Bowie, Maryland USA

### Re: Unbiased grid generation

fist decide whether to take a grid with nontrivial automorphism-group
if positive, then just look it up in the list

take a grid from one of the 306693 G-classes probability according to their calculated sizes
permute the mini-columns randomly (look-up table)
reject it if it is not minimal or has nontrivial automorphismgroup and start again
dukuso

Posts: 479
Joined: 25 June 2005

Compressing Band Signatures

In my last post, I said that: "File `Signature.bin' holds all 36,528,128 (= 416 * 87808) signatures, stored in compressed form with 26-bits per signature, for a total of 118,716,416 bytes." Oops, only 25 bits are needed.

We need to realize that:
1) Each signature contains redundant information. Only columns 1,2,4,5,7,8 need be saved since columns 3,6,9 can easily be re-computed.
2) Each signature is coupled with one of the 416 band-1 representatives. This means that, in a column, the 3 signature digits must be drawn from the 6 digits not used in that column of band-1.

Suppose column-1 of band-1 is {1,6,9}, implying that column-1 of band-2 signatures must be drawn from {2,3,4,5,7,8}. Specifically, the signature element must be one of 20 choices:
234, 235, 237, 238, 245, 247, 248, 257, 258, 278,
345, 347, 348, 357, 358, 378, 457, 458, 478, 578.
So, here is the trick: save the 0-up postion of the element, not the element itself. For example, if we with to save element {238}, we instead save '3'. Because of ordering restrictions, column-1 positions are in [0,9], while other column positions are in [0,19].

Using the following form, the maximum value for a compressed signature is a 25-bit quantity.
Signature = 3200000*P1 + 160000*P2 + 8000*P4 + 400*P5 + 20*P7 + P8
Minimum {Signature} = 0
Maximum {Signature} = 31999999 = 172043777 (octal)
holdout

Posts: 35
Joined: 30 August 2005
Location: Bowie, Maryland USA

### Re: Unbiased grid generation

My table is about 330KB uncompressed. It consists of lines like this:
Code: Select all
`2019 b347 30667 cdf 385225243614720 ext 107 200 391 573 686 809 940 1102 1236 1444 1604 1767 1906 2188 2302 2642`
Not sure what the memory footprint is when read-in to my generator's data structures, but it will be pretty small.
Red Ed

Posts: 633
Joined: 06 June 2005

### Re: Unbiased grid generation

To RedEd: I have never been happy with the amount file space my method uses. I did try to find ways to avoid storing so many signatures, but was not clever enough to find sufficient isomorphisms or an on-the-fly generation method. I do believe the compression scheme is distinctive enough to warrant a note. If I had orginally noted that only 25 bits per signature are needed, file space would have been 11475 KB (= (25/26)*115934 KB), only marginally smaller.

Question: In your method, do you input (from a file) your entire table into main memory once, or access portions as needed? Of necessity, I have one binary read (4 bytes) per grid.
holdout

Posts: 35
Joined: 30 August 2005
Location: Bowie, Maryland USA

Band Signatures -- Counting by Proxy

Extensive counting of grid completions was necessary to calculate signature probabilities. Suppose band-1 has been chosen and we are now iterating thru choices for band-2 signatures. For each choice we must count the number of ways in which band-3 can be completed. If this done by actually constructing the possibilities, it is very expensive.

A better way is to use the band-3 signature to compute some inexpensive statistic, and then correlate the statistic to the known answer. The trick is to find a statistic which correlates 100% of the time. The desired statistic must be invariant to chute swaps and column swaps within a chute.

An early attempt to produce such a statistic worked great! The statistic took on 41 different values and these values correlated with 100% accuracy to the 16 grid completion choices.
Code: Select all
`stat = Sum    { Sum     { Sum    (n = Hits[i,j] * Hits[i,k] * Hits[j,k]) * (n-1) } } }       i=1,2,3  j=4,5,6   k=7,8,9     where Hits[x,y] is the number of common digits in columns `x' and `y'                     (0 <= Hits[x,y] <= 3)`

The following `C' code implements this algorithm.

Code: Select all
`/*******************************************************************************            Return Number of Configurations of Permutation Triples,                   given the Signature of a Row BandNote: Three of the nine low-order bits of each signature[i] are set to 1,      one bit for each value in the column.*******************************************************************************/int SignatureTriples (int signature[9]){  int SigStatTriples[41] = {    20, 20, 28, 26, 24, 28, 30, 38, 32,  28, 36, 16,  38, 46, // 13    20, 24, 28, 28, 28, 24, 48, 44, 86,  24, 32, 16,  32, 16, // 27    48, 48, 32, 32, 32, 32, 32, 32, 32, 144, 96, 96, 288 };   // 40  int SigStatList[41] = {     54,  66,  72,  82,  84,  86,  92, 100, 102, 104, 106, 108, 114, 128, // 13    130, 138, 142, 150, 152, 160, 168, 178, 180, 192, 194, 196, 202, 222, // 27    248, 262, 280, 286, 336, 368, 372, 410, 504, 880, 936,1092,2268 };    // 40  // Other declarations  static int t[1108] = {0};  int stat, i, j, k, n;  // Onetime code: ompute look-up table.  if ( t[0] == 0 )  { for ( i=1; i < 1108; i++ ) t[i] = 0;    for ( i=0; i < 41; i++ ) t[ (SigStatList[i]-54) >> 1 ] = SigStatTriples[i];  }  // Compute statistic  for ( stat=i=0; i < 3; i++ )  { for ( j=3; j < 6; j++ )    { for ( k=6; k < 9; k++ )      { n = PopCount[ signature[i] & signature[j] ] *       PopCount[ signature[i] & signature[k] ] *       PopCount[ signature[j] & signature[k] ] ;        stat += n*(n+1);      }    }  }  // Error check  if ( stat < 54 || stat > 2268 || t[ n=(stat-54)>>1 ] == 0 )  { fprintf(stderr,"*** SignatureTriples(): Fatal error.\n");    fprintf(stderr,"    Statistic \"%d\" not found in list.\n",stat);    exit(1);  }  // Return number of configurations  return( t[n] );} `
holdout

Posts: 35
Joined: 30 August 2005
Location: Bowie, Maryland USA

### Re: Unbiased grid generation

holdout wrote:In your method, do you input (from a file) your entire table into main memory once, or access portions as needed?
The former.
Red Ed

Posts: 633
Joined: 06 June 2005

Previous