The New Sudoku Players' Forum

by **Red Ed** » Mon Jan 01, 2007 11:38 pm

coloin wrote:Your penultimate grid ... appears to be unique for now in that it does not have any U6s.

A random search seems to turn up U6-free grids once every 150000-or-so solutions, so they're rare but not super-rare. (I may have made a mistake with my program though, so not sure if we can really trust this figure.)

I hope to send JPF a rather larger sample of grids generated by dukuso's program - to see how near/far off the mark his program is.

Here's what my code says about suexg. Seems consistent with JPF's E(X) value. In the table below, bias and Z-score are as described in my previous post, and Z_1M is the Z-score scaled to 1 million grids (which is the sample size in this case anyway, so no different to the Z-score itself).

Code: Select all: +-----------------------------+----------+-----------+----------+ | Pattern | Bias | Z-score | Z_1M | +-----------------------------+----------+-----------+----------+ | .....2..8..............8..2 | -7.86% | -105.38 | -105.38 | | .87...1.22.........6124875. | -7.58% | -103.26 | -103.26 | | .34.9.87..5......289..35.6. | -7.58% | -103.26 | -103.26 | | 6.2..4....79..8.65.58..1..7 | -8.08% | -101.94 | -101.94 | | 3.1..975..4..6...3.97.2.1.. | -8.07% | -101.42 | -101.42 | | 318..79...4......77.59.3.2. | -8.64% | -100.93 | -100.93 | | 218.37...39..24..7.......8. | -8.13% | -100.21 | -100.21 | | ..2.9.1.....4....2..9.21... | -7.07% | -99.89 | -99.89 | | 6...8..7...87.2.1.7..3...6. | -8.37% | -98.74 | -98.74 | | .7.93.2...82.6.....9.7.1..3 | -7.04% | -98.60 | -98.60 | | .9..3.586.34.172..7.2...4.. | -8.47% | -98.08 | -98.08 | | .1.5...6..6.....1.....6.2.5 | -8.07% | -97.27 | -97.27 | | .14..7...5....268.29.6.5.7. | -8.96% | -97.01 | -97.01 | | 578.613...3....764.2..9..1. | -8.41% | -96.98 | -96.98 | | .8..745.15...1.2.797.6...34 | -9.24% | -96.60 | -96.60 | | 7132..65......8.41..65.9.2. | -9.23% | -96.54 | -96.54 | | 8..1...4.19.8..6.23.2.9...7 | -8.20% | -95.67 | -95.67 | | 1.3....869..4..21.2..7..94. | -7.73% | -95.14 | -95.14 | | ..73...454...7..6..351..9.7 | -9.03% | -95.12 | -95.12 | | ..5.2.4.11...453...7.91..82 | -8.39% | -94.59 | -94.59 | +-----------------------------+----------+-----------+----------+

Interesting that you/JPF have tabulated the distribution of the number of U4s. When I have more time, I'll do the same for 1 million unbiased grid samples.

by **gsf** » Tue Jan 02, 2007 7:26 am

Red Ed wrote:what it's saying is that the size-4 unavoidable ...
Code: Select all
...|..2|..8 ...|...|... ...|..8|..2
... occurs far less frequently than expected.

before I dig into the randomizing logic could you check my solver for a size
4 unavoidable pattern with two low digits, say 1,2 or 2,3
there is one loop with a low-high digit bias, so I would expect 1,2 to be on
the other end of 2,8, and also 8,9 should be worst on the other end

what is the expected frequency for this pattern per 1M grids?

by **Red Ed** » Tue Jan 02, 2007 12:05 pm

Expected frequency is 11.57888 per grid.

Here is the bias (for your current code) per digit-pair, with (1,2) at the top left of the table, (9,8) at the botom right:

Code: Select all: -8.68% -8.22% -8.26% -8.45% -7.70% -8.06% -7.92% -8.01% -8.68% -8.29% -8.02% -8.27% -7.97% -7.98% -7.46% -7.52% -8.22% -8.29% -7.98% -7.86% -7.85% -7.71% -7.41% -7.61% -8.26% -8.02% -7.98% -7.72% -8.02% -7.56% -7.90% -7.37% -8.45% -8.27% -7.86% -7.72% -7.54% -7.61% -7.37% -7.28% -7.70% -7.97% -7.85% -8.02% -7.54% -7.62% -7.39% -7.14% -8.06% -7.98% -7.71% -7.56% -7.61% -7.62% -7.40% -7.40% -7.92% -7.46% -7.41% -7.90% -7.37% -7.39% -7.40% -7.00% -8.01% -7.52% -7.61% -7.37% -7.28% -7.14% -7.40% -7.00%

So, yes, low pairs are slightly worse than high pairs.

by **JPF** » Tue Jan 02, 2007 12:24 pm

Red Ed wrote:OK, what do I really mean by that? My testing currently works as follows. I've got a library of ~3000 single-band clue patterns chosen sort-of randomly. For each pattern, p, I know the mean, mean(p), and an upper bound, sd(p), on the standard deviation of the number of those patterns that exist in a perfectly randomly chosen solution grid. Given a test file of n (default: 1000000) solution grids generated by some algorithm, I count the number of patterns of each type, count(p). The %bias for pattern p is then 100 * ( count(p)/(n*mean(p)) - 1 ) and the Z-score, i.e. the corresponding point of the Normal distribution, is ( count(p)/sqrt(n) - sqrt(n)*mean(p) ) / sd(p).

In order to make a test on my own grids-generator, could you give me the exact values for mean(p) and sd(p) for each of the 20 bands patterns you have selected.

Thanks.

JPF

by **Red Ed** » Tue Jan 02, 2007 1:49 pm

JPF, I can't give you those figures easily because my program doesn't work exactly as I described. It effectively does what I said, but the implementation actually counts the number of bands/stacks in each of the 416 isomorphism classes and deduces equally-scaled versions of count(p), mean(p) and sd(p) from there. (If anyone cares, I can describe that more fully some time.)

I was intending to put my code + patterns file somewhere that's accessible to all. Suggestions would be appreciated for suitable file-sharing sites. But that won't happen for at least the next couple of weeks, so in the interim I can't help. Sorry.

gsf, a clarification: when I said there were 11.57888 of those patterns per grid, I mean up to isomorphism. Put another way: you expect 11.57888 "unique rectangles" (yuck!) of one orientation or another per grid. The bias that I reported was for all URs, not just band-URs on digits 2/8.

by **Red Ed** » Tue Jan 02, 2007 4:39 pm

in response to coloin, I wrote:Interesting that you/JPF have tabulated the distribution of the number of U4s. When I have more time, I'll do the same for 1 million unbiased grid samples.

As promised, from 1 million suexg and unbiased samples:

Code: Select all: #X suexg unbiased -- ----- ------- 0 113 65 1 618 297 2 2439 1282 3 7069 3724 4 15975 9568 5 30670 19497 6 50533 34055 7 72242 52886 8 93204 73819 9 109003 92298 10 114851 106034 11 112922 112027 12 102720 109149 13 85591 99187 14 67966 83836 15 49713 66808 16 34222 49319 17 21822 34402 18 13425 22560 19 7579 13615 20 3849 7805 21 1999 4205 22 883 2027 23 345 903 24 160 408 25 61 142 26 17 54 27 6 17 28 0 9 29 2 1 30 1 1 31 0 0 32 0 0 33 0 0 34 0 0 35 0 0 36 0 0

I don't know what the true probabilities are for each #X, but they should be close to the values implied by the righthand column.

by **JPF** » Wed Jan 03, 2007 8:00 pm

suexg :
E(X) = 10.67 is the same number we got with 500000 grids.
which is OK :
|m-E(X)|< 2s/sqrt(n) [P = 95%]
as s~3.4 ; n = 500000 is large enough to have an absolute error ~ 0.01

Unbiased generator :
11.570 ≤ mean(X) ≤ 11.584 [P=95%] which is consistent with 11.5788

Back to the challenge :

Red Ed wrote: Challenge: given a perfect(*) random number generator, can you produce a perfect random sudoku grid generator
(*: perfect = no bias, correlations etc.)

Some thoughts :

A perfect random sudoku grid generator would consist of the followings :
-The list of the set Sg of the different grids. |Sg|= Ng = 6670903752021072936960 = 6.670904 E21
-A perfect uniform random integer generator giving an integer between 1 and Ng.

As it seems impossible to get (and to store) the Ng grids, one could limit the list to a set Se of essentially different Sudoku grids.
|Se|= Ne = 5472730538
Let G be a random grid from Se.
Let f be a random transformation of the symmetry group :
f is generated by a combination of independent random transformations r, t, p
where :
r is a relabelling (9!)
t a transposition (2)
p a permutation (6 x 6 x 6^3 x 6^3)
By using r, t, p it's easy to generate an uniform random f among the nt = 9! x 2 x 6^8 = 1.218998 E12 transformations.

The collection Sf of f(G), when G describes Se, includes Sg.
The collection Sf has duplicates.
|Sf| = 9! x 2 x 6^8 x Ne = 6.671248 E21 > Ng

The relative error (bias ?) is therefore e = (|Sf| - Ng) / Ng = 5.16 E-05.

Example : Estimation of P = proportion of valid puzzles.
Let P' be the proportion of valid puzzles in the "duplicate" grids.

We will observe P* instead of P, such that :
P* = (1 - e)P+ eP'
and
|P*- P|= e| P'- P|< 5.16 E-05 [as P and P' <1].

But is it easy to build a list of Se ?
With 10 bytes/grid (see RW) we would need around 55 x 10^9 bytes before any compression to store the file.

JPF

by **Red Ed** » Wed Jan 03, 2007 8:48 pm

JPF, I like your idea of accepting a small bias (essentially by pretending that no automorphic grids exist, i.e. that Sf(G) has no duplicates) in order to cut the problem down to a more manageable size. I hope I understood that correctly.

But I'm still interested in perfect generators and would like to beat my very slow 10 grids/sec method. I think I know how, and that's by using the tricks of the "B2347" solutions-counting method. Basically: pick B2347 with appropriate weight (easy with a lookup into a 2865-long table), enumerate all B1 solutions up to a random stopping point, enumerate all B5689 solutions up to a random stopping point, apply a random validity-preserving group op. I hope to get 1000+ grids/sec from that, though I've yet to prove, code or test the method so I could be talking out of my hat ...

by **gsf** » Thu Jan 04, 2007 9:59 am

Ocean wrote:Illustration of a biased grid generator: Pick randomly an empty cell; Assign randomly a valid digit to that cell; (repeat until grid is filled).

"Every time (*)" this is reached:
(*) (happens less than once per life span of our galaxy on my computer)

..4.56789.57.89246689247135..3564897948723561765891423591438672432675918876912354

there are three possible resulting grids:

124356789357189246689247135213564897948723561765891423591438672432675918876912354 # p=26/168
214356789357189246689247135123564897948723561765891423591438672432675918876912354 # p=71/168
324156789157389246689247135213564897948723561765891423591438672432675918876912354 # p=71/168

The chance of getting to the first grid is ~16%, while for each of the other two ~42%.
So, it is suspected (expected) that certain grid types will be overrepresented compared to others, when using this particular method for grid generation.
#

I'm rereading the thread from the top and got stuck here
how are the { 26 71 168 } derived?
if you look at the pm grid for the example

Code: Select all: 123 12 4 | 13 5 6 | 7 8 9 13 5 7 | 13 8 9 | 2 4 6 6 8 9 | 2 4 7 | 1 3 5 ------------+------------+------------ 12 12 3 | 5 6 4 | 8 9 7 9 4 8 | 7 2 3 | 5 6 1 7 6 5 | 8 9 1 | 4 2 3 ------------+------------+------------ 5 9 1 | 4 3 8 | 6 7 2 4 3 2 | 6 7 5 | 9 1 8 8 7 6 | 9 1 2 | 3 5 4

there are only 15 choices, each of which fixes the remaining choices
the choices are distributed over 7 cells
call the 3 possible grids {1 2 3 } based on the cell [11] assignment
the choices, in row order, result in these grids

Code: Select all: { 1 2 3 } { 2 1 } { 3 1 } { 3 1 } { 1 3 } { 2 1 } { 1 2 }

so the probability of getting each grid is

Code: Select all: 1: (1/7)*(1/3) + (6/7)*(1/2) : (2 + 18) / (7 * 3 * 2) : 20 / 42 : 48% 2: (1/7)*(1/3) + (3/7)*(1/2) : (2 + 9) / (7 * 3 * 2) : 11 / 42 : 26% 3: (1/7)*(1/3) + (3/7)*(1/2) : (2 + 9) / (7 * 3 * 2) : 11 / 42 : 26%

as grids get filled in the remaining cell independence decreases
to the point, as this example illustrates, where all cells are dependent
its this creeping interdependence that bothers me about the averaging in the unbiased analysis

also, wouldn't a random grid generator be biased if it didn't return grid 1 48% of the time?

(or maybe all those stats classes I skipped have finally come home to roost)

by **Ocean** » Thu Jan 04, 2007 1:29 pm

gsf wrote:there are only 15 choices, each of which fixes the remaining choices
the choices are distributed over 7 cells

Thanks for checking (and 'correcting') the numbers. - Although the above statement must be wrong: only nine of those 15 choices will fix the grid, while the remaining six choices leave the grid unfixed, which leads to a second choice. My earlier 'back-of-an-envelope' calculations might have been too hasty, but here is what I get now:

Code: Select all: 1: (1/7)*(1/3) + (6/7)*(1/2)*(1/2) : (4+18)/(7*3*2*2) : 22/84 : 26% 2: (1/7)*(1/3) + (3/7)*(1/2) + (3/7)*(1/2)*(1/2) : (4+18+9)/(7*3*2*2) : 31/84 : 37% 3: (1/7)*(1/3) + (3/7)*(1/2) + (3/7)*(1/2)*(1/2) : (4+18+9)/(7*3*2*2) : 31/84 : 37%

gsf wrote:as grids get filled in the remaining cell independence decreases
to the point, as this example illustrates, where all cells are dependent
its this creeping interdependence that bothers me about the averaging in the unbiased analysis

also, wouldn't a random grid generator be biased if it didn't return grid 1 48% of the time?

Have to think this over, but maybe someone else has an answer ready.

by **gsf** » Thu Jan 04, 2007 4:00 pm

Ocean wrote:
gsf wrote:there are only 15 choices, each of which fixes the remaining choices
the choices are distributed over 7 cells

Thanks for checking (and 'correcting') the numbers. - Although the above statement must be wrong: only nine of those 15 choices will fix the grid, while the remaining six choices leave the grid unfixed, which leads to a second choice. My earlier 'back-of-an-envelope' calculations might have been too hasty, but here is what I get now:

Code: Select all
1: (1/7)*(1/3) + (6/7)*(1/2)*(1/2) : (4+18)/(7*3*2*2) : 22/84 : 26% 2: (1/7)*(1/3) + (3/7)*(1/2) + (3/7)*(1/2)*(1/2) : (4+18+9)/(7*3*2*2) : 31/84 : 37% 3: (1/7)*(1/3) + (3/7)*(1/2) + (3/7)*(1/2)*(1/2) : (4+18+9)/(7*3*2*2) : 31/84 : 37%

and the same thanks back to you

Ocean wrote:
gsf wrote:as grids get filled in the remaining cell independence decreases
to the point, as this example illustrates, where all cells are dependent
its this creeping interdependence that bothers me about the averaging in the unbiased analysis

also, wouldn't a random grid generator be biased if it didn't return grid 1 48% of the time?

Have to think this over, but maybe someone else has an answer ready.

well typing in the previous post, hitting submit, and then sleeping, cleared my head a bit

I was looking at this thread strictly from the viewpoint of a generator looking for a single solution grid
and being unbiased in cell assigments, which is completely different from being unbiased
in selecting puzzles from the set of non-isomorphic solution grids

your example (on third read) clears that up and also shows the connection to red ed's initial post

so
what if the random generator (as laid out by ocean) keeps a running average of U4s
of previously generated puzzles and biases some of its random cell assignments to
push the average to the expected value -- would this steer the generator to a better
or worse stream of grids?

by **Ocean** » Thu Jan 04, 2007 8:35 pm

gsf wrote:so
what if the random generator (as laid out by ocean) keeps a running average of U4s
of previously generated puzzles and biases some of its random cell assignments to
push the average to the expected value -- would this steer the generator to a better
or worse stream of grids?

The UR4s were used as an example of how bias may occur. But the same goes for UR4+UR6, 2xUR6, or in general a collection of unavoidable sets of any size.

Similarly, the (bias of) number of UR4s in generated grids is an indicator of biased grid generation. But correcting an indicator is clearly no fix of the problem itself.

Maybe as an approximation strategy - who knows?
(As calculators use approximative linear algorithms with known error bounds instead of perfect algorithms for certain functions: Maybe fast approximative algorithms for grid generation can be developed, where the bias is negligible for all practical purposes.)

by **gsf** » Thu Jan 04, 2007 8:45 pm

Ocean wrote:Maybe as an approximation strategy - who knows?
(As calculators use approximative linear algorithms with known error bounds instead of perfect algorithms for certain functions: Maybe fast approximative algorithms for grid generation can be developed, where the bias is negligible for all practical purposes.)

that's what I was getting at

by **holdout** » Thu Jan 18, 2007 10:26 pm

I've read you posting several times, and on about the 4th read had a little chuckle:

Red Ed wrote:An obvious speed improvement would be to do more up-front work.

I finally understood the "big idea" and now realize that the amount of one-time work you have already done is substantial. The programming is non-trivial and the math is excellent.

Do you think that starting with four digits is a realizable goal? I will be replicating some of your work just to see how far I can get. Good job.

by **gsf** » Thu Jan 18, 2007 10:52 pm

based on a thread in the programmers forum I updated (an reposted) my solver to use
rookery based generation and it sped solution grid generation by ~3X along with reducing the bias
it now generates ~5200 grids/sec/Ghz

The New Sudoku Players' Forum

Unbiased grid generation

Re: Illustration of biased grid generation.

Re: Unbiased grid generation

Re: Unbiased grid generation