All grids are equal but some are more equal than others !

Everything about Sudoku that doesn't fit in one of the other sections

All grids are equal but some are more equal than others !

Postby coloin » Fri May 12, 2006 4:25 pm

A previous discussion has been concerning the 4x4 sudoku.....and the number of puzzles has been calculated.
http://forum.enjoysudoku.com/viewtopic.php?t=3351&postdays=0&postorder=asc&start=15
sq wrote:Just to round off the discussion; Shi Doku has

288 solutions of which 2 are essentially different
13579680 puzzles of which 85632 are irreducible
1536 6-clue, 58368 5-clue and 25728 4-clue irreducible puzzles
4710 essentially different puzzles of which 36 are irreducible (13 4-clues, 22 5-clues and 1 6-clue)

What about the 3x3 ?
Just how many minimal puzzles are there in a grid ?
I dont think anyone has come up with a sensible estimate - quite rightly !
The generation of of 1 million minimal sudoku puzzles has been performed for several grids and the clue statisics have been shown
dukuso wrote:
Code: Select all
counts of clues in randomly generated minimal
sudokus over sf :
average:24.104503 , 1000000 samples
19 3
20 229
21 6277
22 61494
23 227035
24 352839
25 248868
26 86121
27 15453
28 1589
29 89
30 3 

Concerns over the proportions must be made due to the randomization/computation methods which produces the puzzles.
1.common size 24 clue may well be over represented / rarer sizes under.
2.duplication of puzzles by a quirk in the randomization

If the above could be excluded, and this is a big if, if we find the average seed value at which puzzles start to get duplicated - this would be proportional to the overall [finite] number of puzzles in that particular grid.
The 24 clue puzzle is the commonest
If the seed value that we reach a single duplicate is on average 50,000 then perhaps this could be extrapolatable that there are 900 million puzzles of that size. [Complete guesswork on the figures here !]
I have a hunch that certain grids have more puzzles than others - e.g the grids which have a low minimal - specifically the SF grid.

Certainly - if you generate enough puzzles you will start to get duplicates - the less puzzles there actually are - the sooner you will get duplicates.

Any thoughts into whether this is practical - or possible ?

see http://mathforum.org/dr.math/faq/faq.birthdayprob.html
There is a 50% chance that 2 out of 23 people will share the same birthday - therefore number of days in the year = 365 !
C
Apologies to G.Orwell
coloin
 
Posts: 2502
Joined: 05 May 2005
Location: Devon

Postby r.e.s. » Fri May 12, 2006 10:28 pm

(OT reply here)
r.e.s.
 
Posts: 337
Joined: 31 August 2005

Re: All grids are equal but some are more equal than others

Postby JPF » Sun May 14, 2006 11:33 pm

coloin wrote:... Certainly - if you generate enough puzzles you will start to get duplicates - the less puzzles there actually are - the sooner you will get duplicates...

If I understand what you want to do, the process would be, for a given grid G :

1) Pick randomly a minimal puzzle P included in G
2) Check if the puzzle P has already been picked.
2.1) if yes, note the number k of selected minimal puzzles
2.2) if no, go to 1)


Let's assume that the number of minimal puzzles included in G is N.
The first occurrence of 2 identical minimal puzzles comes after k trials (2<=k<=N+1)

Let's call X the random variable related to k.

The probability P(N,k) of X=k is:
P(N,k)=(k-1)*N!/[(N^k)*(N-k+1)!]

The expected value E(X) is a function H(N) of N :

E(X)=2*P(N,2)+3*P(N,3)+...+N*P(N,N)+(N+1)*P(N,N+1)
H(N)= N!*SUM [k*(k-1)/((N^k)*(N-k+1)!) ; k=2,3,...,N+1]

May be there is an analytical expression of the function H.
Here are the first values of H :

H(2)=2.5
H(10)=4.7
H(100)=13.2

I don't have the tools to go further.

An estimation u of E(X) can then be found by repeating the process 1000 times (?) and calculating the mean of the 1000 values of k.

N is then estimated by using u=H(N)

No idea on N (depends of the grid, of course)
No idea on the order of magnitude of H(10^9).

Hope that could help.

JPF

PS : To me, step 1) is not easy.
JPF
2017 Supporter
 
Posts: 6139
Joined: 06 December 2005
Location: Paris, France

Re: All grids are equal but some are more equal than others

Postby r.e.s. » Mon May 15, 2006 3:14 am

JPF wrote:The probability P(N,k) of X=k is:
P(N,k)=(k-1)*N!/[(N^k)*(N-k+1)!]

Coincidentally, several years ago I posted <here> what I'd found to be the exact maximum-likelihood estimator for N:

N*(2) = 1
N*(k) = floor((1/2)*(k-1)^2 + (1/6)*(k-1)), k=3,4,5,...

which is essentially quadratic in k.
r.e.s.
 
Posts: 337
Joined: 31 August 2005

Re: All grids are equal but some are more equal than others

Postby JPF » Mon May 15, 2006 7:34 am

r.e.s wrote:what I'd found to be the exact maximum-likelihood estimator for N:

N*(2) = 1
N*(k) = floor((1/2)*(k-1)^2 + (1/6)*(k-1)), k=3,4,5,...

which is essentially quadratic in k.

Great !

Can we safely say that if N is large enough (N=10^n ; n>=6 for instance), the expected value (k*) of k is roughly :
k*=SQRT(2*N) ?

In such case N = 10^9 => k*= 44700, which gives some interest to the proposed technique for estimating N for a grid G.

JPF
JPF
2017 Supporter
 
Posts: 6139
Joined: 06 December 2005
Location: Paris, France

Postby coloin » Mon May 15, 2006 9:58 am

Thank you for your work on this.....
I knew it was complicated !

r.e.s. has calculated that if the average number of grids produced before we get a duplicate is 10,000 then the total number of grids is around 72 million. And what coincidence that r.e.s has done similar work !

JPFs comments are on the track too - duplicate at 44700.........N = 10^9

I have generated grids over the weekend.

I chose the canon grid as it has the highest proportion of large puzzles....[although I think this may be artifactual as there is over 500-fold puzzle-isomorphism in the grid] [I also suspect there are less total number of grids]

I got the following reults for 30 000 000 generated puzzles
Code: Select all
total greater than 28  220867
29   206677
30   13769
31   416
32   5


Give me a few days to look at them and see if there are any [?] duplicates !

Do you know the command in excel [or other spreadsheet which can handle more lines] to find duplicate entries ?

C
coloin
 
Posts: 2502
Joined: 05 May 2005
Location: Devon

Postby JPF » Mon May 15, 2006 12:33 pm

coloin wrote:...r.e.s. has calculated that if the average number of grids produced before we get a duplicate is 10,000 then the total number of grids is around 72 million...

I’m not sure we can use r.e.s. post like that.

With k=10,000 in average, the M-L. estimator of N ~1/2x(10,000)^2=50x10^6 which is ...
anyway not far from 72x10^6.

JPF
JPF
2017 Supporter
 
Posts: 6139
Joined: 06 December 2005
Location: Paris, France

Postby r.e.s. » Tue May 16, 2006 6:33 pm

JPF wrote:Can we safely say that if N is large enough (N=10^n ; n>=6 for instance), the expected value (k*) of k is roughly :
k*=SQRT(2*N) ?

That's a good question. "Safely" is relative, of course, and I haven't done any analysis to get an idea of the error that that would involve. Still, ~sqrt(2N) would be my best guess, if I had to make one by that approach (but see below).

coloin wrote:r.e.s. has calculated that if the average number of grids produced before we get a duplicate is 10,000 then the total number of grids is around 72 million.

Not quite ... but the median number of different individuals showing up more than once in a random sample of size 10000 from a population of size N is about 1 when N=72million. As JPF noted, if we sample only until the first duplicate occurs (say on trial k), then the maximum-likelihood estimate I gave for N when k=10000 is about (10000^2)/2 = 50million.
--
JPF's ealier point about the difficulty of ensuring a truly random sample is probably paramount; but, if anyone is really serious about considering a sampling approach to the problem of estimating N, here are some suggestions (which also suit the method of sampling that coloin seems to intend) ...

Instead of sampling until the first duplicate is observed, fix a sample size (say n) and then take a random sample of that size. The population size N can then be estimated in terms of the observed "frequencies of frequencies" that occur in the sample (the y_i, explained below). This is not a widely-known approach, but I posted some details about it quite a few years ago in a sci.math thread titled "Birthday problem in reverse" -- Googling still finds it.

Translating the main results from there to the present problem ...
Code: Select all

 E(y_i) = C(n,i) * (1/N)^(i-1) * (1-1/N)^(n-i)   (1 <= i <= n)

where y_i is the number of different individuals showing up exactly i times in a random sample of size n from a population of N different individuals, and C() is a binomial coefficient.

In particular, letting y_sum =  y_1 + ... + y_n (which is the number of distinct individuals observed in the sample),

 E(y_sum) = N * (1 - (1-1/N)^n)

so a method-of-moments estimator for N is obtained as a function of (n, y_sum) by solving for N_mm in 

 y_sum = N_mm * (1 - (1-1/N_mm)^n).

An approximation that's good when n << N is 

 N_mm ~ n(n-1)/(2*(n-y_sum))

(Note that y_sum = n-1 corresponds to a sample with exactly one duplicate, in which case N_mm ~ n(n-1)/2, consistent with the essentially quadratic maximum-likelihood esimator N* that I posted earlier.)

Here's a trivial example (but one I actually just now generated with a pseudo-random sample) to help clarify the method ...

Code: Select all
Population: an alphabet of size 62 (0-9A-Za-z)
Random sample of size n=30:  mxSCcQ9tX5sWf1nKtcLmPhLJN5CTs2
Sorted and grouped:   1 2 55 9 CC J K LL N P Q S T W X cc f h mm n ss tt x
y_1 = 16 (1 2 9 J K N P Q S T W X f h n x)
y_2 = 7  (55 CC LL cc mm ss tt)
y_3 = ... = y_30 = 0
y_sum = 16+7 = 23
N_mm = 52.36...  (by numerical solution of 23 = N_mm * (1 - (1-1/N_mm)^30))

The poor performance (62 estimated by about 52) reflects the small sample size.  The approximation formula isn't supposed to apply, because we don't have n << N, but apparently by a quirk it leads to a better estimate: 30(30-1)/(2(30-23)) = 62.1...
r.e.s.
 
Posts: 337
Joined: 31 August 2005

Postby coloin » Tue May 16, 2006 8:09 pm

Thankyou r.e.s - I think you are right

I think the spread of the occurance of a first duplicate is going to be so great to make finding a proper "average" - median or mean - difficult and unreliable.

However the second approach of the "frequencies of frequencies" that you outlined is going to be more productive. In fact I was struggling to diffenciate between the methods in my mind when I posted the original post. I was not quite sure how small the error would be when generating several duplicates.

Getting back to the topic and its possible use .......? finding a grid with a high number of large clued puzzles.

I am having trouble finding ANY duplicates in the supposed less frequent sizes of puzzles......I think I may go back to the program maker and see if he can program an analytical tool.

In analysis of a few grids here are the relative frequencies of minimal clue generation for a few grids - per miilion generated.
Note we have no idea of the absolute values.

Code: Select all
Known puzzle size                                                                            Min.  Max.
Random7   856923714731456298492817356348291675925764183167385429273548961684179532519632847  19      ?
Canon     123456789789123456456789123231564897564897231897231564312645978645978312978312645  19     35
SF        639241785284765193517983624123857946796432851458619237342178569861594372975326418  17     34
Coloinmax 126347598458169732379285461213478659584692317697513824732851946845926173961734285  20     35


Code: Select all
                                            Random7    Canon        SF    Coloinmax   

Number of 28 clue puzzles per million         2414     50751      1554       4084       
Number of 29 clue puzzles per million          120      6735        95        271         
Number of 30 clue puzzles per million            6       466         1         10               



Perhaps the canonical grid is way out in front when it comes to the number of high clue minimals per million, but I think a grid like the SF might have many more minimal puzzles......

Looking for a duplicate .....

EDIT
from 1000000 generated grids from the canon grid [slightly different due to randomization]
Code: Select all
puzzle size   puzzles in sample   duplicated puzzles in sample
   28              50796                               0
   29               6787                               0
   30                455                               0
   31                 16                               0

Which means at the 28 clue level there may be at least 10^9 puzzles

EDIT
Well this perhaps isnt surprising ! We were able to make over 110 different nonisomorphic puzzles of size 33 - and there are many more than this.
http://forum.enjoysudoku.com/viewtopic.php?t=1448&postdays=0&postorder=asc&start=15
This means there are at least 110* 648 puzzles = 71280 puzzles of size 33. I now have no idea how many puzzles there are per grid !

EDIT I have found over 2000 34s in a single grid - and there may well be many more.......maybe this is not surprizing.......some estimates are of 1*e^24 for the number of minimal puzzles per grid [not sure how this is estimated]
C
coloin
 
Posts: 2502
Joined: 05 May 2005
Location: Devon


Return to General

cron