The real distribution of minimal puzzles

Everything about Sudoku that doesn't fit in one of the other sections

Re: The real distribution of minimal puzzles

Postby denis_berthier » Mon Jun 03, 2013 5:37 am




How to compute unbiased statistics


Considering some recent discussions about unbiased statistics in the "grey zone" and the "JExocet" threads, it seems that some of the results reached in this old lost thread and some of their obvious consequences have been forgotten.
For a summary of the essential results of this thread, see the references below.

Note: the following applies to the whole collection of minimal puzzles. Obviously, it cannot be extended to a restricted collection defined by particular properties, such as having a rating larger than some predefined value (with respect to some predefined rating) or having some predefined pattern.


Main result

The main result of this thread was that we now have a close estimate of the real distribution of minimal puzzles with respect to their number of clues (and, as a result, a close estimate of the total number of minimal puzzles). Let me recall the distribution here for ease of use.


Code: Select all
#clues      %puzzles 
20          1.32E-7       
21          0.000034
22          0.00348
23          0.148
24          2.28
25          13.42
26          31.91
27          32.71
28          15.48
29          3.598
30          0.41
31          0.0241
32          0.00102

These results are valid with an overall relative precision of 0.065%.

See page 43 of the PDF copies of the old forum for more detailed results.


Puzzles outside the [20, 31] range can be neglected in any statistical study bearing on the whole collection of minimal puzzles (e.g. the frequency of puzzles having some pattern).

Let pk be the proportion of puzzles with k clues, according to this estimate.


Practical applications

As of today, we are still unable to generate unbiased uncorrelated collections of puzzles.
The best we can generate remains uncorrelated collections with controlled bias (this is how the above results were obtained), and such generation remains very time-consuming (about 257,000 times longer than top-down generation).

However, do we need a controlled-bias collection to produce unbiased statistics? NO. The above results are valid once and for all, and they can be used as such.
Of course, the closer to unbiased the better, and controlled bias is the closest we currently have.
But suppose we have another uncorrelated collection, generated e.g. by a top-down generator (whose bias is much larger than the controlled bias) or a bottom-up one (whose bias is larger still).

Suppose we are interested in some random variable X, e.g.:
- X = 1 if the puzzle has an sk-loop, 0 otherwise
- X = 1 if the puzzle has a J-Exocet, 0 otherwise
- X = number of J-Exocets in the puzzle
- X = number of eliminations by J-Exocets in the puzzle after some set of other rules (e.g. SSTS) has been applied
and suppose we want to estimate the real mean of X.

If, instead of simply taking the mean for all the puzzles in the collection, we use a weighted mean (using the pk as weights), we get an estimate of the unbiased mean value of X for the minimal puzzles. More precisely, let E(Xk) be the mean value of X for the puzzles with k clues in the collection. Then the unbiased estimate of the mean E(X) of X is merely:

E(X) = p20*E(X20) + p21*E(X21) + ... + p32*E(X32)

Another useful quantity is the standard deviation sd(X) of X (it can be used as an informal measure of the precision of the estimate of E(X)). By the law of total variance, it is given by

sd(X)^2 = [p20*sd(X20)^2 + ... + p32*sd(X32)^2] + [p20*(E(X20)-E(X))^2 + ... + p32*(E(X32)-E(X))^2]

where the second, between-groups sum vanishes only when all the E(Xk) coincide.


This is not a panacea, especially if the largest values of X occur with puzzles having a small or a large number of clues.
These formulae also show that, when studying X, it is generally useless to analyse the full collection at our disposal. Analysing millions of puzzles with low-probability numbers of clues will not improve the results. It is more efficient to randomly extract a sub-collection.
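As a sketch, the weighted estimate can be computed as follows. The pk come from the table above; the per-clue-count means E(Xk) passed to the function are hypothetical inputs that a study would measure on its own collection:

```python
# Proportions pk (in %) of minimal puzzles per number of clues, from the table above.
pk = {20: 1.32e-7, 21: 0.000034, 22: 0.00348, 23: 0.148, 24: 2.28,
      25: 13.42, 26: 31.91, 27: 32.71, 28: 15.48, 29: 3.598,
      30: 0.41, 31: 0.0241, 32: 0.00102}

def unbiased_mean(mean_by_k):
    """Weighted mean E(X) = sum_k pk*E(Xk); mean_by_k maps #clues -> E(Xk)."""
    total = sum(pk.values())  # normalize: the percentages sum to ~100
    return sum(pk[k] * mean_by_k.get(k, 0.0) for k in pk) / total

# Sanity check: if X is identically 1, the weighted mean must be 1.
assert abs(unbiased_mean({k: 1.0 for k in pk}) - 1.0) < 1e-9
```

The same weighting applies to any per-clue-count statistic, so a biased but uncorrelated collection can still yield unbiased estimates.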


Concerning JExocets: in the "potential hardest" collection, almost all of them are found in puzzles with k = 22, 23 or 24 clues.
If this remained true for a top-down generated collection, this would be enough to entail that JExocets are present in at most 2% of the minimal puzzles.
(Indeed, my own estimates, based on other calculations, put it very close to 0%).
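As a quick check of that bound: summing the shares of 22-, 23- and 24-clue puzzles from the distribution above gives the maximum possible overall frequency, even under the extreme assumption that every such puzzle has a JExocet:

```python
# Shares (in %) of 22-, 23- and 24-clue minimal puzzles, from the table above.
p22, p23, p24 = 0.00348, 0.148, 2.28
upper_bound = p22 + p23 + p24  # every such puzzle having a JExocet is the worst case
print(f"{upper_bound:.2f} %")  # prints "2.43 %"
```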


References:

- "Unbiased Statistics of a CSP - A Controlled-Bias Generator", International Joint Conferences on Computer, Information, Systems Sciences and Engineering (CISSE 09), December 4-12, 2009, Springer. pdf preprint. Published as a chapter of the book Innovations in Computing Sciences and Software Engineering, Khaled Elleithy Editor, pp. 11-17, Springer, 2010, ISBN 97890481911133.
- "Constraint Resolution Theories", Lulu.com, Oct. 2011, ISBN 978-1-4478-6888-0
- "Pattern-Based Constraint Satisfaction", chapter 6, Lulu.com, Nov. 2012, ISBN 978-1-291-20339-4


[Edit 2013/06/20: added precision of the distribution and reference to p. 43]
Last edited by denis_berthier on Thu Jun 20, 2013 4:13 am, edited 2 times in total.
denis_berthier
2010 Supporter
 
Posts: 1253
Joined: 19 June 2007
Location: Paris

Re: The real distribution of minimal puzzles

Postby champagne » Mon Jun 03, 2013 7:04 am

denis_berthier wrote: Main result

The main result of this thread was, we now have a close estimate for the real distribution of minimal puzzles wrt to their number of clues (as a result, we also have a close estimate for the total number of minimal puzzles). Let me recall the distribution here for ease of use.


Code: Select all
#clues      %puzzles
20          0.0
21          0.000034
22          0.0034
23          0.149
24          2.28
25          13.42
26          31.94
27          32.74
28          15.48
29          3.56
30          0.41
31          0.022



if eleven's last generation is representative, the distribution in the grey area is completely different:
eleven found 93% of the puzzles in the 23-26 clue range for a rating of 8.6 and more
champagne
2017 Supporter
 
Posts: 5683
Joined: 02 August 2007
Location: France Brittany

Re: The real distribution of minimal puzzles

Postby denis_berthier » Mon Jun 03, 2013 7:36 am

champagne wrote:if eleven's last generation is representative, the distribution in the grey area is completely different
eleven found 93% of the puzzles in the 23-26 clue range for a rating of 8.6 and more


As I wrote at the start of my post, these stats can't be applied to any sub-collection, such as a "grey area" based on SER values.
Moreover, as I also wrote, collections generated by a top-down generator are known to be strongly biased.
denis_berthier
2010 Supporter
 
Posts: 1253
Joined: 19 June 2007
Location: Paris

Re: The real distribution of minimal puzzles

Postby champagne » Mon Jun 03, 2013 8:11 am

denis_berthier wrote:
champagne wrote:if eleven's last generation is representative, the distribution in the grey area is completely different
eleven found 93% of the puzzles in the 23-26 clue range for a rating of 8.6 and more


As I wrote at the start of my post, these stats can't be applied to any sub-collection, such as a "grey area" based on SER values.
Moreover, as I also wrote, collections generated by a top-down generator are known to be strongly biased.


right as a general frequency, but the frequency in the grey area has more to do with the collection of puzzles eligible to the grey area
champagne
2017 Supporter
 
Posts: 5683
Joined: 02 August 2007
Location: France Brittany

Re: The real distribution of minimal puzzles

Postby denis_berthier » Mon Jun 03, 2013 8:16 am

champagne wrote:right as a general frequency, but the frequency in the grey area has more to do with the collection of puzzles eligible to the grey area

Could you stop polluting every possible thread with your compulsive reactions, that are both out of topic and devoid of any content?
denis_berthier
2010 Supporter
 
Posts: 1253
Joined: 19 June 2007
Location: Paris

Re: The real distribution of minimal puzzles

Postby champagne » Mon Jun 03, 2013 8:18 am

denis_berthier wrote:
champagne wrote:right as a general frequency, but the frequency in the grey area has more to do with the collection of puzzles eligible to the grey area

Could you stop polluting every possible thread with your compulsive reactions devoid of any content?



Could you please tell me what is wrong in my sentence?

As far as I know, the frequency in the grey area is the ratio

puzzles of the grey collection having the property "x" / total number of puzzles in that collection
champagne
2017 Supporter
 
Posts: 5683
Joined: 02 August 2007
Location: France Brittany

Re: The real distribution of minimal puzzles

Postby denis_berthier » Mon Jun 03, 2013 8:31 am

champagne wrote:
denis_berthier wrote:
champagne wrote:the frequency in the grey area has more to do with the collection of puzzles eligible to the grey area

Could you stop polluting every possible thread with your compulsive reactions devoid of any content?

could you please tell me what is wrong in my sentence?


If you need explanations, here they are:

1) In standard English (but this can be translated into any natural language):
the sentence "the frequency in the grey area has more to do with the collection of puzzles eligible to the grey area"
is equivalent to
"the frequency in the grey area is more about puzzles in the grey area"
which would be a tautology without the word "more".

2) I clearly declared the grey area was OOT. That's the problem in general with compulsive posters like you: answering before they have understood what's being discussed. As a result, every thread becomes polluted with irrelevant posts.

To be clear: I'm not expecting any answer !!!!!
And I would be the happier if Jason deleted all the parasitic posts after this http://forum.enjoysudoku.com/post227502.html#p227502

(I mean both champagne and mine, including the present one)
denis_berthier
2010 Supporter
 
Posts: 1253
Joined: 19 June 2007
Location: Paris

Re: The real distribution of minimal puzzles

Postby David P Bird » Mon Jun 03, 2013 9:36 am

I'm interested in the practical value of these statistical analyses.

In the general population, a very high proportion of puzzles are solvable by simple methods, and these are of no great interest to contributors to the "Advanced Solving Techniques" sub-forum. What remains, therefore, is the set of "interesting" puzzles, which will have completely different statistics from the uninteresting ones.

In that set we would now like to know how frequently our different advanced techniques
a) can be applied (i.e. how frequently the required pattern will occur)
b) will be successful in producing a reduction in the puzzle

Using these measures we can then establish where in our hierarchy of methods they should be placed.

At the moment I can't see much point in precisely determining the frequency of these methods being applicable to the general population. We already know we won't be needing them for the uninteresting puzzles. Whether we should gain by using them when they are unnecessary as "sledge hammers to crack nuts" is of relatively minor importance.

However, I can see that the boundary between the interesting and non-interesting puzzles won't be fixed, and will depend on where in the hierarchy of methods we start classing our methods as advanced.

I'm not making this as an adversarial point, and I don't intend to engage in point-scoring arguments. I simply want to explore how a statistical analysis approach can be used to most benefit.
David P Bird
2010 Supporter
 
Posts: 960
Joined: 16 September 2008
Location: Middle England

Re: The real distribution of minimal puzzles

Postby denis_berthier » Mon Jun 03, 2013 10:00 am

Hi David,

The above recalled stats for the general population of minimal puzzles are not the end of the story.
I also provided detailed stats for the W rating (see the same references).
And I showed that including Subsets in the set of rules didn't change the stats significantly.

The fact that "non-interesting" puzzles are included in the above stats is not a problem. Depending on where you put the boundary, it could be easy to produce stats for the remaining "interesting" ones (e.g. if you discard only the puzzles solvable by Singles and Whips[1]). The only question is where you put the boundary.

I plainly agree with what you say: estimating the practical usefulness of a rule depends on where we put it in a hierarchy of rules. This would undoubtedly become very controversial as soon as we go beyond the following rules:
- Singles
- Whips[1], i.e. basic interactions

If such discussions should ever occur, I'd prefer to have them in another thread; they would clearly be OOT here.
denis_berthier
2010 Supporter
 
Posts: 1253
Joined: 19 June 2007
Location: Paris

The "dead clues" bias in the puzzle generation process

Postby dobrichev » Sat Jun 08, 2013 11:20 am

One of the side effects of the puzzle minimization algorithms discussed here is the possibility of generating all puzzles for a subgrid, or even (theoretically) for an entire solution grid.
The algorithm works on a fixed grid and accepts a set of forced non-givens. If the set of forced non-givens is empty, the entire grid is enumerated. For a non-empty set, the givens are minimized so that all minimal puzzles are generated.
The algorithm is based on the concept that every unavoidable set within the grid must be hit by at least one clue (Gary McGuire), and further on the concept that duplicate puzzles can be discarded by marking the cells that have already been processed within the given context as "dead clues". AFAIK the "dead clues" concept originates from this Ocean's post in 2006.
Since the algorithm is a depth-first search, it starts producing results almost immediately.
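As an illustration only (this is not dobrichev's gridchecker code), the "dead clues" technique can be sketched on the abstract hitting-set problem behind it: every unavoidable set must contain at least one given, and marking already-explored cells as dead in the remaining branches guarantees that no hitting set is generated twice. A minimality check then filters the output:

```python
def hitting_sets(unavoidable, chosen=frozenset(), dead=frozenset()):
    """Yield each hitting set of `unavoidable` (a list of frozensets of cells)
    exactly once, via the "dead clues" technique: once a cell's subtree has
    been fully enumerated, the cell is marked dead for the sibling branches."""
    unhit = [s for s in unavoidable if not (s & chosen)]
    if not unhit:
        yield chosen
        return
    dead_here = set(dead)
    for cell in sorted(unhit[0]):           # branch on the first unhit set
        if cell in dead_here:
            continue
        yield from hitting_sets(unavoidable, chosen | {cell}, frozenset(dead_here))
        dead_here.add(cell)                 # dead clue: never reused deeper/later

def is_minimal(hs, unavoidable):
    """A hitting set is minimal if removing any cell leaves some set unhit."""
    return all(any(not (s & (hs - {c})) for s in unavoidable) for c in hs)

# Toy instance: two overlapping unavoidable sets over cells 1, 2, 3.
unavoidable = [frozenset({1, 2}), frozenset({2, 3})]
minimal = {hs for hs in hitting_sets(unavoidable) if is_minimal(hs, unavoidable)}
print(sorted(map(sorted, minimal)))  # prints [[1, 3], [2]]
```

The non-duplication comes from the dead marks alone; the order in which cells are marked dead is exactly what dobrichev identifies below as the source of the bias in a truncated run.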

The exercise

I used my recent implementation of this algorithm in gridchecker.
I used two source grids: the one with the largest known number of hardest puzzles, and one of blue's "average" grids, which is average in another sense.
For the first grid I generated all minimal puzzles for some areas of forced non-givens. For this exercise I used the puzzles within 44 fixed givens (h44, i.e. 81-44 = 37 forced non-givens), and the starting part of the enumeration of all puzzles (h81, without any restriction except the given unique solution grid and minimality).
For the second grid I generated the starting part of all puzzles (am81).
Then I compared the distributions to those from Denis Berthier (DB).

Results

Code: Select all
     am81      h81    h44       am81%        h81%        h44%        DB%
20                          0.0000000   0.0000000   0.0000000   0.000000
21                          0.0000000   0.0000000   0.0000000   0.000034
22                          0.0000000   0.0000000   0.0000000   0.003400
23                      9   0.0000000   0.0000000   0.0030945   0.149000
24              74   1546   0.0000000   0.0021854   0.5315692   2.280000
25            5834  21715   0.0000000   0.1722955   7.4663815  13.420000
26     31   109274  71286   0.0112368   3.2271888  24.5106365  31.940000
27   2609   674117 115864   0.9457010  19.9086958  39.8381224  32.740000
28  36372  1358764  68328  13.1839930  40.1283740  23.4935720  15.480000
29 105191   939847  11389  38.1292591  27.7564993   3.9159392   3.560000
30  96402   259322    694  34.9434537   7.6585560   0.2386216   0.410000
31  31231    35700      6  11.3205017   1.0543280   0.0020630   0.022000
32   3757     2980          1.3618240   0.0880083   0.0000000   0.000000
33    273      130          0.0989561   0.0038393   0.0000000   0.000000
34     14        1          0.0050747   0.0000295   0.0000000   0.000000
   275880  3386043 290837 100.0000000 100.0000000 100.0000000 100.004434


[attachment: distrib.PNG - "The bias"]


Comments

The h44 sample is the closest to DB. The possible origins of the bias are the selection of an atypical grid and/or the selection of the forced non-givens. This sample differs from the others in that it is a complete enumeration and is therefore absolutely insensitive to the method of its generation.

The h81 sample has over 3 million puzzles, but they are generated at the very beginning of the process, where many of the branches in the search tree are in their first iteration. The implemented "dead clues" scheme first tries a cell and, after the whole enumeration of the subtree, marks the cell as "dead" and continues with the next one. So the possible duplicates are generated at the early stages and skipped at the later stages. This is where the bias comes from. Technically, the inverse implementation is possible, and then the duplicates would be skipped at the early stages and generated at the later ones.

am81 is a 10 times smaller sample, from a different grid, which confirms the bias.

The observed side effect could be used to generate from scratch minimal puzzles with a relatively high number of clues.
dobrichev
2016 Supporter
 
Posts: 1316
Joined: 24 May 2010

Re: The real distribution of minimal puzzles

Postby denis_berthier » Sat Jun 08, 2013 11:50 am

Hi dobrichev,

One main difference between the controlled-bias generator and your approach is that the puzzles you get are strongly correlated. They can't be used to compute statistics such as classifications or the frequency of some rule.
denis_berthier
2010 Supporter
 
Posts: 1253
Joined: 19 June 2007
Location: Paris

Re: The real distribution of minimal puzzles

Postby Red Ed » Sat Jun 08, 2013 3:44 pm

Strictly speaking, correlation doesn't kill the frequency estimates; it's (unknown) bias that does that. Subtree enumeration can work fine if you have an unbiased source of input grids, rather than just two as dobrichev has.
Red Ed
 
Posts: 633
Joined: 06 June 2005

Re: The real distribution of minimal puzzles

Postby denis_berthier » Sun Jun 09, 2013 3:55 am

Red Ed wrote:Strictly speaking, correlation doesn't kill the frequency estimates

As it kills the estimates for the standard deviation, it makes the frequency estimates of little use.
denis_berthier
2010 Supporter
 
Posts: 1253
Joined: 19 June 2007
Location: Paris

Re: The real distribution of minimal puzzles

Postby Red Ed » Sun Jun 09, 2013 6:50 am

denis_berthier wrote:As it kills the estimates for the standard deviation, it makes the frequency estimates of little use.
Not the case for subtree(#) enumeration, where the random variables are of the form f(S), S being a randomly-picked s-clue subgrid. Treating f(S) as a black box generator, its outputs are uncorrelated, and one can estimate the variance from the data. The fact that the internal workings of S (for each instance, a collection of c-clue puzzles) are strongly correlated doesn't matter.

Perhaps none of this matters to dobrichev anyway, but I just don't want to put him off exploring subtree-type methods.

(#) okay, subforest
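As a numerical sketch of this two-level scheme (the function f below is a toy stand-in for the real subtree aggregate; all parameters are illustrative): each randomly picked subgrid S contributes a single observation f(S), and the mean and its standard error are estimated from those top-level observations alone, so the correlation among the puzzles inside each subtree never enters the formulas:

```python
import random
import statistics

random.seed(1)

def f(S):
    """Toy stand-in for the subtree aggregate: given a 'subgrid' S (here just a
    random seed), return e.g. the fraction of its puzzles with some property.
    The puzzles inside one subtree may be strongly correlated; only the
    per-subtree aggregate f(S) is used as a data point."""
    rng = random.Random(S)
    n = rng.randint(50, 500)                       # puzzles in this subtree
    hits = sum(rng.random() < 0.1 for _ in range(n))  # true frequency ~0.1
    return hits / n

# Top-level observations are i.i.d. draws of f(S), so the usual estimators apply.
samples = [f(random.getrandbits(64)) for _ in range(200)]
mean = statistics.fmean(samples)
stderr = statistics.stdev(samples) / len(samples) ** 0.5
print(f"estimate = {mean:.3f} +/- {stderr:.3f}")
```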
Red Ed
 
Posts: 633
Joined: 06 June 2005
