The real distribution of minimal puzzles

Everything about Sudoku that doesn't fit in one of the other sections

Re: The real distribution of minimal puzzles

Postby denis_berthier » Mon Jun 03, 2013 5:37 am




How to compute unbiased statistics


Considering some recent discussions about unbiased statistics in the "grey zone" and the "JExocet" threads, it seems that some of the results reached in this old lost thread and some of their obvious consequences have been forgotten.
For a summary of the essential results of this thread, see the references below.

Note: the following applies to the whole collection of minimal puzzles. Obviously, it cannot be extended to a restricted collection defined by particular properties, such as having a rating larger than some predefined value (with respect to some predefined rating) or having some predefined pattern.


Main result

The main result of this thread was that we now have a close estimate of the real distribution of minimal puzzles with respect to their number of clues (and, as a result, a close estimate of the total number of minimal puzzles). Let me recall the distribution here for ease of use.


Code: Select all
#clues      %puzzles 
20          1.32E-7       
21          0.000034
22          0.00348
23          0.148
24          2.28
25          13.42
26          31.91
27          32.71
28          15.48
29          3.598
30          0.41
31          0.0241
32          0.00102

These results are valid with an overall relative precision of 0.065%.

See page 43 of the PDF copies of the old forum for more detailed results.


Puzzles outside the [20, 31] range can be neglected in any statistical study bearing on the whole collection of minimal puzzles (e.g. the frequency of puzzles having some pattern).

Let pk be the proportion of puzzles with k clues, according to this estimate.


Practical applications

As of today, we are still unable to generate unbiased uncorrelated collections of puzzles.
The best we can generate remains uncorrelated collections with controlled bias (this is how the above results were obtained), and such generation remains very time-consuming (about 257,000 times longer than top-down generation).

However, do we need a controlled-bias collection to produce unbiased statistics? NO. The above results are valid once and for all, and they can be used as such.
Of course, the closer to unbiased the better, and controlled bias is the closest we currently have.
But suppose we have another uncorrelated collection, generated e.g. by a top-down generator (whose bias is much larger than the controlled bias) or a bottom-up one (whose bias is larger still).

Suppose we are interested in some random variable X, e.g.:
- X = 1 if the puzzle has an sk-loop, 0 otherwise
- X = 1 if the puzzle has a J-Exocet, 0 otherwise
- X = number of J-Exocets in the puzzle
- X = number of eliminations by J-Exocets in the puzzle after some set of other rules (e.g. SSTS) has been applied
and suppose we want to estimate the real mean of X.

If, instead of simply taking the mean for all the puzzles in the collection, we use a weighted mean (using the pk as weights), we get an estimate of the unbiased mean value of X for the minimal puzzles. More precisely, let E(Xk) be the mean value of X for the puzzles with k clues in the collection. Then the unbiased estimate of the mean E(X) of X is merely:

E(X) = p20*E(X20) + p21*E(X21) + ... + p32*E(X32)

Another useful quantity is the standard deviation sd(X) of X (it can be used as an informal measure of the precision of the estimate of E(X)). By the law of total variance, it is given by

sd(X)^2 = [p20*sd(X20)^2 + ... + p32*sd(X32)^2] + [p20*(E(X20)-E(X))^2 + ... + p32*(E(X32)-E(X))^2]

where the second, between-groups sum vanishes only when all the E(Xk) coincide.


This is not a panacea, especially if the largest values of X occur with puzzles having a small or a large number of clues.
These formulae also show that, when studying X, it is generally useless to analyse the full collection at our disposal. Analysing millions of puzzles with low-probability numbers of clues will not improve the results. It is more efficient to randomly extract a sub-collection.
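As a sketch, the weighted estimate can be computed as follows. The pk come from the table above; the per-clue-count means E(Xk) passed to the function are hypothetical inputs that a study would measure on its own collection:

```python
# Proportions pk (in %) of minimal puzzles per number of clues, from the table above.
pk = {20: 1.32e-7, 21: 0.000034, 22: 0.00348, 23: 0.148, 24: 2.28,
      25: 13.42, 26: 31.91, 27: 32.71, 28: 15.48, 29: 3.598,
      30: 0.41, 31: 0.0241, 32: 0.00102}

def unbiased_mean(mean_by_k):
    """Weighted mean E(X) = sum_k pk*E(Xk); mean_by_k maps #clues -> E(Xk)."""
    total = sum(pk.values())  # normalize: the percentages sum to ~100
    return sum(pk[k] * mean_by_k.get(k, 0.0) for k in pk) / total

# Sanity check: if X is identically 1, the weighted mean must be 1.
assert abs(unbiased_mean({k: 1.0 for k in pk}) - 1.0) < 1e-9
```

The same weighting applies to any per-clue-count statistic, so a biased but uncorrelated collection can still yield unbiased estimates.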


Concerning JExocets: in the "potential hardest" collection, almost all of them are found in puzzles with k = 22, 23 or 24 clues.
If this remained true for a top-down generated collection, this would be enough to entail that JExocets are present in at most 2% of the minimal puzzles.
(Indeed, my own estimates, based on other calculations, put it very close to 0%).
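As a quick check of that bound: summing the shares of 22-, 23- and 24-clue puzzles from the distribution above gives the maximum possible overall frequency, even under the extreme assumption that every such puzzle has a JExocet:

```python
# Shares (in %) of 22-, 23- and 24-clue minimal puzzles, from the table above.
p22, p23, p24 = 0.00348, 0.148, 2.28
upper_bound = p22 + p23 + p24  # every such puzzle having a JExocet is the worst case
print(f"{upper_bound:.2f} %")  # prints "2.43 %"
```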


References:

- "Unbiased Statistics of a CSP - A Controlled-Bias Generator", International Joint Conferences on Computer, Information, Systems Sciences and Engineering (CISSE 09), December 4-12, 2009, Springer. pdf preprint. Published as a chapter of the book Innovations in Computing Sciences and Software Engineering, Khaled Elleithy Editor, pp. 11-17, Springer, 2010, ISBN 97890481911133.
- "Constraint Resolution Theories", Lulu.com, Oct. 2011, ISBN 978-1-4478-6888-0
- "Pattern-Based Constraint Satisfaction", chapter 6, Lulu.com, Nov. 2012, ISBN 978-1-291-20339-4


[Edit 2013/06/20: added precision of the distribution and reference to p. 43]
Last edited by denis_berthier on Thu Jun 20, 2013 4:13 am, edited 2 times in total.
denis_berthier
2010 Supporter
 
Posts: 1253
Joined: 19 June 2007
Location: Paris

Re: The real distribution of minimal puzzles

Postby champagne » Mon Jun 03, 2013 7:04 am

denis_berthier wrote: Main result

The main result of this thread was, we now have a close estimate for the real distribution of minimal puzzles wrt to their number of clues (as a result, we also have a close estimate for the total number of minimal puzzles). Let me recall the distribution here for ease of use.


Code: Select all
#clues      %puzzles
20          0.0
21          0.000034
22          0.0034
23          0.149
24          2.28
25          13.42
26          31.94
27          32.74
28          15.48
29          3.56
30          0.41
31          0.022



if eleven's last generation is representative, the distribution in the grey area is completely different:
eleven found 93% of the puzzles in the 23-26 clue range for a rating of 8.6 and more
champagne
2017 Supporter
 
Posts: 5683
Joined: 02 August 2007
Location: France Brittany

Re: The real distribution of minimal puzzles

Postby denis_berthier » Mon Jun 03, 2013 7:36 am

champagne wrote:if eleven's last generation is representative, the distribution in the grey area is completely different
eleven found 93% of the puzzles in the 23-26 clue range for a rating of 8.6 and more


As I wrote at the start of my post, these stats can't be applied to any sub-collection, such as a "grey area" based on SER values.
Moreover, as I also wrote, collections generated by a top-down generator are known to be strongly biased.
denis_berthier
2010 Supporter
 
Posts: 1253
Joined: 19 June 2007
Location: Paris

Re: The real distribution of minimal puzzles

Postby champagne » Mon Jun 03, 2013 8:11 am

denis_berthier wrote:
champagne wrote:if eleven's last generation is representative, the distribution in the grey area is completely different
eleven found 93% of the puzzles in the 23-26 clue range for a rating of 8.6 and more


As I wrote at the start of my post, these stats can't be applied to any sub-collection, such as a "grey area" based on SER values.
Moreover, as I also wrote, collections generated by a top-down generator are known to be strongly biased.


right as a general frequency, but the frequency in the grey area has more to do with the collection of puzzles eligible to the grey area
champagne
2017 Supporter
 
Posts: 5683
Joined: 02 August 2007
Location: France Brittany

Re: The real distribution of minimal puzzles

Postby denis_berthier » Mon Jun 03, 2013 8:16 am

champagne wrote:right as a general frequency, but the frequency in the grey area has more to do with the collection of puzzles eligible to the grey area

Could you stop polluting every possible thread with your compulsive reactions, that are both out of topic and devoid of any content?
denis_berthier
2010 Supporter
 
Posts: 1253
Joined: 19 June 2007
Location: Paris

Re: The real distribution of minimal puzzles

Postby champagne » Mon Jun 03, 2013 8:18 am

denis_berthier wrote:
champagne wrote:right as a general frequency, but the frequency in the grey area has more to do with the collection of puzzles eligible to the grey area

Could you stop polluting every possible thread with your compulsive reactions devoid of any content?



Could you please tell me what is wrong in my sentence?

As far as I know, the frequency in the grey area is the ratio

puzzles of the grey collection having the property "x" / total number of puzzles in that collection
champagne
2017 Supporter
 
Posts: 5683
Joined: 02 August 2007
Location: France Brittany

Re: The real distribution of minimal puzzles

Postby denis_berthier » Mon Jun 03, 2013 8:31 am

champagne wrote:
denis_berthier wrote:
champagne wrote:the frequency in the grey area has more to do with the collection of puzzles eligible to the grey area

Could you stop polluting every possible thread with your compulsive reactions devoid of any content?

could you please tell me what is wrong in my sentence?


If you need explanations, here they are:

1) In standard English (but this can be translated into any natural language):
the sentence "the frequency in the grey area has more to do with the collection of puzzles eligible to the grey area"
is equivalent to
"the frequency in the grey area is more about puzzles in the grey area"
which would be a tautology without the word "more".

2) I clearly declared the grey area was OOT. That's the problem in general with compulsive posters like you: answering before they have understood what's being discussed. As a result, every thread becomes polluted with irrelevant posts.

To be clear: I'm not expecting any answer !!!!!
And I would be the happier if Jason deleted all the parasitic posts after this http://forum.enjoysudoku.com/post227502.html#p227502

(I mean both champagne and mine, including the present one)
denis_berthier
2010 Supporter
 
Posts: 1253
Joined: 19 June 2007
Location: Paris

Re: The real distribution of minimal puzzles

Postby David P Bird » Mon Jun 03, 2013 9:36 am

I'm interested in the practical value of these statistical analyses.

In the general population, a very high proportion of puzzles are solvable by simple methods, and these are of no great interest to contributors to the "Advanced Solving Techniques" sub-forum. What remains, therefore, is the set of "interesting" puzzles, which will have completely different statistics from the uninteresting ones.

In that set we would now like to know how frequently our different advanced techniques
a) can be applied (i.e. how frequently the required pattern will occur)
b) will be successful in producing a reduction in the puzzle

Using these measures we can then establish where in our hierarchy of methods they should be placed.

At the moment I can't see much point in precisely determining the frequency of these methods being applicable to the general population. We already know we won't be needing them for the uninteresting puzzles. Whether we should gain by using them when they are unnecessary as "sledge hammers to crack nuts" is of relatively minor importance.

However, I can see that the boundary between the interesting and non-interesting puzzles won't be fixed, and will depend on where in the hierarchy of methods we start classing our methods as advanced.

I'm not making this as an adversarial point, and I don't intend to engage in point-scoring arguments. I simply want to explore how a statistical analysis approach can be used to most benefit.
David P Bird
2010 Supporter
 
Posts: 960
Joined: 16 September 2008
Location: Middle England

Re: The real distribution of minimal puzzles

Postby denis_berthier » Mon Jun 03, 2013 10:00 am

Hi David,

The above recalled stats for the general population of minimal puzzles are not the end of the story.
I also provided detailed stats for the W rating (see the same references).
And I showed that including Subsets in the set of rules didn't change the stats significantly.

The fact that "non-interesting" puzzles are included in the above stats is not a problem. Depending on where you put the boundary, it could be easy to produce stats for the remaining "interesting" ones (e.g. if you discard only the puzzles solvable by Singles and Whips[1]). The only question is where you put the boundary.

I plainly agree with what you say: estimating the practical usefulness of a rule depends on where we put it in a hierarchy of rules. This would undoubtedly become very controversial as soon as we go beyond the following rules:
- Singles
- Whips[1], i.e. basic interactions

If such discussions should ever occur, I'd prefer to have them in another thread; they would clearly be OOT here.
denis_berthier
2010 Supporter
 
Posts: 1253
Joined: 19 June 2007
Location: Paris

The "dead clues" bias in the puzzle generation process

Postby dobrichev » Sat Jun 08, 2013 11:20 am

One of the side effects of the puzzle minimization algorithms discussed here is the possibility of generating all puzzles for a subgrid, or even (theoretically) for an entire solution grid.
The algorithm works on a fixed grid and accepts a set of forced non-givens. If the set of forced non-givens is empty, the entire grid is enumerated. For a non-empty set, the givens are minimized so that all minimal puzzles are generated.
The algorithm is based on the concept that every unavoidable set within the grid must be hit by at least one clue (Gary McGuire), and further on the concept that duplicate puzzles can be discarded by marking the cells that have already been processed within the given context as "dead clues". AFAIK the "dead clues" concept originates from this Ocean's post in 2006.
Since the algorithm is a depth-first search, it starts producing results almost immediately.
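As an illustration only (this is not dobrichev's gridchecker code), the "dead clues" technique can be sketched on the abstract hitting-set problem behind it: every unavoidable set must contain at least one given, and marking already-explored cells as dead in the remaining branches guarantees that no hitting set is generated twice. A minimality check then filters the output:

```python
def hitting_sets(unavoidable, chosen=frozenset(), dead=frozenset()):
    """Yield each hitting set of `unavoidable` (a list of frozensets of cells)
    exactly once, via the "dead clues" technique: once a cell's subtree has
    been fully enumerated, the cell is marked dead for the sibling branches."""
    unhit = [s for s in unavoidable if not (s & chosen)]
    if not unhit:
        yield chosen
        return
    dead_here = set(dead)
    for cell in sorted(unhit[0]):           # branch on the first unhit set
        if cell in dead_here:
            continue
        yield from hitting_sets(unavoidable, chosen | {cell}, frozenset(dead_here))
        dead_here.add(cell)                 # dead clue: never reused deeper/later

def is_minimal(hs, unavoidable):
    """A hitting set is minimal if removing any cell leaves some set unhit."""
    return all(any(not (s & (hs - {c})) for s in unavoidable) for c in hs)

# Toy instance: two overlapping unavoidable sets over cells 1, 2, 3.
unavoidable = [frozenset({1, 2}), frozenset({2, 3})]
minimal = {hs for hs in hitting_sets(unavoidable) if is_minimal(hs, unavoidable)}
print(sorted(map(sorted, minimal)))  # prints [[1, 3], [2]]
```

The non-duplication comes from the dead marks alone; the order in which cells are marked dead is exactly what dobrichev identifies below as the source of the bias in a truncated run.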

The exercise

I used my recent implementation of this algorithm in gridchecker.
I used two source grids: the one with the largest known number of hardest puzzles, and one of blue's "average" grids, which is average in another sense.
For the first grid I generated all minimal puzzles for some areas of forced non-givens. For this exercise I used the puzzles within 44 fixed givens (h44, i.e. 81-44 = 37 forced non-givens), and the starting part of the enumeration of all puzzles (h81, without any restriction except the given unique solution grid and minimality).
For the second grid I generated the starting part of all puzzles (am81).
Then I compared the distributions to those from Denis Berthier (DB).

Results

Code: Select all
     am81      h81    h44       am81%        h81%        h44%        DB%
20                          0.0000000   0.0000000   0.0000000   0.000000
21                          0.0000000   0.0000000   0.0000000   0.000034
22                          0.0000000   0.0000000   0.0000000   0.003400
23                      9   0.0000000   0.0000000   0.0030945   0.149000
24              74   1546   0.0000000   0.0021854   0.5315692   2.280000
25            5834  21715   0.0000000   0.1722955   7.4663815  13.420000
26     31   109274  71286   0.0112368   3.2271888  24.5106365  31.940000
27   2609   674117 115864   0.9457010  19.9086958  39.8381224  32.740000
28  36372  1358764  68328  13.1839930  40.1283740  23.4935720  15.480000
29 105191   939847  11389  38.1292591  27.7564993   3.9159392   3.560000
30  96402   259322    694  34.9434537   7.6585560   0.2386216   0.410000
31  31231    35700      6  11.3205017   1.0543280   0.0020630   0.022000
32   3757     2980          1.3618240   0.0880083   0.0000000   0.000000
33    273      130          0.0989561   0.0038393   0.0000000   0.000000
34     14        1          0.0050747   0.0000295   0.0000000   0.000000
   275880  3386043 290837 100.0000000 100.0000000 100.0000000 100.004434


[attachment: distrib.PNG - "The bias"]


Comments

The h44 sample is the closest to DB. The possible origins of the bias are the selection of an atypical grid and/or the selection of the forced non-givens. This sample differs from the others in that it is a complete enumeration and is therefore absolutely insensitive to the method of its generation.

The h81 sample has over 3 million puzzles, but they are generated at the very beginning of the process, where many of the branches in the search tree are in their first iteration. The implemented "dead clues" scheme first tries a cell and, after the whole enumeration of the subtree, marks the cell as "dead" and continues with the next one. So the possible duplicates are generated at the early stages and skipped at the later stages. This is where the bias comes from. Technically, the inverse implementation is possible, and then the duplicates would be skipped at the early stages and generated at the later ones.

am81 is a 10 times smaller sample, from a different grid, which confirms the bias.

The observed side effect could be used to generate from scratch minimal puzzles with a relatively high number of clues.
dobrichev
2016 Supporter
 
Posts: 1316
Joined: 24 May 2010

Re: The real distribution of minimal puzzles

Postby denis_berthier » Sat Jun 08, 2013 11:50 am

Hi dobrichev,

One main difference between the controlled-bias generator and your approach is that the puzzles you get are strongly correlated. They can't be used to compute statistics such as classifications or the frequency of some rule.
denis_berthier
2010 Supporter
 
Posts: 1253
Joined: 19 June 2007
Location: Paris

Re: The real distribution of minimal puzzles

Postby Red Ed » Sat Jun 08, 2013 3:44 pm

Strictly speaking, correlation doesn't kill the frequency estimates; it's (unknown) bias that does that. Subtree enumeration can work fine if you have an unbiased source of input grids, rather than just two as dobrichev has.
Red Ed
 
Posts: 633
Joined: 06 June 2005

Re: The real distribution of minimal puzzles

Postby denis_berthier » Sun Jun 09, 2013 3:55 am

Red Ed wrote:Strictly speaking, correlation doesn't kill the frequency estimates

As it kills the estimates for the standard deviation, it makes the frequency estimates of little use.
denis_berthier
2010 Supporter
 
Posts: 1253
Joined: 19 June 2007
Location: Paris

Re: The real distribution of minimal puzzles

Postby Red Ed » Sun Jun 09, 2013 6:50 am

denis_berthier wrote:As it kills the estimates for the standard deviation, it makes the frequency estimates of little use.
Not the case for subtree(#) enumeration, where the random variables are of the form f(S), S being a randomly-picked s-clue subgrid. Treating f(S) as a black box generator, its outputs are uncorrelated, and one can estimate the variance from the data. The fact that the internal workings of S (for each instance, a collection of c-clue puzzles) are strongly correlated doesn't matter.

Perhaps none of this matters to dobrichev anyway, but I just don't want to put him off exploring subtree-type methods.

(#) okay, subforest
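As a numerical sketch of this two-level scheme (the function f below is a toy stand-in for the real subtree aggregate; all parameters are illustrative): each randomly picked subgrid S contributes a single observation f(S), and the mean and its standard error are estimated from those top-level observations alone, so the correlation among the puzzles inside each subtree never enters the formulas:

```python
import random
import statistics

random.seed(1)

def f(S):
    """Toy stand-in for the subtree aggregate: given a 'subgrid' S (here just a
    random seed), return e.g. the fraction of its puzzles with some property.
    The puzzles inside one subtree may be strongly correlated; only the
    per-subtree aggregate f(S) is used as a data point."""
    rng = random.Random(S)
    n = rng.randint(50, 500)                       # puzzles in this subtree
    hits = sum(rng.random() < 0.1 for _ in range(n))  # true frequency ~0.1
    return hits / n

# Top-level observations are i.i.d. draws of f(S), so the usual estimators apply.
samples = [f(random.getrandbits(64)) for _ in range(200)]
mean = statistics.fmean(samples)
stderr = statistics.stdev(samples) / len(samples) ** 0.5
print(f"estimate = {mean:.3f} +/- {stderr:.3f}")
```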
Red Ed
 
Posts: 633
Joined: 06 June 2005
