## The real distribution of minimal puzzles


### Re: The real distribution of minimal puzzles

I'm interested in the practical value of these statistical analyses.

In the general population a very high proportion of puzzles are solvable by simple methods, and these are of no great interest to contributors to the "Advanced Solving Techniques" sub-forum. What remains, therefore, is the set of "interesting" puzzles, which will have completely different statistics from the uninteresting ones.

In that set we would now like to know how often our different advanced techniques
a) Can be applied (i.e. how frequently the required pattern occurs)
b) Will be successful in producing a reduction in the puzzle

Using these measures we can then establish where in our hierarchy of methods they should be placed.

At the moment I can't see much point in precisely determining the frequency of these methods being applicable to the general population. We already know we won't be needing them for the uninteresting puzzles. Whether we would gain by using them when they are unnecessary, as sledgehammers to crack nuts, is of relatively minor importance.

However, I can see that the boundary between the interesting and non-interesting puzzles won't be fixed; it will depend on where in the hierarchy of methods we start classing our methods as advanced.

I'm not making this as an adversarial point, and I don't intend to engage in point-scoring arguments. I simply want to explore how a statistical analysis approach can be used to most benefit.
David P Bird
2010 Supporter

Posts: 1040
Joined: 16 September 2008
Location: Middle England

### Re: The real distribution of minimal puzzles

Hi David,

The stats recalled above for the general population of minimal puzzles are not the end of the story.
I also provided detailed stats for the W rating (see the same references).
And I showed that including Subsets in the set of rules didn't change the stats significantly.

The fact that "non-interesting" puzzles are included in the above stats is not a problem. Depending on where you put the boundary, it could be easy to produce stats for the remaining "interesting" ones (e.g. if you discard only the puzzles solvable by Singles and Whips[1]). The only question is where you put the boundary.

I plainly agree with what you say: estimating the practical usefulness of a rule depends on where we put it in a hierarchy of rules. This would undoubtedly become very controversial as soon as we go beyond the following rules:
- Singles
- Whips[1], i.e. basic interactions

If such discussions should ever occur, I'd prefer to have them in another thread; they would clearly be OOT here.
denis_berthier
2010 Supporter

Posts: 1262
Joined: 19 June 2007
Location: Paris

### The "dead clues" bias in puzzle generation process

One of the side effects of the puzzle minimization algorithms discussed here is the possibility of generating all puzzles for a subgrid, or even (theoretically) for an entire solution grid.
The algorithm works on a fixed grid and accepts a set of forced non-givens. If the set of forced non-givens is empty, the entire grid is enumerated. For non-empty non-givens, the givens are minimized so that all minimal puzzles are generated.
The algorithm is based on the concept that every unavoidable set within the grid must be hit by at least one clue (Gary McGuire), and further on the concept that duplicate puzzles can be discarded by marking the cells that have already been processed within the given context as "dead clues". AFAIK the "dead clues" concept originates from this post by Ocean in 2006.
Since the algorithm is a depth-first search, it starts producing results almost immediately.
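The dead-clues idea can be illustrated on an abstract hitting-set problem. This is only a toy sketch, not gridchecker's actual implementation: the function name, the tiny cell universe and the unavoidable sets below are all made up for illustration. A cell is marked "dead" only after its whole subtree has been enumerated, so no clue set is ever produced twice; minimality of the resulting puzzles is a separate test in the real algorithm.

```python
def enumerate_hitting_sets(unavoidable_sets, chosen=frozenset(), dead=frozenset()):
    """Depth-first enumeration of clue sets hitting every unavoidable set,
    with "dead clue" marking so no set is ever produced twice."""
    # Find an unavoidable set not yet hit by the chosen clues.
    unhit = next((u for u in unavoidable_sets if not (u & chosen)), None)
    if unhit is None:
        yield chosen                      # every unavoidable set is hit
        return
    local_dead = set(dead)
    for cell in sorted(unhit):
        if cell in local_dead:
            continue                      # subtree exhausted in a sibling branch
        yield from enumerate_hitting_sets(
            unavoidable_sets, chosen | {cell}, frozenset(local_dead))
        local_dead.add(cell)              # mark "dead" only after full enumeration

# Toy data: three "unavoidable sets" over cells 0..4 (not from a real grid).
toy_sets = [frozenset({0, 1}), frozenset({1, 2}), frozenset({3, 4})]
solutions = list(enumerate_hitting_sets(toy_sets))
```

Each candidate is generated exactly once; reversing the try-then-mark order, as mentioned below, would move the duplicates' suppression from the late branches to the early ones.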

The exercise

I used my recent implementation of this algorithm in gridchecker.
I used two source grids: the one with the largest known number of hardest puzzles, and one of blue's "average" grids, which is average in another sense.
For the first grid I generated all minimal puzzles for some areas of forced non-givens. For this exercise I used the puzzles within 44 fixed givens (h44, i.e. 81-44=37 forced non-givens), and the starting part of the enumeration of all puzzles (h81, with no restrictions except the given unique solution grid and minimality).
For the second grid I generated the starting part of the enumeration of all puzzles (am81).
Then I compared the distributions to those from Denis Berthier (DB).

Results

```
clues    am81      h81     h44        am81%        h81%        h44%         DB%
 20                                0.0000000   0.0000000   0.0000000    0.000000
 21                                0.0000000   0.0000000   0.0000000    0.000034
 22                                0.0000000   0.0000000   0.0000000    0.003400
 23                          9    0.0000000   0.0000000   0.0030945    0.149000
 24                  74   1546    0.0000000   0.0021854   0.5315692    2.280000
 25                5834  21715    0.0000000   0.1722955   7.4663815   13.420000
 26        31   109274  71286     0.0112368   3.2271888  24.5106365   31.940000
 27      2609   674117 115864     0.9457010  19.9086958  39.8381224   32.740000
 28     36372  1358764  68328    13.1839930  40.1283740  23.4935720   15.480000
 29    105191   939847  11389    38.1292591  27.7564993   3.9159392    3.560000
 30     96402   259322    694    34.9434537   7.6585560   0.2386216    0.410000
 31     31231    35700      6    11.3205017   1.0543280   0.0020630    0.022000
 32      3757     2980            1.3618240   0.0880083   0.0000000    0.000000
 33       273      130            0.0989561   0.0038393   0.0000000    0.000000
 34        14        1            0.0050747   0.0000295   0.0000000    0.000000
       275880  3386043 290837  100.0000000 100.0000000 100.0000000  100.004434
```

The bias
[attachment: distrib.PNG, a plot of the distributions above]

The h44 sample is the closest to DB. The possible origins of its bias are the selection of an atypical grid and/or the selection of the forced non-givens. This sample differs from the others in that it is a complete enumeration, and it is therefore absolutely insensitive to the method of its generation.

The h81 sample has over 3 million puzzles, but they are generated at the very beginning of the process, where many of the branches in the search tree are in their first iteration. The implemented "dead clues" scheme first tries a cell and, only after the whole subtree has been enumerated, marks the cell as "dead" and continues with the next. So the possible duplicates are generated at the early stages and skipped at the later stages. This is where the bias comes from. Technically, the inverse implementation is possible, and then the duplicates would be skipped at the first stages and generated at the later ones.

The am81 sample is about ten times smaller and comes from a different grid; it confirms the bias.

The observed side effect could be used to generate, from scratch, minimal puzzles with a relatively high number of clues.
dobrichev
2016 Supporter

Posts: 1642
Joined: 24 May 2010

### Re: The real distribution of minimal puzzles

Hi dobrichev,

One main difference between the controlled-bias generator and your approach is that the puzzles you get are strongly correlated. They can't be used to compute statistics such as a classification or the frequency of some rule.
denis_berthier
2010 Supporter

Posts: 1262
Joined: 19 June 2007
Location: Paris

### Re: The real distribution of minimal puzzles

Strictly speaking, correlation doesn't kill the frequency estimates; it's (unknown) bias that does. Subtree enumeration can work fine if you have an unbiased source of input grids, rather than just the two that dobrichev has.
Red Ed

Posts: 633
Joined: 06 June 2005

### Re: The real distribution of minimal puzzles

Red Ed wrote:Strictly speaking, correlation doesn't kill the frequency estimates

As it kills the estimates for the standard deviation, it makes the frequency estimates of little use.
denis_berthier
2010 Supporter

Posts: 1262
Joined: 19 June 2007
Location: Paris

### Re: The real distribution of minimal puzzles

denis_berthier wrote:As it kills the estimates for the standard deviation, it makes the frequency estimates of little use.
Not the case for subtree(#) enumeration, where the random variables are of the form f(S), S being a randomly-picked s-clue subgrid. Treating f(S) as a black box generator, its outputs are uncorrelated, and one can estimate the variance from the data. The fact that the internal workings of S (for each instance, a collection of c-clue puzzles) are strongly correlated doesn't matter.

Perhaps none of this matters to dobrichev anyway, but I just don't want to put him off exploring subtree-type methods.

(#) okay, subforest
Red Ed

Posts: 633
Joined: 06 June 2005

### Re: The real distribution of minimal puzzles

Red Ed wrote:where the random variables are of the form f(S), S being a randomly-picked s-clue subgrid.

I'm not saying that dobrichev's method may not be interesting for some purposes.
But, do you have any example of such a random variable that could be of any practical use in estimating the frequency of a real resolution rule in the set of minimal puzzles?
denis_berthier
2010 Supporter

Posts: 1262
Joined: 19 June 2007
Location: Paris

### Re: The real distribution of minimal puzzles

denis_berthier wrote:do you have any example of such a random variable that could be of any practical use in estimating the frequency of a real resolution rule in the set of minimal puzzles?
I don't really follow the resolution rules discussion, so can't answer directly. Generically, though, if you want to estimate the sum over c-clue minimal puzzles of R(puzzle), where R is any function (e.g. counting resolution rules), then, with S as described previously, one would compute

f(S) = sum_{ P is a c-clue minimal puzzle in S }( R(P) * ( (s-c)! 81! ) / ( (81-c)! s! ) )
which is what I used for my estimates of the number of c-clue minimal puzzles (where, simply, R(puzzle) = 1). (EDIT: typo corrected.)
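A quick sanity check on the weight, sketched under the interpretation that (s-c)!·81!/((81-c)!·s!) equals C(81,s)/C(81-c,s-c), the reciprocal of the probability that a uniformly random s-clue subgrid contains a fixed c-clue puzzle (which is what makes f(S) unbiased). The function names are mine, not from the post.

```python
from math import comb, factorial

def weight(s, c, n=81):
    """Inverse inclusion probability of a fixed c-clue puzzle
    in a uniformly random s-clue subgrid: C(n,s) / C(n-c, s-c)."""
    return comb(n, s) / comb(n - c, s - c)

def weight_via_factorials(s, c, n=81):
    """The same quantity written as in the post: (s-c)! n! / ((n-c)! s!)."""
    return factorial(s - c) * factorial(n) / (factorial(n - c) * factorial(s))
```

With s = 81 the subgrid is the whole grid, every puzzle is contained, and the weight collapses to 1, as it should.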
Last edited by Red Ed on Sun Jun 09, 2013 5:08 pm, edited 1 time in total.
Red Ed

Posts: 633
Joined: 06 June 2005

### Re: The real distribution of minimal puzzles

The question isn't estimating the frequency, but estimating its standard deviation.
denis_berthier
2010 Supporter

Posts: 1262
Joined: 19 June 2007
Location: Paris

### Re: The real distribution of minimal puzzles

denis_berthier wrote:The question isn't estimating the frequency, but estimating its standard deviation.
That was answered previously: you would do it empirically, based on the observed values of the random variable. I don't have the reference to hand, but presumably you are doing the same when you quote standard deviations for your sudoku statistics.
Red Ed

Posts: 633
Joined: 06 June 2005

### Re: The real distribution of minimal puzzles

Red Ed wrote:you would do it empirically, based on the observed values of the random variable [...] presumably you are doing the same when you quote standard deviations for your sudoku statistics.

No: thanks to non correlation of the puzzles produced by the controlled-biased generator, the formula for the standard deviation estimate I wrote above is exact.
denis_berthier
2010 Supporter

Posts: 1262
Joined: 19 June 2007
Location: Paris

### Re: The real distribution of minimal puzzles

denis_berthier wrote:No: thanks to non correlation of the puzzles produced by the controlled-biased generator, the formula for the standard deviation estimate I wrote above is exact.
Your individual sd(Xk) terms are estimated from the data, thus so is sd(X). That's good, ordinary statistics. It works perfectly well and so, for the same reason, does the estimation of statistics of my f(S). Circling back round, let me reiterate: subtree enumeration works just fine for the minimal puzzle statistics.
Red Ed

Posts: 633
Joined: 06 June 2005

### Re: The real distribution of minimal puzzles

Red Ed wrote:Circling back round, let me reiterate: subtree enumeration works just fine for the minimal puzzle statistics.

You can reiterate as long as you want, but you don't know how the standard deviation varies with the number of puzzles generated in one subtree.
denis_berthier
2010 Supporter

Posts: 1262
Joined: 19 June 2007
Location: Paris

### Re: The real distribution of minimal puzzles

denis_berthier wrote:you don't know how the standard deviation varies with the number of puzzles generated in one subtree.
I think we need to take a step back here. The random variable f(S) is an unbiased estimator for a thing we'd like to count (puzzles, resolution rules, whatever). We'd like to know the mean and variance of f(S), so we do that empirically by gathering a load of sample data and computing the sample mean and variance. That is the statistically correct thing to do, irrespective of what might be going on "internal" to S.

I dare say you don't know how the standard deviation of SER varies with the number of U4s in the solution grid (to pick an arbitrary factor that might influence your Xk variables), but that's "internal" to Xk and has no bearing on the correctness of your calculations based on observed values of Xk.
Red Ed

Posts: 633
Joined: 06 June 2005
