## The real distribution of minimal puzzles

Everything about Sudoku that doesn't fit in one of the other sections

### Re: The real distribution of minimal puzzles

Red Ed wrote:I wonder if dobrichev, whose brief foray into subtree methods sparked this exchange, is feeling any better informed now?

Hi,
All the methods for estimation of all minimals are based on as larger as possible sample and incorporate weighting up to some predefined degree of details. Nothing new here, although claiming ownership on centuries old mathematical methods sounds a bit strange.

Playing with distribution of distributions is a little confusing to me. Determining the distribution of, say, 25-clue minimals over several samples is targeting the mean/median/average/whatever that finally will be included in the final distribution of the n-clue puzzles. We want the the individual n-clue probabilities to sum to 1 and I am not sure for the right way this should be done.

A typical usage of the general distribution - this over the whole space - is when you use it as a reference to underline something special = something non-average. Nevertheless it seems I posted in wrong topic, using the "real" distribution as reference. Sorry, Denis.

I hate dividing something close to zero by other thing close to zero. This limits the applicability of the general distribution. A possible workaround is doing local investigations targeting the proportion N(k) / N(k+1) and using the subtree as the only option. I have no idea if these local observations are reusable outside the same local context.

We know UA sets are relatively well defined abstraction which determines the grid and respectively its puzzles. We know that every solution grid has 6.67e21 (non-minimal) unavoidable sets. We know that for different grids the ratio between unaviodable rectangles and all UA is between 0 and 36 / 6.67e21. Aren't rectangles extremely rare? If so, why we don't just ignore them? Because of the significance of the weighting, and no subjections here. Denis?

I hope some day some will approximate this distribution by product of the yet unknown uniques and minimality functions.

Any low-clue puzzle kills the minimality of all its supersets which is huge number. Can we solve the opposite problem - estimate the low-clue (or high-clue) fertility of a given subgrid by processing the counts of the easily obtainable moderate-clue minimal puzzles?

For the hard puzzles hunters, below is my first observation related to the "real" distribution.
Hidden Text: Show
Take the grid with most puzzles (294) in the recent version of champagne's the hardest collection.
123456789457189236689372154268794315391265478574813692732948561815627943946531827
The union of all givens within all puzzles results in this 40-clue non-minimal puzzle.
1.3.5678..5..8..366..3.2....6..9431539..65..85.4..369..3...8.6.8.56....39.6.3.8..
Minimizing results in 8144 puzzles with the following rating
Code: Select all
`     64 11.1/1.2/1.2      7 11.0/1.2/1.2    172 10.9/1.2/1.2    408 10.8/1.2/1.2====================    651 above the collection treshold     18 10.7/1.2/1.2    154 10.6/1.2/1.2    366 10.5/1.2/1.2    848 10.4/1.2/1.2   1513 10.3/1.2/1.2    129 10.2/1.2/1.2     94  9.8/1.2/1.2    157  9.7/1.2/1.2    120  9.6/1.2/1.2     14  9.5/1.2/1.2    593  9.3/1.2/1.2   2160  9.2/1.2/1.2    305  9.1/1.2/1.2   1022  9.0/1.2/1.2====================   8144 total`

There is anomaly in the number of clues distribution of these puzzles - 26, 27, and 28-clue minimals have equal frequency of ~25%. The anomaly disappears when adding one clue or removing one. By adding more clues, the distribution asymptotically tends to the "real" one, suggesting that the origin of the anomaly is the region but not the whole grid.
Below is the distribution for 39, 40, and 41-clue non-minimal, along with the "real" DB distribution.
Code: Select all
`        39      40      41     DB22  0.0000  0.0000  0.0000  0.00323  0.0808  0.0368  0.0108  0.14924  3.2866  2.1365  1.1535  2.28025 22.4676 17.7922 10.8472 13.42026 23.8685 25.2087 25.8243 31.94027 31.7887 25.4174 36.5736 32.74028 17.6454 25.4297 22.7092 15.48029  0.8620  3.7573  2.7750  3.56030  0.0000  0.2210  0.1061  0.41031  0.0000  0.0000  0.0000  0.022`

And yes, I am feeling better informed now, even if the exchange only repeats or reformulates already discussed matter. Thank you guys.
dobrichev
2016 Supporter

Posts: 1621
Joined: 24 May 2010

### Re: The real distribution of minimal puzzles

dobrichev wrote:All the methods for estimation of all minimals are based on as larger as possible sample and incorporate weighting up to some predefined degree of details. Nothing new here, although claiming ownership on centuries old mathematical methods sounds a bit strange.

As you are so well informed about centuries old mathematical methods, I would be very interested if you could give any mathematical reference of how to generate uncorrelated unbiased (or with known bias) samples?

dobrichev wrote:A typical usage of the general distribution - this over the whole space - is when you use it as a reference to underline something special = something non-average.

Of course, you can use a distribution to say that some situation is not typical, but this is not typical usage. Typical usage is to use the distribution where it is meaningful, i.e. where the weight is. The tail of the distribution is relevant to what is called "statistics of extremes".
In the past years, this forum has concentrated on generating the "hardest" puzzles that no real player will ever be able to solve, even using the exotic rules - in spite of champagne's constant propaganda to the contrary. Indeed, this forum has become a second programmer's forum because no one can really study these puzzles/rules manually. I don't mean they are not interesting, I just don't accept to call this typical. As you can see from the discussion, about 10 people in the world are interested in these topics.

dobrichev wrote:I hate dividing something close to zero by other thing close to zero. This limits the applicability of the general distribution. A possible workaround is doing local investigations targeting the proportion N(k) / N(k+1) and using the subtree as the only option. I have no idea if these local observations are reusable outside the same local context.

If you use the full subtree of minimals generated from a complete grid, my P(k)/P(k+1) formula remains valid; but not if you use the subtree generated from a non-complete grid.

dobrichev wrote:We know UA sets are relatively well defined abstraction which determines the grid and respectively its puzzles. We know that every solution grid has 6.67e21 (non-minimal) unavoidable sets. We know that for different grids the ratio between unaviodable rectangles and all UA is between 0 and 36 / 6.67e21. Aren't rectangles extremely rare? If so, why we don't just ignore them? Because of the significance of the weighting, and no subjections here. Denis?

My interest in Sudoku (and other logic puzzles) is in pattern-based resolution. The only goal of my short incursion in generation and statistics was to compute the distribution of the W rating. For this I had to define a new kind of generator (because Red Ed had suggested that our current generators might be biased), but this was not my main goal. As for UA sets, you're welcome to study them, but this is not my personal cup of tea.

dobrichev wrote:For the hard puzzles hunters, below is my first observation related to the "real" distribution.
Take the grid with most puzzles (294) in the recent version of champagne's the hardest collection.
123456789457189236689372154268794315391265478574813692732948561815627943946531827
[...]
There is anomaly in the number of clues distribution of these puzzles

There's no anomaly at all. The distribution of a random variable on a subset of the probability space is rarely the same as its distribution on the whole set.
denis_berthier
2010 Supporter

Posts: 1261
Joined: 19 June 2007
Location: Paris

### Re: The real distribution of minimal puzzles

Ilf and Petrov wrote:There were so many hairdressing establishments and funeral homes in the regional centre of N. that the inhabitants seemed to be born merely in order to have a shave, get their hair cut, freshen up their heads with toilet water and then die. In actual fact, people came into the world, shaved, and died rather rarely in the regional centre of N. Life in N. was extremely quiet. The spring evenings were delightful, the mud glistened like anthracite in the light of the moon, and all the young men of the town were so much in love with the secretary of the communal-service workers' local committee that she found difficulty in collecting their subscriptions.
dobrichev
2016 Supporter

Posts: 1621
Joined: 24 May 2010

### Re: The real distribution of minimal puzzles

Comparing the controlled-bias generator with the subset/superset method

For completeness reasons, as Red Ed's subset/superset method has been evoked earlier in this thread, and now that, thanks to Afmob's and Blue's calculations, we have detailed results for puzzles generated by it, I think necessary to report here the "efficiency" of that method when it comes to estimating the mean or the distribution of a random variable X of interest in Sudoku solving (X=SER, X=W rating, X="solvable by SSTS", X="has an sk-loop" ...).

Due to correlation of the puzzles generated by subset/superset, for the same relative error, the subset/superset method would require the computation of X for about 500 times more puzzles than with the controlled-bias generator
(see section 2 of this post http://forum.enjoysudoku.com/counting-minimal-puzzles-subsets-supersets-etc-t31190-147.html for details). This makes it totally unfit for such purposes*.

(*) but, if one is only interested in the number of clues of puzzles or the number of minimal puzzles with n clues, it is interesting for n "far" from the mean value.
denis_berthier
2010 Supporter

Posts: 1261
Joined: 19 June 2007
Location: Paris

Previous