First of all, congratulations on the interesting results about the extreme numbers of clues. As I noticed long ago, high clue counts are out of reach of the controlled-bias generator. This may not be true of low counts, if sufficiently many clues are eliminated from the start (this was the first optimisation I found, and it could be pushed further for small numbers of clues).

Afmob wrote:In contrast to my former computations I do not only use one small sample size of 100k grids but using Red Ed's unbiased generator some samples (100k for subset method, 10k for superset method) are produced with each sample being used 10 times. Afterwards, new samples are generated. Therefore the relative errors are slightly underestimated but only by a small margin.

Hopefully "small", but not calculated.

1) Mean number of minimal puzzles per complete grid

Afmob wrote:Subset results:

- Code: Select all
Computation time: 15x240 hours
601,942,000,000 samples
1,039,057,869 valid 28 hits
13,676,903,000 valid minimal puzzles

+----+------------+--------------+----------------+
| Cl |      Count |   E(nr/grid) | E(rel err)*100 |
+----+------------+--------------+----------------+
| 18 |          8 |    4.625e-01 |      3.536e+01 |
| 19 |       6092 |    2.219e+03 |      1.518e+00 |
| 20 |    1287655 |    3.231e+06 |      1.368e-01 |
| 21 |   67409525 |    1.290e+09 |      2.925e-02 |
| 22 |  984980363 |    1.615e+11 |      1.146e-02 |
| 23 | 4280735451 |    6.903e+12 |      6.447e-03 |
| 24 | 5689364460 |    1.064e+14 |      4.798e-03 |
| 25 | 2343546179 |    6.247e+14 |      5.106e-03 |
| 26 |  298326641 |    1.484e+15 |      9.373e-03 |
| 27 |   11149692 |    1.526e+15 |      3.568e-02 |
| 28 |      96934 |    7.163e+14 |      3.212e-01 |
+----+------------+--------------+----------------+

Superset results:

- Code: Select all
21,310,000,000 samples
1,199,199,372 minimal 26 hits
32,151,592 valid minimal puzzles

+----+------------+--------------+----------------+
| Cl |      Count |   E(nr/grid) | E(rel err)*100 |
+----+------------+--------------+----------------+
| 26 |      28055 |    1.491e+15 |      5.970e-01 |
| 27 |     775928 |    1.527e+15 |      1.540e-01 |
| 28 |    5131589 |    7.213e+14 |      7.545e-02 |
| 29 |   11419771 |    1.660e+14 |      5.831e-02 |
| 30 |   10035639 |    1.946e+13 |      6.439e-02 |
| 31 |    3929868 |    1.229e+12 |      9.756e-02 |
| 32 |     749034 |    4.391e+10 |      2.019e-01 |
| 33 |      77082 |    9.586e+08 |      5.705e-01 |
| 34 |       4476 |    1.310e+07 |      2.192e+00 |
| 35 |        149 |    1.121e+05 |      1.172e+01 |
| 36 |          1 |    2.090e+02 |      1.000e+02 |
+----+------------+--------------+----------------+

Taking the better of the subset and superset results for each number of clues (subset up to 27 clues, superset from 28 on, i.e. whichever has the smaller relative error), we get:

Total mean number of minimal puzzles per grid =

(+ 4.625e-01 2.219e+03 3.231e+06 1.290e+09 1.615e+11 6.903e+12 1.064e+14 6.247e+14 1.484e+15 1.526e+15 7.213e+14 1.660e+14 1.946e+13 1.229e+12 4.391e+10 9.586e+08 1.310e+07 1.121e+05 2.090e+02)

= 4.6562e+15
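As a cross-check, the sum can be recomputed with a short Python script (the per-clue values are copied from the tables above: subset estimates for 18-27 clues, superset estimates for 28-36):

```python
# Best per-clue estimates of E(nr of minimal puzzles / grid):
# subset method for 18-27 clues, superset method for 28-36.
best = {
    18: 4.625e-01, 19: 2.219e+03, 20: 3.231e+06, 21: 1.290e+09,
    22: 1.615e+11, 23: 6.903e+12, 24: 1.064e+14, 25: 6.247e+14,
    26: 1.484e+15, 27: 1.526e+15,                 # subset
    28: 7.213e+14, 29: 1.660e+14, 30: 1.946e+13, 31: 1.229e+12,
    32: 4.391e+10, 33: 9.586e+08, 34: 1.310e+07, 35: 1.121e+05,
    36: 2.090e+02,                                # superset
}
total = sum(best.values())
print(f"{total:.4e}")  # ~4.6562e+15
```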

Very close to my own estimate from 4 years ago (4.6553e+15, p. 43 of the pdf here: http://forum.enjoysudoku.com/the-real-distribution-of-minimal-puzzles-t30127-7.html).

- Code: Select all
Error*100 =
(+ (* 4.625e-01 3.536e+01)
   (* 2.219e+03 1.518e+00)
   (* 3.231e+06 1.368e-01)
   (* 1.290e+09 2.925e-02)
   (* 1.615e+11 1.146e-02)
   (* 6.903e+12 6.447e-03)
   (* 1.064e+14 4.798e-03)
   (* 6.247e+14 5.106e-03)
   (* 1.484e+15 9.373e-03)
   (* 1.526e+15 3.568e-02)
   (* 7.213e+14 7.545e-02)
   (* 1.660e+14 5.831e-02)
   (* 1.946e+13 6.439e-02)
   (* 1.229e+12 9.756e-02)
   (* 4.391e+10 2.019e-01)
   (* 9.586e+08 5.705e-01)
   (* 1.310e+07 2.192e+00)
   (* 1.121e+05 1.172e+01)
   (* 2.090e+02 1.000e+02))
≈ 1.3759e+14

Dividing by 100 (the relative errors in the tables are percentages), the absolute error is ≈ 1.376e+12, which gives the interval [4.6548, 4.6576]e+15

relative error: 0.0295 %
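This interval can also be reproduced with a short script (a sketch: each pair is the best per-clue estimate and its tabulated E(rel err)*100, for 18 to 36 clues; the division by 100 converts the tabulated percentages back to relative errors):

```python
# (estimate, E(rel err)*100) pairs, best method per clue count (18..36).
pairs = [
    (4.625e-01, 3.536e+01), (2.219e+03, 1.518e+00), (3.231e+06, 1.368e-01),
    (1.290e+09, 2.925e-02), (1.615e+11, 1.146e-02), (6.903e+12, 6.447e-03),
    (1.064e+14, 4.798e-03), (6.247e+14, 5.106e-03), (1.484e+15, 9.373e-03),
    (1.526e+15, 3.568e-02), (7.213e+14, 7.545e-02), (1.660e+14, 5.831e-02),
    (1.946e+13, 6.439e-02), (1.229e+12, 9.756e-02), (4.391e+10, 2.019e-01),
    (9.586e+08, 5.705e-01), (1.310e+07, 2.192e+00), (1.121e+05, 1.172e+01),
    (2.090e+02, 1.000e+02),
]
total = sum(v for v, _ in pairs)
abs_err = sum(v * r for v, r in pairs) / 100   # r is a percentage
rel_pct = 100 * abs_err / total
print(f"[{total - abs_err:.4e}, {total + abs_err:.4e}]")
print(f"relative error: {rel_pct:.4f} %")      # ~0.0295 %
```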

My estimate had relative error 0.065%

So, yours is apparently half of mine. But this is without counting the unknown error due to your relatively small sample of complete grids - I used all the complete grids, several times, and therefore had no such error.

Your computation time: 15x240 hours = 3,600 CPU-hours, i.e. about 5 CPU-months. If we divide by 4 (in order to allow a doubled sampling error), this gives 1.25 CPU-months.
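The division by 4 rests on the standard Monte-Carlo scaling: the relative error of a sampled mean goes as 1/sqrt(sample size), hence as 1/sqrt(CPU time), so a quarter of the time means double the error. A sketch of the arithmetic (assuming 30-day CPU-months):

```python
import math

cpu_hours = 15 * 240                  # 3600 CPU-hours
cpu_months = cpu_hours / (30 * 24)    # about 5 CPU-months
reduced = cpu_months / 4              # 1.25 CPU-months

# Quartering the sample size doubles the relative error:
doubled_err = 0.0295 * math.sqrt(4)   # ~0.059 %, comparable to 0.065 %
print(cpu_months, reduced, doubled_err)
```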

I don't have exact figures for the time it took to generate my 5.9 million uncorrelated puzzles with the controlled-bias generator (which was not optimised at the start - the last version was 200 times faster than the first). In any case, the count was in CPU-months, not in CPU-years. So the benefit, if any (i.e. if not spoiled by the sampling errors of the complete grids), is not orders of magnitude.

2) Estimating other random variables

Now, suppose you want to compute the mean SER of minimal puzzles with the same relative precision as the controlled-bias generator. You will have to use (at least) 1/4 of your puzzles, i.e. 13,676,903,000/4, i.e. more than 3 billion, when I needed only about 6 million - roughly 500 times fewer.
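A rough check of that ratio (5.9 million is the controlled-bias sample size quoted above):

```python
# Sample-size comparison: subset-method puzzles needed vs controlled-bias.
subset_puzzles = 13_676_903_000
needed = subset_puzzles / 4         # a quarter of the puzzles, > 3 billion
cbg_puzzles = 5_900_000             # controlled-bias generator sample
ratio = needed / cbg_puzzles
print(f"{ratio:.0f}")               # roughly 500-600 times more
```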

The same problem will appear every time you want to estimate the mean of any random variable.

(Indeed, these figures can be lowered in both methods by concentrating on the main part of the distribution. But this doesn't change the ratio of computation times.)