Hi
Afmob,
First of all, congratulations for the interesting results about the extreme numbers of clues. As I noticed long ago, high values are out of reach of the controlled-bias generator. This may not be true of low values, if sufficiently many clues are eliminated from the start (this was the first optimisation I found, and it may be pursued further for small numbers of clues).
Afmob wrote:In contrast to my former computations I do not only use one small sample size of 100k grids but using Red Ed's unbiased generator some samples (100k for subset method, 10k for superset method) are produced with each sample being used 10 times. Afterwards, new samples are generated. Therefore the relative errors are slightly underestimated but only by a small margin.
hopefully "small", but not calculated.
1) Mean number of minimal puzzles per complete gridAfmob wrote:Subset results:
- Code: Select all
Computation time: 15x240 hours
601,942,000,000 samples
1,039,057,869 valid 28 hits
13,676,903,000 valid minimal puzzles
+----+------------+--------------+----------------+
| Cl | Count | E(nr/grid) | E(rel err)*100 |
+----+------------+--------------+----------------+
| 18 | 8 | 4.625e-01 | 3.536e+01 |
| 19 | 6092 | 2.219e+03 | 1.518e+00 |
| 20 | 1287655 | 3.231e+06 | 1.368e-01 |
| 21 | 67409525 | 1.290e+09 | 2.925e-02 |
| 22 | 984980363 | 1.615e+11 | 1.146e-02 |
| 23 | 4280735451 | 6.903e+12 | 6.447e-03 |
| 24 | 5689364460 | 1.064e+14 | 4.798e-03 |
| 25 | 2343546179 | 6.247e+14 | 5.106e-03 |
| 26 | 298326641 | 1.484e+15 | 9.373e-03 |
| 27 | 11149692 | 1.526e+15 | 3.568e-02 |
| 28 | 96934 | 7.163e+14 | 3.212e-01 |
+----+------------+--------------+----------------+
Supset results:
- Code: Select all
21,310,000,000 samples
1,199,199,372 minimal 26 hits
32,151,592 valid minimal puzzles
+----+------------+--------------+----------------+
| Cl | Count | E(nr/grid) | E(rel err)*100 |
+----+------------+--------------+----------------+
| 26 | 28055 | 1.491e+15 | 5.970e-01 |
| 27 | 775928 | 1.527e+15 | 1.540e-01 |
| 28 | 5131589 | 7.213e+14 | 7.545e-02 |
| 29 | 11419771 | 1.660e+14 | 5.831e-02 |
| 30 | 10035639 | 1.946e+13 | 6.439e-02 |
| 31 | 3929868 | 1.229e+12 | 9.756e-02 |
| 32 | 749034 | 4.391e+10 | 2.019e-01 |
| 33 | 77082 | 9.586e+08 | 5.705e-01 |
| 34 | 4476 | 1.310e+07 | 2.192e+00 |
| 35 | 149 | 1.121e+05 | 1.172e+01 |
| 36 | 1 | 2.090e+02 | 1.000e+02 |
+----+------------+--------------+----------------+
Taking the best of the superset/subset (which is which? you call them both subset) results for each number of clues, we get:
Total mean number of minimal puzzles per grid =
(+ 4.625e-01 2.219e+03 3.231e+06 1.290e+09 1.615e+11 6.903e+12 1.064e+14 6.247e+14 1.484e+15 1.526e+15 7.213e+14 1.660e+14 1.946e+13 1.229e+12 4.391e+10 9.586e+08 1.310e+07 1.121e+05 2.090e+02)
=
4.6562e+15Very close to my own estimate, 4 years ago (
4.6553e+15, p. 43 of the pdf here
http://forum.enjoysudoku.com/the-real-distribution-of-minimal-puzzles-t30127-7.html)
- Code: Select all
Error =
(+
(* 4.625e-01 3.536e+01)
(* 2.219e+03 3.536e+01)
(* 3.231e+06 1.368e-01)
(* 1.290e+09 2.925e-02)
(* 1.615e+11 1.146e-02)
(* 6.903e+12 6.447e-03)
(* 1.064e+14 4.798e-03)
(* 6.247e+14 5.106e-03)
(* 1.484e+15 9.373e-03)
(* 1.526e+15 3.568e-02)
(* 7.213e+14 7.545e-02)
(* 1.660e+14 5.831e-02)
(* 1.946e+13 6.439e-02)
(* 1.229e+12 9.756e-02)
(* 4.391e+10 2.019e-01)
(* 9.586e+08 5.705e-01)
(* 1.310e+07 2.192e+00)
(* 1.121e+05 1.172e+01)
(* 2.090e+02 1.000e+02)
)
= 137587748084193.0, which gives an interval [4.6548 4.6576]e+15
relative error: 0.0295 %
My estimate had relative error 0.065%
So, yours is apparently half of mine. But this is without counting the unknown error due to your relatively small sample of complete grids - I used all the complete grids, several times, and I had therefore no such error.
Your computation time: 15x240 hours = 5 months. If we divide by 4 (in order to allow a double sampling error), this gives 1.25 CPU-months.
I don't have exact figures for the time it took me to generate my 5.9 million uncorrelated puzzles with the controlled-bias generator (which, at the start, was not optimised - the last version was 200 times faster than the first). In any case, the count was in CPU-months, not in years. So, the benefit (if any, i.e. if not spoiled by sampling errors of the complete grids) is not orders of magnitude.
2) Estimating other random variablesNow, suppose you want to compute the mean SER of minimal puzzles, with the same relative precision as with the controlled-bias generator. You will have to use (at least) 1/4 of your puzzles, i.e. 13,676,903,000/4, i.e. more than 3 billion, when I need only about 6 million - 500 times less.
The same problem will appear every time you want to estimate the mean of any random variable.
(Indeed, these figures can be lowered, in both methods, by concentrating on the main part of the distribution. But this doesn't change the computation time ratio.)