How to compute unbiased statistics

Considering some recent discussions about unbiased statistics in the "grey zone" and the "JExocet" threads, it seems that some of the results reached in this old lost thread and some of their obvious consequences have been forgotten.

For a summary of the essential results of this thread, see the references below.

Note: the following applies to the whole collection of minimal puzzles. It can obviously not be extended to a restricted collection defined by particular properties, such as having a rating larger than some predefined value (wrt to some predefined rating) or having some predefined pattern.

Main result

The main result of this thread was, we now have a close estimate for the real distribution of minimal puzzles wrt to their number of clues (as a result, we also have a close estimate for the total number of minimal puzzles). Let me recall the distribution here for ease of use.

- Code: Select all
`#clues %puzzles`

20 1.32E-7

21 0.000034

22 0.00348

23 0.148

24 2.28

25 13.42

26 31.91

27 32.71

28 15.48

29 3.598

30 0.41

31 0.0241

32 0.00102

these results are valid with overall relative precision 0.065%

see page 43 of the pdf copies of the old forum for more detailed results

Puzzles outside the [20, 31] range can be neglected in any statistical study bearing on the whole collection of minimal puzzles (e.g. the frequency of puzzles having some pattern).

Let pk be the proportion of puzzles with k clues, according to this estimate.

Practical applications

As of today, we are still unable to generate unbiased uncorrelated collections of puzzles.

The best we can generate remains uncorrelated collections with controlled-bias (this is how the above results were obtained), and such generation remains very time consuming (about 257,000 times longer than top-down generation).

However, do we need a controlled-bias collection to produce unbiased statistics? NO. The above results are valid once and for all and they can be used as such.

Of course, the closer to unbiased the better and controlled-bias is the closest we currently have.

But suppose we have another uncorrelated collection, generated e.g. by a top-down generator (whose bias is much larger than the controlled-bias) or a bottom-up one (whose bias is still much larger).

Suppose we are interested in some random variable X, e.g.:

- X = 1 if the puzzle has an sk-loop, 0 otherwise

- X = 1 if the puzzle has a J-Exocet, 0 otherwise

- X = number of J-Exocets in the puzzle

- X = number of eliminations by J-Exocets in the puzzle after some set of other rules (e.g. SSTS) has been applied

and suppose we want to estimate the real mean of X.

If, instead of simply taking the mean for all the puzzles in the collection, we use a weighted mean (using the pk as weights), we get an estimate of the unbiased mean value of X for the minimal puzzles. More precisely, let E(Xk) be the mean value of X for the puzzles with k clues in the collection. Then the unbiased estimate of the mean E(X) of X is merely:

E(X) = E(X1)*p1 + E(X2)*p2 + ...

Another useful quantity is the standard deviation sd(X) of X (it can be used as an informal measure of the precision of the estimate of E(X)). It is given by

sd(X)^2 = sd(X1)^2*p1 + sd(X2)^2*p2 + ...

This is not a panacea, especially if the largest values of X occur with puzzles having a small or a large number of clues.

These formulae also show that, when studying X, it is generally useless to analyze the full collection at disposal. Analysing millions of puzzles with low probability numbers of clues will not improve the results. It is more efficient to randomly extract a sub-collection.

Concerning JExocets, for the puzzles in the "potential hardest" collection, almost all of them are found for puzzles with k = 22, 23, 24 clues.

If this remained true for a top-down generated collection, this would be enough to entail that JExocets are present in at most 2% of the minimal puzzles.

(Indeed, my own estimates, based on other calculations, put it very close to 0%).

References:

- "Unbiased Statistics of a CSP - A Controlled-Bias Generator", International Joint Conferences on Computer, Information, Systems Sciences and Engineering (CISSE 09), December 4-12, 2009, Springer. pdf preprint. Published as a chapter of the book Innovations in Computing Sciences and Software Engineering, Khaled Elleithy Editor, pp. 11-17, Springer, 2010, ISBN 97890481911133.

- "Constraint Resolution Theories", Lulu.com, Oct. 2011, ISBN : 978-1-4478-6888-0

- "Pattern-Based Constraint Satisfaction", chapter 6, Lulu.com, Nov. 2012, ISBN 978-1-291-20339-4

[Edit 2013/06/20: added precision of the distribution and reference to p. 43]