The New Sudoku Players' Forum

Website · by **denis_berthier** » Sun Jun 09, 2013 1:10 pm

Red Ed wrote:
denis_berthier wrote:you don't know how the standard deviation varies with the number of puzzles generated in one subtree.
I think we need to take a step back here. The random variable f(S) is an unbiased estimator for a thing we'd like to count (puzzles, resolution rules, whatever). We'd like to know the mean and variance of f(S), so we do that empirically by gathering a load of sample data and computing the sample mean and variance. That is the statistically correct thing to do, irrespective of what might be going on "internal" to S.

I dare say you don't know how the standard deviation of SER various with the number of U4s in the solution grid (to pick an arbitrary factor that might influence your Xk variables), but that's "internal" to Xk and has no bearing on the correctness of your calculations based on observed values of Xk.

Actually, we should take more steps back. Referring to the summary post I wrote for practical purposes: http://forum.enjoysudoku.com/the-real-distribution-of-minimal-puzzles-t30127-8.html, where are you heading ?

by **Red Ed** » Sun Jun 09, 2013 1:44 pm

I wished only to make clear that subtree enumeration methods were a valid approach to these sorts of estimation problems, correcting what appeared to be a message to the contrary. I was heading only as far down that road as I felt necessary.

Website · by **denis_berthier** » Sun Jun 09, 2013 2:05 pm

In this case, it'd have been simpler to point to a self-contained reference where the method is defined and your claims are proven.
In any case, I don't see the relation with my summary post.

by **Red Ed** » Sun Jun 09, 2013 3:42 pm

If you need a proof of any part of my response to dobrichev and, as you challenged my response, to you, then I would be happy to oblige with a short post on this thread. And the relevance to your summary post? There needn't be any, as I was responding to a different post entirely, but nonetheless: the relevant point is that there is more than one way to skin a cat; suexg-cb isn't, and never has been, the only game in town.

btw, when I was in B&Q an hour ago, Blondie's Denis was playing on the radio. I had a little chuckle to myself

Website · by **denis_berthier** » Sun Jun 09, 2013 4:12 pm

My summary post was about how to use the unbiased distribution of clues, not about how it was obtained.

As for your method, if, after all this time, you don't have a self-contained reference where it is fully defined and your claims are proven, this is the end of this discussion for me.

by **blue** » Mon Jun 10, 2013 4:51 am

Any proof here, is too trivial to bother with.

The formulas below, are exact.
Convince yourself that the first one is correct -- use the legend (and the scrollbar).
The rest follow from simple algebra, the definitions, and basic statistics.

subtree_enumeration.jpg

To get an n(c) estimate using K samples, replace m_avg(s,c) by the sample average.
To get an error estimate, replace m_std_dev(s,c) by the standard deviation for the sample.
Multiply the result by 1.96 to get the half width for a 95% confidence interval.

If you want to estimate something like the total number of c-clue puzzles that are solvable
by singles, then you use the same procedure, but define m(G,S,c) to be the number of
puzzles solvable by singles, that have solution grid G, and 'c' givens chosen from cell set S.

P.S. I've used this same kind of technique before, to estimate the average number of
mniimal UA sets of a given size, in a "typical" solution grid -- by estimating the numbers
of (grid, minimal UA in the grid) pairs for a range of UA set sizes, and dividing by the total
number of grids. There are no proofs here either, but a post here, showing some initial
results, and a followup here -- both on the Programmer's Forum.

Website · by **denis_berthier** » Mon Jun 10, 2013 5:58 am

blue wrote:Any proof here, is too trivial to bother with.

This is surprising on your part.

Red Ed wanted to recall his technique for generating c-clue puzzles; as far as I can see, there's nothing new wrt our old discussion.

Apart from this, I don't see the relation with my summary post about how to easily use c-clue stats to get unbiased stats, using the known unbiased distribution of clues.

Just seeing your P.S.: good to see the method can be used.

by **Red Ed** » Mon Jun 10, 2013 6:42 am

Thank you, blue, for adding your voice.

Denis> As for your method, if, after all this time, you don't have a self-contained reference where it is fully defined and your claims are proven, this is the end of this discussion for me.
OK, I'll open a new thread. My first post describing the subsets method was in a thread that died when the server crashed (it was dated just a few days before you opened the RDMP thread). Since you like self-contained references, I'll put the supersets method in there as well. Maybe as a result you will understand well enough to stop dismissing such techniques and discouraging contributions from the likes of dobrichev.

But, work first. Let me know in the mean time if there's anything in particular you'd like to see in the self-contained reference.

Website · by **denis_berthier** » Mon Jun 10, 2013 7:21 am

Red Ed wrote:Denis> As for your method, if, after all this time, you don't have a self-contained reference where it is fully defined and your claims are proven, this is the end of this discussion for me.
OK, I'll open a new thread.

Very good idea.

Red Ed wrote:Let me know in the mean time if there's anything in particular you'd like to see in the self-contained reference.

Anything defining the method unambiguously, not just a sequence of obvious formulæ from combinatorics.

How you define the various parameters and adjust them for obtaining some standard deviation:
- the number of solution grids you consider;
- the number of puzzles you generate for each solution grid;
- ...

An explanation of how these choices are related and of why they don't introduce a bias.
A proof that the puzzles you build from a single solution grid are uncorrelated. (Needless to say, I don't consider that programming a formula and getting numerical results from it makes a proof.)

by **blue** » Mon Jun 10, 2013 8:57 am

Hi Denis,

denis_berthier wrote:A proof that the puzzles you build from a single solution grid are uncorrelated.

They are, of course, correlated -- they all have the same solution, if nothing else.
The correlations don't introduce a bias, though.
The worst they can do (if they're actually present for the statistic of interest) is produce
m_std_dev(s,c) numbers that are larger that they might otherwise be.
Those numbers don't figure into the actual estimates for the n(c) numbers, though -- only
into the error estimates. That can be countered with a larger sample size, if nothing else.

It's my feeling that that's as close to a proof as you're going to see.
I've taken note of your slights at my list of equations, but the knowledge you seek is there to see.
An argument that the correlations would lead to a bias, seems to fall to you.
I wouldn't recommend trying to produce one, but there is often much to learn from failure.

Best Regards,
Blue.

Website · by **denis_berthier** » Mon Jun 10, 2013 9:36 am

Hi Blue,

blue wrote:
denis_berthier wrote:A proof that the puzzles you build from a single solution grid are uncorrelated
They are, of course, correlated.

Glad to hear it, at last !!!

blue wrote:The correlations don't introduce a bias, though.

I didn't say they did.

blue wrote:The worst they can do (if they're actually present for the statistic of interest) is produce
m_std_dev(s,c) numbers that are larger that they might otherwise be.

We are getting close to the point !

blue wrote:Those numbers don't figure into the actual estimates for the n(c) numbers, though -- only
into the error estimates. That can be countered with a larger sample size, if nothing else.

Now, we're at it. How much larger is the whole question and I don't believe there can be any general answer independent of the random variable under study.
Whence my list of questions about adjusting the various parameters.
Note that having to increase the sample size may also have drastic computational consequences if the random variable of interest is hard to compute.

blue wrote:I've taken note of your slights at my list of equations, but the knowledge you seek is there to see.

Apologies if I gave you this feeling. I don't underestimate your equations; but in my view equations come after theory.
If the puzzles are correlated, I don't see how you could compute standard deviations in a way that allowed to adjust the parameters in advance.

by **blue** » Mon Jun 10, 2013 12:43 pm

Hi Denis,

denis_berthier wrote:
blue wrote:Those numbers don't figure into the actual estimates for the n(c) numbers, though -- only
into the error estimates. That can be countered with a larger sample size, if nothing else.

Now, we're at it. How much larger is the whole question and I don't believe there can be any general answer independent of the random variable under study.
Whence my list of questions about adjusting the various parameters.
Note that having to increase the sample size may also have drastic computational consequences if the random variable of interest is hard to compute.

I agree about the general answer. There may be an upper bound, but if so, it's probably much
larger than would be encountered in any practical situation.
For the rest, I'll should defer to Red Ed.

denis_berthier wrote:If the puzzles are correlated, I don't see how you could compute standard deviations in a way that allowed to adjust the parameters in advance.

Here, we need to be on the same page, as far as what kinds of standard deviations we're talking about.

The one I mentioned above, was a standard deviation in an estimate of the value of a constant times
the actual average of very long (but fixed) list of numbers, based on a random sampling of those numbers.
The formula had as a parameter, which was the actual standard deviation of the numbers in the very
long list. In practice, we can't compute that, any more than we can compute the average, but we can
estimate it, by calculating the standard deviation for the sampled numbers.

There is another kind of standard deviation -- the standard deviation in the average number of clues
in a minimal puzzle, for example -- that we can calculate exactly, if we know the actual numbers of
puzzles of each size ... the actual n(c) numbers. We won't be able to calculate the actual numbers
(as always), but instead, we can use the n(c) estimates in the same calculation.

When we do that, we might like an error estimate for the standard deviation that we've calculated.
I guess I won't go into the details of how that can be done -- maybe you're already familiar with the basic ideas.
The details will depend on whether the n(c) estimates were calculated from the same set of (G,S)
samples, or different sets. If it's different sets (not very efficient), then the result will depend only
on the n(c) estimates, and thier error estimates. If it's the same set, then correlations between the
n(c) errors are also important. In that case, if we want the "best" error estimate, then besides
calculating sample averages for m(G,S,c) and m(G,S,c)*m(G,S,c) for the various c's, we'ld also need
sample averages for products like m(G,S,c1)*m(G,S,c2).
Variances and covariances, or standard deviations and correlation coefficients, can replace the
product averages, and vica-versa -- I'm assuming you're familiar with all of that.
Note: here too, correlations aren't figuring into the actual calculation of the estimated standard
deviation, but only into the calculation of the estimated error.

Regards,
Blue.

Website · by **denis_berthier** » Mon Jun 10, 2013 3:23 pm

Hi Blue,

Of course, there's no problem for the estimate of E(X), but knowing such an estimate without knowing its "uncertainty" / "precision" / "error" is not very useful in my view.

Using std(X) as an approximation of the "uncertainty" / "precision" / "error" of this estimate of X is merely a convention; as you know, one can use confidence intervals and confidence levels if one wants to be more precise; but this is equivalent. In the sequel, I'll therefore concentrate on the only mathematical notion, std(X).
In this context, I can make no sense of your sentence:

Blue wrote:Note: here too, correlations aren't figuring into the actual calculation of the estimated standard
deviation, but only into the calculation of the estimated error.

Now, I'd like to clarify the following points.

Consider a random variable X (e.g. the W rating of a randomly chosen puzzle or the presence of an sk-loop in a randomly chosen puzzle).
Consider also a set of n independent instances Xk of X (associated with different puzzles Pk).
One can estimate the mean and the standard deviation of X from the sample made of the n Xk's. In particular, there's a full fledged theory for this independent case and there are known unbiased estimators for E(X) and almost unbiased for std(X)* (given by the usual formula with sqrt(1/(n-1)) for std(X)).
(*) There is a really unbiased estimator for var(X).

In my controlled-bias approach, all the puzzles I generate are independent and I can fully rely on this theory to estimate the mean and standard deviation of any random variable. All the computations are straightforward.
In my summary post (http://forum.enjoysudoku.com/the-real-distribution-of-minimal-puzzles-t30127-8.html), I mentioned that, now that the distribution of clues is known, top-down generators could also be used, provided that the corrections are made as given by the formulæ I wrote. In the JExocet thread, I showed how this can be applied in practice (http://forum.enjoysudoku.com/jexocet-pattern-defintion-t31133-213.html)

Now, what's the difference with Red Ed's situation? He has a set of correlated (and therefore not independent) instances of X. From this point, there are two possibilities:
- either you compute an "empirical standard deviation" using the same formula as if the Xk's were independent - but this is where words are delusory: it is by no means an unbiased (nor even almost unbiased) estimator of the real std(X); it can even be grossly wrong if there is a strong correlation;
- or, for any possible variable X, you need to define an (almost) unbiased estimator for sdt(X), prove that it is (almost) unbiased and, most of all, compute how it will vary with the size of your sample; until now I've not seen any of this in Red Ed's approach and I doubt I'll ever see any of this.

by **champagne** » Mon Jun 10, 2013 5:44 pm

I am not by far an expert in statistics, but I follow with interest that discussion.

In one of the last links provided by Denis, I can read that

denis_berthier wrote:
Estimated unbiased proportion of puzzles having an exocet in the SER > 8.6 area
....
= 0.018%

This is a small proportion (less than 2 in 10,000), although this is more than I expected. The reason is, there are more cases than expected with more than 24 clues.

I have a small problem

Here is the distribution of ratings in the sample (assumed unbiased) file (read in the first column 8.6 for 86)

Code: Select all: 86 28374 87 21913 88 86807 89 101911 90 41203 91 2219 92 158 93 3 282588

In my view, such a sample has a null value to draw any conclusion for ratings over say 9.0

what is experts view on that

Website · by **denis_berthier** » Tue Jun 11, 2013 3:52 am

champagne wrote:Here is the distribution of ratings in the sample (assumed unbiased) file (read in the first column 8.6 for 86)

Code: Select all
86 28374 87 21913 88 86807 89 101911 90 41203 91 2219 92 158 93 3 282588

In my view, such a sample has a null value to draw any conclusion for ratings over say 9.0

Most of the polls you read in the newspapers rely on samples < 1,000.
As precision of the results varies like sqrt(1/sample_size) for uncorrelated samples, significant improvements in precision require very large increases in size. That's why small samples are generally used, unless precision is absolutely necessary for some reason.

If you truncate your sample at 9.0, you still have more than 43,000 puzzles.
If you truncate it at 9.1, you still have more than 2,300 puzzles; this doesn't provide high precision, but it's better than nothing.
If you tried to truncate it at 9.2, I don't think it would keep much meaning.

The New Sudoku Players' Forum

The real distribution of minimal puzzles

Re: The real distribution of minimal puzzles

Re: The real distribution of minimal puzzles

Re: The real distribution of minimal puzzles

Re: The real distribution of minimal puzzles

Re: The real distribution of minimal puzzles

Re: The real distribution of minimal puzzles

Re: The real distribution of minimal puzzles

Re: The real distribution of minimal puzzles

Re: The real distribution of minimal puzzles

Re: The real distribution of minimal puzzles

Re: The real distribution of minimal puzzles

Re: The real distribution of minimal puzzles

Re: The real distribution of minimal puzzles

Re: The real distribution of minimal puzzles

Re: The real distribution of minimal puzzles