Hi
Blue,
blue wrote: The only caveat is that, besides having the puzzles, you need to have a "batch number" associated with each puzzle,
and you need to use it in calculating the error estimates (but not the value estimates). The "batch number" would indicate which (G,S) sample the puzzles were from. Alternatively, the puzzles can just be grouped into sets coming from the same (G,S) sample. A sequential list with "start" markers would suffice.
All I need is a single mean "batch number" for the whole collection; in any case, I don't see how that could be a caveat.
If you want to use sub-samples instead of the whole collection, you cannot cut the latter just anywhere: you need these "start markers" and the "batch number" for the sub-collection; but, here again, I can't see any problem.
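For concreteness, here is a minimal sketch in Python of how such a sequential list would be cut into batches (the "START"-marker file format is only a hypothetical illustration, not the format I actually use):
[code]
# Group a sequential puzzle list into batches, one batch per (G,S) sample.
# Hypothetical format: a line "START" opens each sample, followed by its
# puzzles, one 81-character string per line.

def read_batches(path):
    batches = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line == "START":
                batches.append([])            # a new (G,S) sample begins
            elif line and batches:
                batches[-1].append(line)      # puzzle of the current sample
    return batches

# batches = read_batches("collection.txt")
# K = len(batches)                    # number of (G,S) samples
# N = sum(len(b) for b in batches)    # total number of puzzles
[/code]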
blue wrote: if you've looked at the beginning of Red Ed's post, you can see that for producing an error estimate, the final word will be that the standard deviation of the estimates goes like sqrt(Var(V) / K) ... or if you like, sqrt(Var(V) / s(K-1)). The only question is: what is Var(V) for a given subset size and statistic of interest?
Yes
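In code, if I read the formula correctly, with V_1, ..., V_K the per-sample values (one per (G,S) sample), the error estimate is simply (a sketch):
[code]
import math

def std_error(v):
    # Standard error of the mean of K per-sample values V_1..V_K:
    # sqrt(Var(V) / K), with Var(V) the unbiased sample variance.
    k = len(v)
    mean = sum(v) / k
    var = sum((x - mean) ** 2 for x in v) / (k - 1)
    return math.sqrt(var / k)
[/code]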
blue wrote: there are two different things that we can talk about
I agree, we shouldn't confuse generating a collection of puzzles (1) and using it (2).
Generating a sample is a one-shot business; using it is an iterative one: every time you want to study a new random variable, you can re-use the same sample.
(1) Generation
blue wrote: how the collection of puzzles was generated in the first place. Having some experience using both methods, I can tell you that Red Ed's method is capable of generating a much larger collection of puzzles in a fixed time than the "controlled bias" method -- which I'd prefer to call the "known bias" method.
Comparing generators with different properties is meaningless in my view. My generator produces uncorrelated samples. Comparing it with Red Ed's may be more meaningful than comparing it with top-down, but not much more.
The expression "controlled-bias" has a history, I defined it in opposition to unknown bias: by adapting a top-down algorithm, we can control the bias in such a way that it becomes known.
blue wrote: I've always been curious how many CPU core hours it took to generate the collection that you produced. I'm guessing something like 100 (2 GHz) Macs for a month. Would you mind giving a number (an estimate if necessary)?
I'd like to have 100 Macs (especially the forthcoming new ones: http://www.apple.com/mac-pro/ - free ad. Donations accepted.)
I never made a secret of it, but I don't have a precise estimate. I started using the generator before drastic improvements were made.
As the controlled-bias algorithm is a simple modification of top-down, it could easily be implemented by eleven, starting from suexg (as I remember it, only one or two lines of code were changed). There was much collaborative work to improve it (see chapter 6 of my last book for a list of participants). The final version was ~ 200 times faster than the first. As these improvements could only change the speed, not the results, I kept all the puzzles generated by all the successive versions.
With these preliminaries, the whole generation process took several weeks (between 6 and 8, I can't remember exactly). At the start, I used only my 2006 MacPro. After some time, I also used Linux computers from my lab, up to ~ 100 cores during the weekends; on workdays, fewer cores were available and the job was assigned low priority, so I have no idea what percentage of the cores was actually used.
So, the estimate remains rather vague, but that's all I have.
(2) Use
blue wrote: I wouldn't deny that if the collection was generated by Red Ed's method, then you might need a larger sample to achieve the same error estimate -- emphasis on the might. When the statistic is something like an SE rating, then every puzzle in a batch contributes, and 1) calculating SE ratings is expensive, and 2) the average batch size might be large, depending on the size of S.
Most of the stats I've been interested in fall in this category.
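To make the cost point concrete, here is a sketch (re-using the hypothetical read_batches and std_error above; se_rating stands for any expensive per-puzzle rating function): the expensive part scales with N, the total number of puzzles, while the error estimate has only K independent values to work with.
[code]
def batch_values(batches, se_rating):
    # Within-batch means V_k; the exact weighting of batches of unequal
    # sizes is a separate question, left aside in this sketch.
    return [sum(se_rating(p) for p in b) / len(b)
            for b in batches if b]            # skip empty batches

# batches = read_batches("collection.txt")
# V = batch_values(batches, se_rating)        # costs N calls to se_rating
# estimate = sum(V) / len(V)
# error = std_error(V)                        # the error only sees K values
[/code]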
(3) Confusing generation and use
Red Ed wrote: Denis, stop celebrating, I'm not done yet
Would you mind not trying to turn this again into a personal affair?
Red Ed wrote: The reason that the correlated internals of S (the subtree) don't hurt much is that we can search within S very efficiently. A 24-clue subtree for 21-clue minimals contains 2024 21-clue subsets: that's the "equivalent number" mentioned above. But we can search that subtree at a cost of substantially fewer than 2024 suexg-cb type probes -- maybe 2000 times quicker, if the 24-clue "root" has >1 solutions (because then there's nothing more to check). It's this algorithmic efficiency in searching the tree that offsets the negative effect of correlations internal to S,
The term "offsets" relies on a confusion between generation and use. You implicitly consider you have to generate a new collection every time you use it.
But the more you use the collection, the more the negative effects of correlation will appear: the generation-time savings are paid only once, whereas the penalty of correlation is paid again for each new random variable you study.
Red Ed wrote: ultimately making subtree and supertree methods the best choice in certain (but not all) parts of the distribution.
I guess you are speaking of the number of clues. As far as I can remember, your generation method was more efficient (for this very particular random variable) in the upper part of the distribution, where the probabilities are close to 0.