The Sudoku grey zone

Advanced methods and approaches for solving Sudoku puzzles

Re: The Sudoku grey zone

eleven wrote: No, the tamagochi method would produce the same bias, because it basically uses the same method of neighbourhood search with filtering of "easy" puzzles (it was just useful that the criteria for expanding/filtering or adding new sets could be changed dynamically during the generation process).
The bias comes from the fact that puzzles in big "clouds" of hard puzzles with similar properties (!) will be over-represented, while others with a small number of toughies in their neighbourhood are not found. So it's not the best choice to start with the hardest set to generate grey zone puzzles if you want to know how common Exocet patterns are (because we know that the harder the puzzles, the more common they are).
A bottom-up generation with a limited number of neighbourhood extensions would probably be better, but then the effort to get a set of the same size would be many times higher.

I wasn't proposing to start from the hardest collection - for the same reasons as you, I think the bias would be much too strong and no credible conclusion could be drawn as to patterns having high frequency in the hardest.
As far as I can remember, when you generated your hardest list you started from a random set. Couldn't the same process be applied with a different target (SER = x.x instead of SER as high as possible)? I'm not saying the resulting collection wouldn't be biased, but at least it would avoid the bias that one would have if they started with the hardest.

champagne wrote: It is somewhat strange to read at the same moment that some would like to know more about the grey zone but are not prepared to spend a penny (nor one cycle) on it.

I could spend cycles on an old computer to run an existing program, but I have no time for writing one (and no longer any competence for writing a fast one, as may be necessary here).

Just as a reminder of how much different goals can change complexity and computation times, generating one puzzle with the controlled-bias generator required (on average) more than 250,000 times the effort needed to generate a puzzle with a top-down generator. It took me months of CPU time to generate about 6,000,000 puzzles (which can be done in a few seconds if no goal for bias is set).
denis_berthier
2010 Supporter

Posts: 1253
Joined: 19 June 2007
Location: Paris

Re: The Sudoku grey zone

denis_berthier wrote:Couldn't the same process be applied with a different target (SER = x.x instead of SER as high as possible)?

Yes, of course, but the bias would become stronger with each follow-up neighbourhood search, so a great many seeds would have to be taken, and most of them will probably not show an ER 9+ puzzle in the neighbourhood after a small number of, say, {-1+1} searches (while searching the neighbourhood of already hard puzzles will generate more and more).
So I am with champagne. Also, a set (quickly) generated from the hardest will bring useful results (and maybe the Exocets are already that rare in this set, or the eliminations so weak, that no further investigation is interesting).
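
For readers unfamiliar with the {-1+1} step mentioned above, here is a minimal Python sketch of how the raw neighbourhood of one puzzle might be enumerated. It assumes the convention of removing one clue and adding one clue in a cell that was empty in the original puzzle; other conventions exist. The function name `minus1_plus1_candidates` is hypothetical, and the uniqueness check and SER rating that a real search needs are deliberately left out.

```python
from typing import Iterator


def minus1_plus1_candidates(puzzle: str) -> Iterator[str]:
    """Yield raw {-1+1} neighbours of an 81-character puzzle ('0' = empty).

    Assumption: the added clue goes into a cell that was empty in the
    original puzzle. No validity or uniqueness check is done here.
    """
    if len(puzzle) != 81:
        raise ValueError("expected an 81-character puzzle string")
    clues = [i for i, ch in enumerate(puzzle) if ch != "0"]
    empties = [i for i, ch in enumerate(puzzle) if ch == "0"]
    for removed in clues:
        base = list(puzzle)
        base[removed] = "0"  # the "-1" part: drop one clue
        for added in empties:
            for digit in "123456789":
                candidate = base.copy()
                candidate[added] = digit  # the "+1" part: add one clue
                yield "".join(candidate)


# The 22-clue puzzle quoted later in this thread (Coloin 02805 in gsf's list).
SAMPLE = (
    "600000002090400050001000700050943000000105000"
    "000800040007000600030009080200000001"
)
neighbours = list(minus1_plus1_candidates(SAMPLE))
print(len(neighbours))
```

Most of these raw candidates are not valid puzzles (conflicting digits, no solution or several), so each one would still have to go through a solver and a rater before being kept. That filtering step is where the cost, and the bias discussed here, comes from.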
eleven

Posts: 1670
Joined: 10 February 2008

Re: The Sudoku grey zone

eleven wrote:a set (quickly) generated from the hardest will bring useful results (and maybe the exocets are already that rare in this set - or the eliminations so weak, that no further investigation is interesting).

I agree that it can bring this kind of negative result and it may therefore be a useful preliminary step, but no positive result could have any degree of credibility.
denis_berthier

Anomalies in the hardest list

When we compare the origin of the puzzles in champagne's hardest list with the number of patterns of some type they contain, some noticeable differences appear. (Reference: classification 2012/12/24)

Number of puzzles declared as having some pattern (E, EE, sk*)
*: champagne renames this V-loop but this is what everyone knows as sk-loop.
Code:
```
creator ->        GP        dob       elev        col      tarek
pattern
E             163413     260606       3726        460         34
EE             50675      85163       1253         32          1
sk              3281      11040         29        101        240
```

What's interesting is the ratio of some pattern (EE, sk) with respect to Exocet:

Code:
```
creator ->        GP        dob       elev        col      tarek
pattern
EE/E*100          31         33         34          7          3
sk/E*100           2          4        0.8         22        706
```

Which conclusions to draw is not clear but there's a very strong bias somewhere.
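
The ratio rows can be re-derived from the counts given above. The following Python sketch does only that arithmetic; the dictionary layout and the helper name `ratio_pct` are illustrative choices, not part of any existing tool.

```python
# Pattern counts per creator, copied from the first table above.
counts = {
    "GP":    {"E": 163413, "EE": 50675, "sk": 3281},
    "dob":   {"E": 260606, "EE": 85163, "sk": 11040},
    "elev":  {"E": 3726,   "EE": 1253,  "sk": 29},
    "col":   {"E": 460,    "EE": 32,    "sk": 101},
    "tarek": {"E": 34,     "EE": 1,     "sk": 240},
}


def ratio_pct(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


ratios = {
    creator: {
        "EE/E": ratio_pct(c["EE"], c["E"]),
        "sk/E": ratio_pct(c["sk"], c["E"]),
    }
    for creator, c in counts.items()
}

for creator, r in ratios.items():
    print(f"{creator:>6}  EE/E*100 = {r['EE/E']:6.1f}   sk/E*100 = {r['sk/E']:6.1f}")
```

Note that sk/E can exceed 100 (as for tarek) because the sk count is not restricted to puzzles that also have an Exocet.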
denis_berthier

Re: Anomalies in the hardest list

denis_berthier wrote:When we compare the origin of the puzzles in champagne's hardest list with the number of patterns of some type they contain, some noticeable differences appear. (Reference: classification 2012/12/24)

I really don't see what kind of information the split per "creator" can give.

We all know that at the start, providers selected some files by hand and worked mainly in the 20-22 clue area.
The SK-loop pattern requires four empty cells in a rectangle in four boxes. This was relatively common in the patterns selected by hand.
Surely the frequency of the SK-loop pattern was overestimated at the start.

What I see, with little change since the file reached a reasonable size, is:

- a frequency of the Exocet in these SER ratings slightly below 80%;
- a surprisingly high frequency of the double Exocet pattern, over 20%;
- a slightly decreasing frequency of the SK-loop pattern, below 3%;
- and the rank 0 logic (but we are far from having covered the potential) around 6.5%.

My personal conviction is that we now have in the file a huge majority of the "potential hardest puzzles" with fewer than 24 clues. That does not leave much room for a big bias.

In the 24-25 clue area, there is still room for a significant deviation from the total frequency, but the process applied is not so biased and I would not be surprised if we already had more than 50% of the potential hardest with 24 clues.
champagne
2017 Supporter

Posts: 5753
Joined: 02 August 2007
Location: France Brittany

Re: Anomalies in the hardest list

champagne wrote: I really don't see what kind of information the split per "creator" can give.

Many kinds (what I did above is the typical kind of consistency checks one makes when having multiple sources of information for the same topic and before fusing them):

- coloin and tarek are strongly biased for sk-loops;
- they are also strongly biased for the EE/E proportion and therefore probably also for E;
(these are the most likely hypotheses; the alternative being that GP, dob, elev are the strongly biased ones);

- at least one of GP, dob, elev is strongly biased for sk-loops; is the real frequency close to 0.8 or to 4 (a ratio of 1 to 5 cannot be neglected)? As, AFAIK, GP, dob and elev used similar generation techniques, it is strange that they get similar EE/E ratios but such different sk/E ratios.

I don't comment on personal opinions.
denis_berthier

Re: The Sudoku grey zone

It's no surprise that a large proportion of the puzzles we found when Easter Monster came out have the SK-loop.

Easter Monster was made by:
- taking the 16-clue backbone [everything but box 5];
- generating the potential grid completions with a full box 5;
- removing clues from box 5.

All the puzzles will have the sk-loop: the 16-clue backbone has a significantly reduced solution count.

Tarek and I generated subsequent puzzles with a {-1+1} search until overwhelmed!

Probably most of the puzzles with an SK-loop are known.

C
coloin

Posts: 1662
Joined: 05 May 2005

Re: The Sudoku grey zone

Hi coloin,

So this explains the sk/E discrepancy.
Do you have any idea about the EE/E discrepancy?
denis_berthier

Re: The Sudoku grey zone

coloin wrote:Probably most of the puzzles with an SK-loop are known.

C

Hi coloin,

It is evident that any new entry in the database must be new with respect to the previous ones, so if a deep search on patterns eligible for the sk-loop has been made, later entries must have a lower ratio.

This is correct.

Nevertheless, the SK-loop can also be found with 24 or 25 clues, though with a lower chance of getting the favourable pattern.

In my last run, for pending new entries, I have roughly 1% of puzzles having the sk-loop.
champagne

Re: The Sudoku grey zone

Well, I hadn't realized that they could be present with 25 clues.
The first puzzles which had Exocets were found by chance and we weren't really aware of it. A vicinity search on these puzzles by more recent contributors probably explains why they are much more frequent.
C
coloin


Re: Anomalies in the hardest list

denis_berthier wrote: AFAIK, GP, dob and elev used similar generation techniques, it is strange that they get similar EE/E ratios but so different sk/E ratios.

I found a good part of tarek's puzzles again, so my real (overall) sk/E ratio was higher.
[Added:] To be more precise, in my summary 2 years ago I wrote:
[I found] almost 90% of the known ER 11+ puzzles (independently - only 92 of them passed my rating filters)
eleven


Re: The Sudoku grey zone

So, we now have a better idea of what's in the "potential hardest" list.

If using it as a starting point for a vicinity search for any pattern with lower SER, say JExocet (or Exocet), it's more or less obvious that the result will be strongly biased.
There may be several ways of getting a (vague) idea of the possible bias:
- keep track of how many +1/-1 steps there are between the original puzzle and the final one and state the results as a function of this distance;
- keep track of the Exocets in the original puzzle and those in the final one and check how the results vary if we count all the Exocets in the final puzzles or only those that weren't already there in the original ones.
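
The two bookkeeping ideas above can be sketched in a few lines of Python. Everything here is hypothetical: the record type `SearchResult`, the use of plain strings as Exocet identifiers, and the toy data at the end are made up purely to show the shape of the computation, not taken from any real run.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable


@dataclass(frozen=True)
class SearchResult:
    distance: int                  # number of +1/-1 steps from the seed puzzle
    seed_exocets: FrozenSet[str]   # hypothetical ids of Exocets in the seed
    final_exocets: FrozenSet[str]  # hypothetical ids of Exocets in the result


def exocet_rate_by_distance(
    results: Iterable[SearchResult], new_only: bool = False
) -> Dict[int, float]:
    """Fraction of final puzzles having an Exocet, grouped by distance.

    With new_only=True, Exocets already present in the seed are ignored.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        found = r.final_exocets - r.seed_exocets if new_only else r.final_exocets
        totals[r.distance] += 1
        if found:
            hits[r.distance] += 1
    return {d: hits[d] / totals[d] for d in sorted(totals)}


# Made-up toy data, only to exercise the function.
toy = [
    SearchResult(1, frozenset({"a"}), frozenset({"a"})),
    SearchResult(1, frozenset(), frozenset()),
    SearchResult(2, frozenset({"a"}), frozenset({"a", "b"})),
]
print(exocet_rate_by_distance(toy))
print(exocet_rate_by_distance(toy, new_only=True))
```

If the rate drops quickly with distance, or is much lower when inherited Exocets are excluded, that would be a hint of how much of the observed frequency is carried over from the seeds.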
denis_berthier

Re: The Sudoku grey zone

coloin wrote:Probably most of the puzzles with an SK-loop are known.

C

Hi coloin,

I am running a first test in the grey zone and I can already tell you that this is not correct.

Years ago I studied gsf's taxonomy file and found many puzzles with a relatively low rating that have the sk-loop,
like this one, which I used as an example on my website:

Sample1 Coloin 02805 in gsf list (as of 2008 02 26)
600000002090400050001000700050943000000105000000800040007000600030009080200000001

The first "V" loop not recognised by ronk as an SK-loop was seen in a puzzle with a rating around SER 9.0.

What will be interesting is to see how the frequency evolves as the average rating goes down.
champagne

How frequent are the J-Exocets in the grey zone?

For the 5,926,343 puzzles in the controlled-bias collection produced by the controlled-bias generator (*), I had computed long ago:
- the SER for the first 3,037,717
- the W rating for the whole collection (at that time, instead of "W rating", I said "pure NRCZT rating" but it's the same thing).

Considering only the first 3,037,717:
5615 have SER >= 9.0
2353 have W >= 8
664 have W >= 9

I have looked for JExocets in the 664 W>=9 cases, which corresponds to a stricter definition of the lower bound of the grey zone than SER >= 9.0

I activated JEs and the rules of SSTS - i.e. Whips[1] and (Naked, Hidden and Super-Hidden) Subset rules - and nothing else.
In order to avoid degenerate cases, JEs of any size were assigned lower priority than all the rules in SSTS.
By JE, I mean standard Jk-Exocets, with k = 2, 3, 4 or 5 (as defined in this post: http://forum.enjoysudoku.com/pattern-based-classification-of-hard-puzzles-t30493-85.html). Franken or Blue's extensions were not taken into account (for the sole reason that they are not programmed in SudoRules), but I don't think they could lead to very different stats.

No JExocet was found in any of these puzzles; so, the calculations for a rough estimate of the unbiased frequency in the grey zone shouldn't require a doctorate in statistics and I won't invest more personal time on this topic.

[Added: How frequent are sk-loops?: none found in this sample]
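
As an illustration of the "rough estimate" alluded to above, here is a minimal Python sketch of the usual upper bound when zero hits are observed in n trials. It assumes, as a simplification, that the 664 puzzles can be treated as independent draws from the W >= 9 population, and it ignores any correction for the known bias of the controlled-bias generator; the function name `zero_hit_upper_bound` is hypothetical.

```python
def zero_hit_upper_bound(n: int, confidence: float = 0.95) -> float:
    """Largest frequency p still compatible with 0 hits in n independent draws.

    Solves (1 - p) ** n = 1 - confidence for p.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


n = 664  # W >= 9 puzzles examined, none with a JExocet
exact = zero_hit_upper_bound(n)
rule_of_three = 3.0 / n  # common approximation of the 95% bound
print(f"exact 95% upper bound: {exact:.3%}")
print(f"rule of three:         {rule_of_three:.3%}")
```

Under these assumptions the bound comes out a little below half a percent, which only says that JExocets are rare in this sample, not how rare exactly.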
denis_berthier

Distribution of clues in the grey zone

This question arose in another thread.

I haven't defined the grey zone in a very precise way. Depending on how I make it more precise, SER >= 9.0 or W >= 9, different calculations can be done but, as far as the number of clues is concerned, they don't lead to significantly different results.

If I consider the whole collection of 5,926,343 puzzles generated by the controlled-bias generator, 1258 have their W rating >= 9.
The raw distribution of clues for them is as follows:

Code:
```
nb-clues   nb-instances   %
19         0
20         0
21         0
22         0
23         22             1.7
24         106            8.4
25         306            24.3
26         415            33.0
27         288            22.9
28         102            8.1
29         17             1.4
30         2              0.2
31         0
32         0
33         0
34         0
35         0
mean= 25.97
standard-deviation= 1.20
```
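
The mean and standard deviation quoted for the W >= 9 puzzles can be re-derived from the raw counts. The Python sketch below does this for the table above, using the population formula (division by N); the helper name `mean_and_sd` is an illustrative choice, and no bias correction is applied, in keeping with the "raw" figures of the post.

```python
from math import sqrt
from typing import Dict, Tuple

# Non-zero rows of the W >= 9 table above: number of clues -> number of puzzles.
w9_counts = {23: 22, 24: 106, 25: 306, 26: 415, 27: 288, 28: 102, 29: 17, 30: 2}


def mean_and_sd(counts: Dict[int, int]) -> Tuple[int, float, float]:
    """Return (total, mean, population standard deviation) of a count table."""
    total = sum(counts.values())
    mean = sum(k * n for k, n in counts.items()) / total
    variance = sum(n * (k - mean) ** 2 for k, n in counts.items()) / total
    return total, mean, sqrt(variance)


total, mean, sd = mean_and_sd(w9_counts)
print(f"total = {total}, mean = {mean:.2f}, standard deviation = {sd:.2f}")
```

The same function can be applied to the other two tables in this post by swapping in their counts.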

If I consider only the first 3,037,717 for which I had computed the SER, 5615 have their SER >= 9.0. The raw distribution of clues for them is:

Code:
```
nb-clues   nb-instances   %
19         0
20         0
21         0
22         2              0.04
23         46             0.8
24         416            7.4
25         1319           23.5
26         1915           34.1
27         1380           24.6
28         440            7.83
29         90             1.6
30         7              0.1
31         0
32         0
33         0
34         0
35         0
mean= 26.05
standard-deviation= 1.15
```

For comparison, I recall the data for the whole cb-sample (see p.43 of the pdf in the "real distribution" thread):

Code:
```
nb-clues   nb-instances   %
20         2              3.7e-05
21         164            0.0027
22         6,651          0.1124
23         110,103        1.858
24         704,089        11.88
25         1,814,413      30.62
26         2,002,349      33.79
27         1,007,700      17.00
28         247,259        4.172
29         31,449         0.531
30         2,088          0.0352
31         74             0.00125
32         2              3.37e-05
mean= 25.67
standard-deviation= 1.12
```
denis_berthier
