The Sudoku grey zone

Advanced methods and approaches for solving Sudoku puzzles

Re: The Sudoku grey zone

Postby denis_berthier » Wed May 22, 2013 10:46 am

eleven wrote:No, the tamagochi method would produce the same bias, because basically it uses the same method of neighbourhood search with filtering "easy" puzzles (it was just useful, that the criteria for expanding/filtering or adding new sets could be dynamically changed during the generation process).
The bias then comes from the fact that puzzles in big "clouds" of hard puzzles with similar properties (!) will be over-represented, while others with a small number of toughies in their neighbourhood are not found. So it's not the best choice to start with the hardest set to generate grey zone puzzles, if you want to know the commonness of Exocet patterns (because we know they are the more common, the harder the puzzles are).
Probably better would be a bottom-up generation with a limited number of neighbourhood extensions, but then the effort to get a set of the same size would be a multiple higher.

I wasn't proposing to start from the hardest collection - for the same reasons as you, I think the bias would be much too strong and no credible conclusion could be drawn as to patterns having high frequency in the hardest.
As far as I can remember, when you generated your hardest list you started from a random set. Couldn't the same process be applied with a different target (SER = x.x instead of SER as high as possible)? I'm not saying the resulting collection wouldn't be biased, but at least it would avoid the bias that one would have if they started with the hardest.


champagne wrote:it is somehow strange to read at the same moment that some would know more about the grey zone but are not prepared to spend a penny (nor one cycle) to do it.

I could spend cycles on an old computer to run an existing program, but I have no time for writing one (and no longer any competence for writing a fast one, as may be necessary here).


Just as a reminder of how much different goals can change complexity and computation times, generating one puzzle with the controlled-bias generator required (on average) more than 250,000 times the effort needed to generate a puzzle with a top-down generator. It took me months of CPU time to generate about 6,000,000 (which can be done in a few seconds if no goal for bias is set).
denis_berthier
2010 Supporter
 
Posts: 1253
Joined: 19 June 2007
Location: Paris

Re: The Sudoku grey zone

Postby eleven » Wed May 22, 2013 11:56 am

denis_berthier wrote:Couldn't the same process be applied with a different target (SER = x.x instead of SER as high as possible)?

Yes, of course, but the bias would become stronger with each follow-up neighbourhood search, so a great many seeds would have to be taken - and most of them probably will not show an ER 9+ puzzle in the neighbourhood within a small number of, say, {-1+1} searches (while searching the neighbourhood of already hard puzzles will generate more and more).
So I am with champagne. Also a set (quickly) generated from the hardest will bring useful results (and maybe the exocets are already that rare in this set - or the eliminations so weak, that no further investigation is interesting).
eleven
 
Posts: 1537
Joined: 10 February 2008

Re: The Sudoku grey zone

Postby denis_berthier » Wed May 22, 2013 1:25 pm

eleven wrote:a set (quickly) generated from the hardest will bring useful results (and maybe the exocets are already that rare in this set - or the eliminations so weak, that no further investigation is interesting).

I agree that it can bring this kind of negative results and it may therefore be a useful preliminary step, but no positive result could have any degree of credibility.

Anomalies in the hardest list

Postby denis_berthier » Thu May 23, 2013 4:38 am

When we compare the origin of the puzzles in champagne's hardest list with the number of patterns of some type they contain, some noticeable differences appear. (Reference: classification 2012/12/24)

Number of puzzles declared as having some pattern (E, EE, sk*)
*: champagne renames this V-loop but this is what everyone knows as sk-loop.
Code: Select all
creator ->        GP       dob       elev        col        tarek
pattern
E              163413     260606     3726        460         34
EE              50675      85163     1253         32          1
sk               3281      11040       29        101        240


What's interesting is the ratio of some pattern (EE, sk) wrt Exocet:

Code: Select all
creator ->        GP        dob       elev      col         tarek
pattern
EE/E*100          31         33        34         7            3
sk/E*100           2          4       0.8        22          706

Which conclusions to draw is not clear but there's a very strong bias somewhere.
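The percentages in the second table can be recomputed directly from the counts in the first one. A minimal Python sketch (the dictionary layout and the function name `ratio_pct` are illustrative assumptions; only the counts come from the table above):

```python
# Counts copied from the first table (classification 2012/12/24).
COUNTS = {
    "GP":    {"E": 163413, "EE": 50675, "sk": 3281},
    "dob":   {"E": 260606, "EE": 85163, "sk": 11040},
    "elev":  {"E": 3726,   "EE": 1253,  "sk": 29},
    "col":   {"E": 460,    "EE": 32,    "sk": 101},
    "tarek": {"E": 34,     "EE": 1,     "sk": 240},
}


def ratio_pct(creator: str, pattern: str) -> float:
    """Return 100 * (puzzles with `pattern`) / (puzzles with E) for one creator."""
    row = COUNTS[creator]
    return 100.0 * row[pattern] / row["E"]


if __name__ == "__main__":
    for creator in COUNTS:
        print(f"{creator:6s} EE/E*100 = {ratio_pct(creator, 'EE'):6.1f}"
              f"   sk/E*100 = {ratio_pct(creator, 'sk'):6.1f}")
```

Rounded to the precision used in the table, these values agree with the ratios listed above.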

Re: Anomalies in the hardest list

Postby champagne » Thu May 23, 2013 10:04 am

denis_berthier wrote:When we compare the origin of the puzzles in champagne's hardest list with the number of patterns of some type they contain, some noticeable differences appear. (Reference: classification 2012/12/24)


I really don't see what kind of information the split per "creator" can give.

We all know that at the start, providers selected some files by hand and worked mainly in the 20-22 clue area.
The SK loop pattern requires four empty cells in a rectangle in four boxes. This was relatively common in the patterns selected by hand.
Surely the SK loop pattern frequency was overestimated at the start.

What I see, with not much change since the file reached a reasonable size, is:

a frequency of the exocet in these SER ratings slightly below 80%
a surprisingly high frequency of the double exocet pattern, over 20%
a slightly decreasing frequency of the SK loop pattern, below 3%
and the rank 0 logic (but we are far from having covered the potential) around 6.5%

My deep personal conviction is that the file now contains a huge majority of the "potential hardest puzzles" with fewer than 24 clues. That does not leave much room for a big bias.

In the 24-25 clue area, there is still room for a significant deviation from the total frequency, but the process applied is not so biased, and I would not be surprised if we already had more than 50% of the potential hardest with 24 clues.
champagne
2017 Supporter
 
Posts: 5653
Joined: 02 August 2007
Location: France Brittany

Re: Anomalies in the hardest list

Postby denis_berthier » Thu May 23, 2013 10:52 am

champagne wrote:I really don't see what kind of information the split per "creator" can give.

Many kinds (what I did above is the typical kind of consistency checks one makes when having multiple sources of information for the same topic and before fusing them):

- coloin and tarek are strongly biased for sk-loops;
- they are also strongly biased for the EE/E proportion and therefore probably also for E;
(these are the most likely hypotheses; the alternative being that GP, dob, elev are the strongly biased ones);

- at least one of GP, dob, elev is strongly biased for sk-loops; is the real frequency close to 0.8 or to 4 (a ratio of 1 to 5 cannot be neglected); as, AFAIK, GP, dob and elev used similar generation techniques, it is strange that they get similar EE/E ratios but so different sk/E ratios.


I don't comment on personal opinions.

Re: The Sudoku grey zone

Postby coloin » Thu May 23, 2013 11:28 am

It's no surprise that a large proportion of the puzzles we found when Easter Monster came out were found to have the SK-loop.

Easter Monster was made by:
- taking the 16-clue backbone [everything but box 5]
- generating the potential grid completions with a full box 5
- removing clues from box 5

All the puzzles will have the sk-loop - the 16-clue backbone has a significantly reduced solution count.

Tarek and I generated subsequent puzzles with a -1+1 until overwhelmed !
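For readers unfamiliar with the notation, a {-1+1} step removes one clue and adds one clue somewhere else. The sketch below only enumerates the candidate neighbours of a puzzle given as an 81-character string with `0` for empty cells. The function names are illustrative assumptions, and so is the convention that the removed cell may receive a different digit. Checking that a candidate has a unique solution, is minimal, and passes a rating filter is left to an external solver and rater, which are not shown here.

```python
from typing import Iterator, List, Set


def peers(cell: int) -> Set[int]:
    """Cells sharing a row, column or box with `cell` (cells numbered 0..80)."""
    r, c = divmod(cell, 9)
    same_row = {r * 9 + k for k in range(9)}
    same_col = {k * 9 + c for k in range(9)}
    br, bc = 3 * (r // 3), 3 * (c // 3)
    same_box = {(br + i) * 9 + (bc + j) for i in range(3) for j in range(3)}
    return (same_row | same_col | same_box) - {cell}


PEERS: List[Set[int]] = [peers(i) for i in range(81)]


def minus1_plus1(puzzle: str) -> Iterator[str]:
    """Yield every grid obtained by removing one clue and adding one clue.

    Only direct clashes with an existing clue in the same row, column or box
    are filtered out; uniqueness of the solution is NOT checked here.
    """
    grid = list(puzzle)
    clue_cells = [i for i, ch in enumerate(grid) if ch != "0"]
    for removed in clue_cells:
        old = grid[removed]
        grid[removed] = "0"                      # the "-1" part
        for cell in range(81):
            if grid[cell] != "0":
                continue
            used = {grid[p] for p in PEERS[cell]}
            for digit in "123456789":
                if digit in used:
                    continue                     # would clash with a clue
                if cell == removed and digit == old:
                    continue                     # would restore the original
                grid[cell] = digit               # the "+1" part
                yield "".join(grid)
                grid[cell] = "0"
        grid[removed] = old                      # restore before the next removal
```

Every candidate keeps the same number of clues as the starting puzzle, so repeated application explores puzzles of constant size, which is one reason the number of puzzles to examine grows so quickly.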

Probably most of the puzzles with an SK-loop are known.

C
coloin
 
Posts: 1633
Joined: 05 May 2005

Re: The Sudoku grey zone

Postby denis_berthier » Thu May 23, 2013 11:42 am

Hi coloin,

So this explains the sk/E discrepancy.
Do you have any idea about the EE/E discrepancy ?

Re: The Sudoku grey zone

Postby champagne » Thu May 23, 2013 11:58 am

coloin wrote:Probably most of the puzzles with an SK-loop are known.

C

Hi coloin,

It is evident that any new entry in the database must be new relative to the previous ones, so if a deep search on patterns eligible for the sk loop has been made, later entries must have a lower ratio.

This is correct

Nevertheless, the SK loop can also be found with 24-25 clues, with less chance of getting the favourable pattern.

In my last run, for pending new entries, roughly 1% of the puzzles have the sk loop.

Re: The Sudoku grey zone

Postby coloin » Thu May 23, 2013 4:06 pm

Well, I hadn't realized that they could be present with 25 clues.
The first puzzles which had exocets were found by chance and we weren't really aware. A vicinity search on these puzzles by more recent contributors probably explains why they are much more frequent.
C

Re: Anomalies in the hardest list

Postby eleven » Thu May 23, 2013 5:02 pm

denis_berthier wrote: AFAIK, GP, dob and elev used similar generation techniques, it is strange that they get similar EE/E ratios but so different sk/E ratios.

I refound a good part of tarek's puzzles, so my real (overall) ratio sk/E was higher.
[Added:] To be more precise: In my summary 2 years ago I wrote:
[I found] Almost 90% of the known ER 11+ puzzles (independently - only 92 of them passed my rating filters)

Re: The Sudoku grey zone

Postby denis_berthier » Fri May 24, 2013 6:24 am

So, we now have a better idea of what's in the "potential hardest" list.

If using it as a starting point for a vicinity search for any pattern with lower SER, say JExocet (or Exocet), it's more or less obvious that the result will be strongly biased.
There may be several ways of getting a (vague) idea of the possible bias:
- keep track of how many +1/-1 steps there are between the original puzzle and the final one and state the results as a function of this distance;
- keep track of the Exocets in the original puzzle and those in the final one and check how the results vary if we count all the Exocets in the final puzzles or only those that weren't already there in the original ones.
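The first suggestion (stating the results as a function of the number of +1/-1 steps) could be sketched as a breadth-first expansion that records each puzzle's distance from the seed set and tallies the pattern frequency per distance. Everything here is a hypothetical illustration: `neighbours` and `has_pattern` stand for a vicinity generator and a pattern detector that would have to be supplied by an actual generator and solver.

```python
from collections import deque
from typing import Callable, Dict, Hashable, Iterable, Tuple


def pattern_rate_by_distance(
    seeds: Iterable[Hashable],
    neighbours: Callable[[Hashable], Iterable[Hashable]],  # hypothetical vicinity step
    has_pattern: Callable[[Hashable], bool],                # hypothetical detector
    max_steps: int,
) -> Dict[int, Tuple[int, int]]:
    """Return {distance: (puzzles with the pattern, puzzles reached)}.

    Distance is the smallest number of vicinity steps from any seed.
    """
    dist = {s: 0 for s in seeds}
    queue = deque(dist)
    while queue:
        p = queue.popleft()
        if dist[p] == max_steps:
            continue                      # do not expand beyond the limit
        for q in neighbours(p):
            if q not in dist:             # first visit gives the shortest distance
                dist[q] = dist[p] + 1
                queue.append(q)

    tally: Dict[int, Tuple[int, int]] = {}
    for p, d in dist.items():
        hits, total = tally.get(d, (0, 0))
        tally[d] = (hits + (1 if has_pattern(p) else 0), total + 1)
    return tally
```

If the pattern frequency falls steadily as the distance grows, that would be a hint of how much of the observed frequency is inherited from the seeds rather than typical of the target SER range.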

Re: The Sudoku grey zone

Postby champagne » Tue May 28, 2013 1:32 pm

coloin wrote:Probably most of the puzzles with an SK-loop are known.

C


Hi coloin,

I am running a first test in the grey zone and I can already tell you that this is not correct.

Years ago I studied gsf's taxonomy file and found many puzzles with a relatively low rating having the sk loop,
like this one, which I used as an example on my website:

Sample1 Coloin 02805 in gsf list (as of 2008 02 26)
600000002090400050001000700050943000000105000000800040007000600030009080200000001
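The 81-character string lists the grid row by row, with `0` for an empty cell. A small Python sketch (the helper names are illustrative assumptions) that counts the clues in this sample, overall and per box:

```python
PUZZLE = ("600000002" "090400050" "001000700"
          "050943000" "000105000" "000800040"
          "007000600" "030009080" "200000001")


def clue_count(puzzle: str) -> int:
    """Number of given digits in an 81-character puzzle string."""
    return sum(ch != "0" for ch in puzzle)


def box_clue_count(puzzle: str, box: int) -> int:
    """Number of clues in one box; boxes are numbered 1..9, row by row."""
    top, left = 3 * ((box - 1) // 3), 3 * ((box - 1) % 3)
    return sum(puzzle[9 * (top + i) + left + j] != "0"
               for i in range(3) for j in range(3))


if __name__ == "__main__":
    total = clue_count(PUZZLE)
    centre = box_clue_count(PUZZLE, 5)
    print(f"clues: {total}, in box 5: {centre}, outside box 5: {total - centre}")
```

For this sample the count is 22 clues, 6 of them in box 5 and 16 outside it, which matches the shape of the 16-clue backbone (everything but box 5) that coloin described earlier in the thread.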

The first "V" loop not recognised by ronk as an SK loop was seen in a puzzle with a rating around SER 9.0

What will be interesting is to see the evolution of the frequency when the average rating goes down.

How frequent are the J-Exocets in the grey zone?

Postby denis_berthier » Mon Jun 03, 2013 3:51 pm



How frequent are the J-Exocets in the grey zone?


For the 5,926,343 puzzles in the controlled-bias collection produced by the controlled-bias generator (*), I had computed long ago:
- the SER for the first 3,037,717
- the W rating for the whole collection (at that time, instead of "W rating", I said "pure NRCZT rating" but it's the same thing).
(*) For details about this, see the "real distribution of minimal puzzles" thread.

Considering only the first 3,037,717:
5615 have SER >= 9.0
2353 have W >= 8
664 have W >= 9


I have looked for JExocets in the 664 W>=9 cases, which corresponds to a stricter definition of the lower bound of the grey zone than SER >= 9.0

I activated JEs and the rules of SSTS - i.e. Whips[1] and (Naked, Hidden and Super-Hidden) Subset rules - and nothing else.
In order to avoid degenerate cases, JE's of any size were assigned lower priority than all the rules in SSTS.
By JE, I mean standard Jk-Exocets, with k = 2, 3, 4 or 5 (as defined in this post http://forum.enjoysudoku.com/pattern-based-classification-of-hard-puzzles-t30493-85.html). Franken or Blue's extensions were not taken into account (for the only reason that they are not programmed in SudoRules), but I don't think they could lead to very different stats.

No JExocet was found in any of these puzzles; so, the calculations for a rough estimate of the unbiased frequency in the grey zone shouldn't require a doctorate in statistics and I won't invest more personal time on this topic.


[Added: How frequent are sk-loops?: none found in this sample]
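As an illustration of the kind of rough estimate meant here: when a pattern is found 0 times in n independent draws, a one-sided 95% upper bound on its frequency is 1 - 0.05^(1/n), which is close to 3/n (the "rule of three"). The sketch below applies this to n = 664. It is only a back-of-the-envelope calculation under the assumption that the 664 puzzles can be treated as a simple random sample; it ignores any correction for the known clue-count bias of the controlled-bias generator. The function name is an illustrative assumption.

```python
def zero_hit_upper_bound(n: int, confidence: float = 0.95) -> float:
    """Upper bound on a frequency p when 0 hits were seen in n independent trials.

    Solves (1 - p)**n = 1 - confidence for p.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


if __name__ == "__main__":
    n = 664  # puzzles with W >= 9 that were examined
    exact = zero_hit_upper_bound(n)
    approx = 3.0 / n  # rule-of-three approximation
    print(f"95% upper bound: {100 * exact:.2f}% (rule of three: {100 * approx:.2f}%)")
```

Under these assumptions the bound is a little under half a percent, and the same figure applies to sk-loops, since none were found in this sample either.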

Distribution of clues in the grey zone

Postby denis_berthier » Tue Jul 02, 2013 4:40 am



Distribution of clues in the grey zone


This question arose in another thread.

I haven't defined the grey zone in a very precise way. Depending on how I make it more precise, SER >= 9.0 or W>=9, different calculations can be done but, as far as the number of clues is concerned, they don't lead to significantly different results.

If I consider the whole collection of 5,926,343 puzzles generated by the controlled-bias generator, 1258 have their W rating >= 9.
The raw distribution of clues for them is as follows:

Code: Select all
nb-clues   nb-instances  %
19         0
20         0
21         0
22         0
23         22            1.7
24         106           8.4
25         306           24.3
26         415           33.0
27         288           22.9
28         102           8.1
29         17            1.4
30         2             0.2
31         0
32         0
33         0
34         0
35         0
mean= 25.97
standard-deviation= 1.20


If I consider only the first 3,037,717 for which I had computed the SER, 5615 have their SER >= 9.0. The raw distribution of clues for them is:

Code: Select all
nb-clues   nb-instances    %
19         0
20         0
21         0
22         2               0.04
23         46              0.8
24         416             7.4
25         1319            23.5
26         1915            34.1
27         1380            24.6
28         440             7.8
29         90              1.6
30         7               0.1
31         0
32         0
33         0
34         0
35         0
mean= 26.05
standard-deviation= 1.15



For comparison, I recall the data for the whole cb-sample (see p.43 of the pdf in the "real distribution" thread):

Code: Select all
nb-clues  nb-instances     %               
20        2                3.37e-05
21        164              0.0027         
22        6,651            0.1124         
23        110,103          1.858         
24        704,089          11.88       
25        1,814,413        30.62         
26        2,002,349        33.79         
27        1,007,700        17.00         
28        247,259          4.172         
29        31,449           0.531         
30        2,088            0.0352       
31        74               0.00125       
32        2                3.37e-05     
mean= 25.67
standard-deviation= 1.12
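The totals, means and standard deviations quoted under the three tables can be recomputed from the raw counts. A minimal Python sketch (the variable and function names are illustrative assumptions; the counts are copied from the tables, and the population form of the standard deviation is assumed):

```python
import math
from typing import Dict, Tuple

# nb-clues -> nb-instances, copied from the three tables above.
W_GE_9: Dict[int, int] = {23: 22, 24: 106, 25: 306, 26: 415,
                          27: 288, 28: 102, 29: 17, 30: 2}
SER_GE_9: Dict[int, int] = {22: 2, 23: 46, 24: 416, 25: 1319, 26: 1915,
                            27: 1380, 28: 440, 29: 90, 30: 7}
WHOLE_CB: Dict[int, int] = {20: 2, 21: 164, 22: 6651, 23: 110103,
                            24: 704089, 25: 1814413, 26: 2002349,
                            27: 1007700, 28: 247259, 29: 31449,
                            30: 2088, 31: 74, 32: 2}


def summarize(hist: Dict[int, int]) -> Tuple[int, float, float]:
    """Return (number of puzzles, mean clue count, population std deviation)."""
    n = sum(hist.values())
    mean = sum(k * c for k, c in hist.items()) / n
    var = sum(c * (k - mean) ** 2 for k, c in hist.items()) / n
    return n, mean, math.sqrt(var)


if __name__ == "__main__":
    for name, hist in (("W >= 9", W_GE_9),
                       ("SER >= 9.0", SER_GE_9),
                       ("whole cb-sample", WHOLE_CB)):
        n, mean, std = summarize(hist)
        print(f"{name:16s} n={n:>9,d}  mean={mean:.2f}  std={std:.2f}")
```

The recomputed figures agree with the ones quoted above: both grey-zone subsets have a mean clue count roughly 0.3 to 0.4 higher than the whole controlled-bias sample.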

