17-clue and 18-clue Sudoku update

Everything about Sudoku that doesn't fit in one of the other sections

Postby JPF » Wed Aug 01, 2007 7:40 am

coloin wrote:With canicalization gsf can get ...

what kind of strange animal can he get ?:D

JPF
JPF
2017 Supporter
 
Posts: 6127
Joined: 06 December 2005
Location: Paris, France

Postby Pat » Thu Aug 02, 2007 10:37 am

JPF wrote:what kind of strange animal can he get ?

      the walking fern
User avatar
Pat
 
Posts: 4056
Joined: 18 July 2005

Postby gsf » Thu Aug 02, 2007 2:47 pm

gfroyle wrote:
gsf wrote:so the equivalence classes are easy to derive
but this does not guarantee that all {-2+2} members are represented in each class
that would take the longer computation of generating each class by applying {-2+2}* to one member of each class


Seems too fast to me....

Let me clarify my understanding of what is happening

(1) Given a puzzle P, you can generate the list of 136 15-clue pseudo-puzzles - call that L(P)

(2) Two puzzles P and Q are DIRECTLY {-2+2} related if and only if L(P) and L(Q) contain at least one pseudo-puzzle in common

But now I cannot see how you can quickly compute equivalence classes from this information ... essentially we need to know all the DIRECT relationships (which seems to require checking L(P_i) against L(P_j) for all i \not= j)....

I'm sure I have missed something, probably obvious...

thanks for the back of the envelope skepticism
I had taken too simple an approach and did not reach closure
I added a check pass after the computation loop to verify your suspicions
basically one pass over the data gave a distance 1 approximation to the true closure
by making the first pass iterative the approximation from each pass is refined
15 rounds of refinement finally reached closure for |G| = 41176 puzzles in
real 1h21m50.23s
user 1h21m34.99s
sys 0m14.77s
this was done by a ksh script -- each round does lookups from scratch so it could be improved

the equivalence class sizes confirm some suspicions I have about some of the alternate
background processes I have running (which have all been stuck, apparently on the same class,
attempting to generate the closure of that one class)

the 1h21m computation shows why
here is a table of the #classes #members-in-class for the first 41176 puzzles in G
Code: Select all
8507 1
2665 2
 808 3
 490 4
 232 5
 155 6
  82 7
  78 8
  42 9
  39 10
  13 11
  23 12
  13 13
   9 14
  10 15
  14 16
   5 17
   5 18
   5 19
   2 20
   3 21
   3 22
   3 23
   2 24
   4 25
   3 26
   2 27
   1 28
   2 30
   1 31
   1 32
   1 33
   1 35
   1 41
   2 44
   2 46
   1 47
   1 61
   1 73
   1 77
   1 81
   1 16244

that last whopper, 1 class containing 16244 puzzles, is at the root of the direct closure stalls
looks like we have Pangaea and a bunch of micro islands, most of which have equivalence class size 1

this is for the {-2+2} data
I'll see if crunching the {-3+3} data is tractable
initial guess is it should take 10x longer
gsf
2014 Supporter
 
Posts: 7306
Joined: 21 September 2005
Location: NJ USA

Postby JPF » Thu Aug 02, 2007 3:20 pm

Very interesting and surprising results, gsf !

21% of the actual puzzles of G are {-2+2}-isolated and 39% {-2+2}-interconnected.
hmm.
I'd love to see one of this equivalence class ; the one with 81 puzzles ?

JPF
JPF
2017 Supporter
 
Posts: 6127
Joined: 06 December 2005
Location: Paris, France

Postby coloin » Thu Aug 02, 2007 6:22 pm

Indeed, well done.

These results are of course on the known 17 puzzles.

Which of the classes have been demonstrated to be closed ?

I am suspicious of some of the 17s #35000-#40000 which i am sure have not been subjected to a full{-2+2}. These will come out in the wash in due course !

However, I am still trying to understand why there should be 81 17-puzzles with over 10000 15-subpuzzles.....and not one is the same as the 2000000 different 15-subpuzzles in our Pangaea [P] class.

C
coloin
 
Posts: 2385
Joined: 05 May 2005
Location: Devon

Postby kjellfp » Thu Aug 02, 2007 7:42 pm

The {2}*-class listing was very surprising, gsf.

How about {1}* ? Does it also contain a Pangaea, or is it split into several continents?
kjellfp
 
Posts: 140
Joined: 04 October 2005

Postby gsf » Fri Aug 03, 2007 4:37 am

JPF wrote:I'd love to see one of this equivalence class ; the one with 81 puzzles ?

sorry for the delay
I had composed the reply and apparently forgot to submit

the {1}* ({-1+1} closure) and {2}* results for G index 1 through 41176 are posted at G.1c.gz and G.2c.gz

the line format is
Code: Select all
puzzle   G-index   class-index

where G-index is the index assigned by Gordon and class-index is the min G-index of all members in the class
the files are sorted by < class-index G-index >

this unix pipeline lists the size of each class
Code: Select all
cut -d' ' -f3 G.2c | sort | uniq -c | sort -n

which shows that the class with 81 members is 06892
this unix command lists that class
Code: Select all
grep '  06892$' G.2c


kjellfp wrote:How about {1}* ? Does it also contain a Pangaea, or is it split into several continents?


here are the #classes and class-size for {1}* -- mostly isolated
Code: Select all
19924 1
 5013 2
 1078 3
  622 4
  236 5
  133 6
   75 7
   50 8
   32 9
   23 10
   21 11
   19 12
    9 13
    7 14
    1 15
    7 16
    2 17
    2 18
    3 19
    4 20
    1 21
    2 22
    2 23
    2 24
    1 26
    3 27
    1 28
    1 29
    2 30
    1 33
    1 37
    1 45
    1 53
    3 60
    1 87
    1 257

computed by
Code: Select all
cut -d' ' -f3 G.1c | sort | uniq -c | sort -n | sed -e 's/^  *//' -e 's/ .*//' | uniq -c
gsf
2014 Supporter
 
Posts: 7306
Joined: 21 September 2005
Location: NJ USA

Postby gsf » Fri Aug 03, 2007 4:13 pm

a few pm's and double/triple checking show that the equivalence classes I posted were
based on data with some missing connections
which resulted in too many classes, i.e., some classes I labelled as distinct are in reality one class
(some inappropriate {-n+m} pruning slippped into the {-n} subgrid generation)
I believe that the raw closure code in my solver is correct
its just data derived from that that has the problem

I'll post data corrections by tomorrow
and also the methodolgy so the results can be replicated -- or refuted

this is an interesting offshoot because data size for {-3+3} and up is large enough to require
external store or other space saving algorithm adjustments
gsf
2014 Supporter
 
Posts: 7306
Joined: 21 September 2005
Location: NJ USA

Postby gsf » Tue Aug 07, 2007 6:13 am

well the update to the equivalence partitioning has taken longer than expected
the data sizes require a bit more than brute force

it turns out that determining equivalence classes is equivalent to finding connected components in a graph,
and that is linear time complexity

what took so long is that for |G| = 41324 the {4}* ({-4+4} closure) defines a graph with 98,351,120 edges,
with a graph data representation (using at&t's graphviz tools) file size of 8.1Gi
linear or not, that size input can bog down a connected components implementation

with the right massaging and a home brew algorithm I was able to compute the closure in 1 min
40min to compute the {4}* data using my solver, 16min to massage, 1min to compute connected components
graphviz ccomps took 72min on the massaged data

you can generate the initial data with my solver version 2007-08-01
Code: Select all
sudoku -goce{-4} -Fc'%05#in %#kc' g.dat > g.edges

(41324 input 17's produces 8.1Gi edge data -- change the 4 to 1,2,3 for different on/off edge data)

the graphviz ccomps command and my home brew agree for {1}*, but it would be good to get
independent corroboration on the equivalence class counts and sizes

here are the #classes/class-size for {1}* {2}* {3}* and {4}*:
Code: Select all
=== 41324 {1}* ===

18950    1
 4306    2
 1095    3
  645    4
  259    5
  170    6
   97    7
   59    8
   47    9
   30   10
   26   11
   18   12
   16   13
    8   14
    3   15
    9   16
    5   17
    5   18
    4   19
    4   20
    1   21
    3   22
    1   23
    5   24
    5   25
    2   27
    1   28
    1   30
    2   31
    2   32
    1   33
    1   37
    1   38
    2   41
    1   42
    1   50
    1   55
    1   87
    1  173
    1  560
    1  616

=== 41324 {2}* ===

 7734    1
 2214    2
  724    3
  424    4
  213    5
  135    6
   70    7
   65    8
   39    9
   29   10
   17   11
   17   12
   12   13
    7   14
    5   15
   12   16
    5   17
    4   18
    7   19
    2   20
    1   21
    3   22
    6   23
    4   24
    1   25
    1   26
    1   28
    2   30
    1   31
    1   32
    1   33
    3   35
    1   39
    1   43
    1   44
    1   56
    1   58
    1   64
    1   65
    1   70
    1   80
    2   82
    1 19182

=== 41324 {3}* ===

 1205    1
  240    2
   62    3
   38    4
   23    5
   16    6
    5    7
    1    8
    2    9
    1   10
    2   11
    1   12
    1   13
    1   22
    1 38950

=== 41324 {4}* ===

   12    1
    3    2
    1 41306

e.g., for {4}* there are 12 classes of size 1, 3 classes of size 2, and one class of size 41306

so what does this mean
first of all, the data posted above is empirical, based on the 17 catalog G with |G|=41324
the {4}* data shows that the catalog as it stands looks like it may be closed under {-4+4}
i.e., every puzzle in G may be within {-4+4} of every other puzzle

this could be proven by example by selecting a puzzle from the size 41306 class and
doing a {-4+4} closure and seeing if the other classes are absorbed (would that we could compute that expediently)

on the tractable side, it might mean that starting with puzzles in the other 15 classes might produce one or
more new 17's that bridge to the pangaea class

here are the puzzle indices and equivalence class label (lowest puzzle index in the class)
for the {4}* non-pangaea classes:
Code: Select all
00506 00506
01467 01467
01835 01835
12703 12703
14037 14037
20069 20069
21276 21276
22575 22575
30563 30563
30564 30563
32043 32043
40609 40609
40610 40609
40611 40611
40734 40734
41077 41077
41214 41077
41100 41100

e.g., { 40609 40610 } are in the 40609 {4}* class and 41100 is in a singleton class
gsf
2014 Supporter
 
Posts: 7306
Joined: 21 September 2005
Location: NJ USA

Postby gsf » Wed Aug 08, 2007 6:08 pm

an update on the order of the closure for G
a colleague introduced me to a linear time (in number of edges) algorithm for connected components
and another related algorithm for disjoint set union (the basic algorithms behind the closure computations)

using two different algorithms I just confirmed that G, |G|=41347, is closed under {-5+5}
the {5}* data took 1h44m to generate, the size is 21Gi
eliminating dup records took another 53m and brought it down to 1.5Gi:
104557112 edges, 36151486 inequivalent 12-clue subgrids (treated as vertices), and 41347 inequivalent grids (also treated as vertices)
both algorithms took 1m13s to compute the closure, resulting in 1 equivalence class
remarkable that two different algorithms, one linear, and one (possibly) quadratic in the worst case, matched runtimes so closely

this doesn't necessirily mean that the set of all 17 sudoku is closed under {-5+5}
there could be another 17 or group of 17's, yet to be discovered, in a separate partition

do we have an idea of the number of inequivalent 12-clue subgrids?
that may give a handle on how close the 41347 (which cover 36151486) are to being complete
gsf
2014 Supporter
 
Posts: 7306
Joined: 21 September 2005
Location: NJ USA

Postby Havard » Wed Aug 08, 2007 8:38 pm

Great work gsf, very impressive!:)

gsf wrote:do we have an idea of the number of inequivalent 12-clue subgrids?
that may give a handle on how close the 41347 (which cover 36151486) are to being complete


I know what you are thinking, but I can shake the feeling that we still don't know how many 17 can come out of the 36151486 subgrids. I am sure that if you did a "5on" on all those grids (would take forever, i know) would produce a lot more 17 than we already have. And since we are not aware of the "completeness" of our 17 collection in relation to the 12-subgrids, I am not sure a number of "total possible 12 subgrids" would be able to tell us much?

The same calculation done for 15 subgrids would be much more accurate, since we are pretty sure that we have all 17 that can come out of them. And calcultating the total number of 15 subgrids could even be easier then 12, since a lot more restrictions in constructing it will apply?

But these are just rantings...:) Great work again!:)

Havard
Havard
 
Posts: 378
Joined: 25 December 2005

Postby gsf » Wed Aug 08, 2007 9:57 pm

Havard wrote:
gsf wrote:do we have an idea of the number of inequivalent 12-clue subgrids?
that may give a handle on how close the 41347 (which cover 36151486) are to being complete


I know what you are thinking, but I can shake the feeling that we still don't know how many 17 can come out of the 36151486 subgrids. I am sure that if you did a "5on" on all those grids (would take forever, i know) would produce a lot more 17 than we already have. And since we are not aware of the "completeness" of our 17 collection in relation to the 12-subgrids, I am not sure a number of "total possible 12 subgrids" would be able to tell us much?

basically some back of the envelope thoughts
take it to extremes
suppose there were exactly 36151486 12 clue subgrids
then we would know that there is only 1 {-5+5} 17 clue equivalence class
now in the continuum of estimates for the number of 12 clue subgrids
at some point, maybe some order of magnitude times 36151486, we might be swayed to think there were more than one {-5+5} equivalence class

so I'm just grabbing for any hints on how to direct the 17 search
gsf
2014 Supporter
 
Posts: 7306
Joined: 21 September 2005
Location: NJ USA

Postby ronk » Wed Aug 08, 2007 11:36 pm

gsf wrote:using two different algorithms I just confirmed that G, |G|=41347, is closed under {-5+5}

So unknown 17s have Hamming distance d > 10 from all members of (code) set G. That would seem to be a rather significant result. Are there any pairs of set members with d = 34?
ronk
2012 Supporter
 
Posts: 4764
Joined: 02 November 2005
Location: Southeastern USA

Postby gsf » Thu Aug 09, 2007 2:03 am

ronk wrote:
gsf wrote:using two different algorithms I just confirmed that G, |G|=41347, is closed under {-5+5}

So unknown 17s have Hamming distance d > 10 from all members of (code) set G ...

... if/when we are able to compute the {-5+5} closure of G
ronk wrote:Are there any pairs of set members with d = 34?

this computes hamming distance in my solver:
Code: Select all
sudoku -C p1.dat p2.dat

I believe its correct, but its fairly compute intensive, so computing all 40K*40K will take too long
I checked the distance between the {4}* outliers and the max is
edit a subsequent post noted a problem with the distance algorithm -- it has been fixed and resultes corrected here
Code: Select all
distance 16
......6.4....81.........7...5.6..8...4...7......2.....2.7....1.1...3.......5.....
..3...........1...8....97.....6..8...4.....5..1.......2.7...91....4..6.....5.....

the largest known distance in G (a feeble search, only 2299658 puzzle pairs)
edit feeble search data elided because it was based on a flawed distance measure
Last edited by gsf on Thu Aug 09, 2007 11:19 am, edited 1 time in total.
gsf
2014 Supporter
 
Posts: 7306
Joined: 21 September 2005
Location: NJ USA

Postby ronk » Thu Aug 09, 2007 2:54 am

gsf wrote:
ronk wrote:
gsf wrote:using two different algorithms I just confirmed that G, |G|=41347, is closed under {-5+5}

So unknown 17s have Hamming distance d > 10 from all members of (code) set G ...

... if/when we are able to compute the {-5+5} closure of G

Color me confused, as I thought that's what your first statement above said.
ronk
2012 Supporter
 
Posts: 4764
Joined: 02 November 2005
Location: Southeastern USA

PreviousNext

Return to General