## 17-clue and 18-clue Sudoku update

Everything about Sudoku that doesn't fit in one of the other sections
coloin wrote:With canicalization gsf can get ...

what kind of strange animal can he get ?

JPF
JPF
2017 Supporter

Posts: 3754
Joined: 06 December 2005
Location: Paris, France

JPF wrote:what kind of strange animal can he get ?

the walking fern

Pat

Posts: 3641
Joined: 18 July 2005

gfroyle wrote:
gsf wrote:so the equivalence classes are easy to derive
but this does not guarantee that all {-2+2} members are represented in each class
that would take the longer computation of generating each class by applying {-2+2}* to one member of each class

Seems too fast to me....

Let me clarify my understanding of what is happening

(1) Given a puzzle P, you can generate the list of 136 15-clue pseudo-puzzles - call that L(P)

(2) Two puzzles P and Q are DIRECTLY {-2+2} related if and only if L(P) and L(Q) contain at least one pseudo-puzzle in common

But now I cannot see how you can quickly compute equivalence classes from this information ... essentially we need to know all the DIRECT relationships (which seems to require checking L(P_i) against L(P_j) for all i \not= j)....

I'm sure I have missed something, probably obvious...

thanks for the back of the envelope skepticism
I had taken too simple an approach and did not reach closure
I added a check pass after the computation loop to verify your suspicions
basically one pass over the data gave a distance 1 approximation to the true closure
by making the first pass iterative the approximation from each pass is refined
15 rounds of refinement finally reached closure for |G| = 41176 puzzles in
real 1h21m50.23s
user 1h21m34.99s
sys 0m14.77s
this was done by a ksh script -- each round does lookups from scratch so it could be improved

the equivalence class sizes confirm some suspicions I have about some of the alternate
background processes I have running (which have all been stuck, apparently on the same class,
attempting to generate the closure of that one class)

the 1h21m computation shows why
here is a table of the #classes #members-in-class for the first 41176 puzzles in G
Code: Select all
`8507 12665 2 808 3 490 4 232 5 155 6  82 7  78 8  42 9  39 10  13 11  23 12  13 13   9 14  10 15  14 16   5 17   5 18   5 19   2 20   3 21   3 22   3 23   2 24   4 25   3 26   2 27   1 28   2 30   1 31   1 32   1 33   1 35   1 41   2 44   2 46   1 47   1 61   1 73   1 77   1 81   1 16244`

that last whopper, 1 class containing 16244 puzzles, is at the root of the direct closure stalls
looks like we have Pangaea and a bunch of micro islands, most of which have equivalence class size 1

this is for the {-2+2} data
I'll see if crunching the {-3+3} data is tractable
initial guess is it should take 10x longer
gsf
2014 Supporter

Posts: 7306
Joined: 21 September 2005
Location: NJ USA

Very interesting and surprising results, gsf !

21% of the actual puzzles of G are {-2+2}-isolated and 39% {-2+2}-interconnected.
hmm.
I'd love to see one of this equivalence class ; the one with 81 puzzles ?

JPF
JPF
2017 Supporter

Posts: 3754
Joined: 06 December 2005
Location: Paris, France

Indeed, well done.

These results are of course on the known 17 puzzles.

Which of the classes have been demonstrated to be closed ?

I am suspicious of some of the 17s #35000-#40000 which i am sure have not been subjected to a full{-2+2}. These will come out in the wash in due course !

However, I am still trying to understand why there should be 81 17-puzzles with over 10000 15-subpuzzles.....and not one is the same as the 2000000 different 15-subpuzzles in our Pangaea [P] class.

C
coloin

Posts: 1733
Joined: 05 May 2005

The {2}*-class listing was very surprising, gsf.

How about {1}* ? Does it also contain a Pangaea, or is it split into several continents?
kjellfp

Posts: 140
Joined: 04 October 2005

JPF wrote:I'd love to see one of this equivalence class ; the one with 81 puzzles ?

sorry for the delay

the {1}* ({-1+1} closure) and {2}* results for G index 1 through 41176 are posted at G.1c.gz and G.2c.gz

the line format is
Code: Select all
`puzzle   G-index   class-index`

where G-index is the index assigned by Gordon and class-index is the min G-index of all members in the class
the files are sorted by < class-index G-index >

this unix pipeline lists the size of each class
Code: Select all
`cut -d' ' -f3 G.2c | sort | uniq -c | sort -n`

which shows that the class with 81 members is 06892
this unix command lists that class
Code: Select all
`grep '  06892\$' G.2c`

kjellfp wrote:How about {1}* ? Does it also contain a Pangaea, or is it split into several continents?

here are the #classes and class-size for {1}* -- mostly isolated
Code: Select all
`19924 1 5013 2 1078 3  622 4  236 5  133 6   75 7   50 8   32 9   23 10   21 11   19 12    9 13    7 14    1 15    7 16    2 17    2 18    3 19    4 20    1 21    2 22    2 23    2 24    1 26    3 27    1 28    1 29    2 30    1 33    1 37    1 45    1 53    3 60    1 87    1 257`

computed by
Code: Select all
`cut -d' ' -f3 G.1c | sort | uniq -c | sort -n | sed -e 's/^  *//' -e 's/ .*//' | uniq -c`
gsf
2014 Supporter

Posts: 7306
Joined: 21 September 2005
Location: NJ USA

a few pm's and double/triple checking show that the equivalence classes I posted were
based on data with some missing connections
which resulted in too many classes, i.e., some classes I labelled as distinct are in reality one class
(some inappropriate {-n+m} pruning slippped into the {-n} subgrid generation)
I believe that the raw closure code in my solver is correct
its just data derived from that that has the problem

I'll post data corrections by tomorrow
and also the methodolgy so the results can be replicated -- or refuted

this is an interesting offshoot because data size for {-3+3} and up is large enough to require
external store or other space saving algorithm adjustments
gsf
2014 Supporter

Posts: 7306
Joined: 21 September 2005
Location: NJ USA

well the update to the equivalence partitioning has taken longer than expected
the data sizes require a bit more than brute force

it turns out that determining equivalence classes is equivalent to finding connected components in a graph,
and that is linear time complexity

what took so long is that for |G| = 41324 the {4}* ({-4+4} closure) defines a graph with 98,351,120 edges,
with a graph data representation (using at&t's graphviz tools) file size of 8.1Gi
linear or not, that size input can bog down a connected components implementation

with the right massaging and a home brew algorithm I was able to compute the closure in 1 min
40min to compute the {4}* data using my solver, 16min to massage, 1min to compute connected components
graphviz ccomps took 72min on the massaged data

you can generate the initial data with my solver version 2007-08-01
Code: Select all
`sudoku -goce{-4} -Fc'%05#in %#kc' g.dat > g.edges`

(41324 input 17's produces 8.1Gi edge data -- change the 4 to 1,2,3 for different on/off edge data)

the graphviz ccomps command and my home brew agree for {1}*, but it would be good to get
independent corroboration on the equivalence class counts and sizes

here are the #classes/class-size for {1}* {2}* {3}* and {4}*:
Code: Select all
`=== 41324 {1}* ===18950    1 4306    2 1095    3  645    4  259    5  170    6   97    7   59    8   47    9   30   10   26   11   18   12   16   13    8   14    3   15    9   16    5   17    5   18    4   19    4   20    1   21    3   22    1   23    5   24    5   25    2   27    1   28    1   30    2   31    2   32    1   33    1   37    1   38    2   41    1   42    1   50    1   55    1   87    1  173    1  560    1  616=== 41324 {2}* === 7734    1 2214    2  724    3  424    4  213    5  135    6   70    7   65    8   39    9   29   10   17   11   17   12   12   13    7   14    5   15   12   16    5   17    4   18    7   19    2   20    1   21    3   22    6   23    4   24    1   25    1   26    1   28    2   30    1   31    1   32    1   33    3   35    1   39    1   43    1   44    1   56    1   58    1   64    1   65    1   70    1   80    2   82    1 19182=== 41324 {3}* === 1205    1  240    2   62    3   38    4   23    5   16    6    5    7    1    8    2    9    1   10    2   11    1   12    1   13    1   22    1 38950=== 41324 {4}* ===   12    1    3    2    1 41306`

e.g., for {4}* there are 12 classes of size 1, 3 classes of size 2, and one class of size 41306

so what does this mean
first of all, the data posted above is empirical, based on the 17 catalog G with |G|=41324
the {4}* data shows that the catalog as it stands looks like it may be closed under {-4+4}
i.e., every puzzle in G may be within {-4+4} of every other puzzle

this could be proven by example by selecting a puzzle from the size 41306 class and
doing a {-4+4} closure and seeing if the other classes are absorbed (would that we could compute that expediently)

on the tractable side, it might mean that starting with puzzles in the other 15 classes might produce one or
more new 17's that bridge to the pangaea class

here are the puzzle indices and equivalence class label (lowest puzzle index in the class)
for the {4}* non-pangaea classes:
Code: Select all
`00506 0050601467 0146701835 0183512703 1270314037 1403720069 2006921276 2127622575 2257530563 3056330564 3056332043 3204340609 4060940610 4060940611 4061140734 4073441077 4107741214 4107741100 41100`

e.g., { 40609 40610 } are in the 40609 {4}* class and 41100 is in a singleton class
gsf
2014 Supporter

Posts: 7306
Joined: 21 September 2005
Location: NJ USA

an update on the order of the closure for G
a colleague introduced me to a linear time (in number of edges) algorithm for connected components
and another related algorithm for disjoint set union (the basic algorithms behind the closure computations)

using two different algorithms I just confirmed that G, |G|=41347, is closed under {-5+5}
the {5}* data took 1h44m to generate, the size is 21Gi
eliminating dup records took another 53m and brought it down to 1.5Gi:
104557112 edges, 36151486 inequivalent 12-clue subgrids (treated as vertices), and 41347 inequivalent grids (also treated as vertices)
both algorithms took 1m13s to compute the closure, resulting in 1 equivalence class
remarkable that two different algorithms, one linear, and one (possibly) quadratic in the worst case, matched runtimes so closely

this doesn't necessirily mean that the set of all 17 sudoku is closed under {-5+5}
there could be another 17 or group of 17's, yet to be discovered, in a separate partition

do we have an idea of the number of inequivalent 12-clue subgrids?
that may give a handle on how close the 41347 (which cover 36151486) are to being complete
gsf
2014 Supporter

Posts: 7306
Joined: 21 September 2005
Location: NJ USA

Great work gsf, very impressive!

gsf wrote:do we have an idea of the number of inequivalent 12-clue subgrids?
that may give a handle on how close the 41347 (which cover 36151486) are to being complete

I know what you are thinking, but I can shake the feeling that we still don't know how many 17 can come out of the 36151486 subgrids. I am sure that if you did a "5on" on all those grids (would take forever, i know) would produce a lot more 17 than we already have. And since we are not aware of the "completeness" of our 17 collection in relation to the 12-subgrids, I am not sure a number of "total possible 12 subgrids" would be able to tell us much?

The same calculation done for 15 subgrids would be much more accurate, since we are pretty sure that we have all 17 that can come out of them. And calcultating the total number of 15 subgrids could even be easier then 12, since a lot more restrictions in constructing it will apply?

But these are just rantings... Great work again!

Havard
Havard

Posts: 377
Joined: 25 December 2005

Havard wrote:
gsf wrote:do we have an idea of the number of inequivalent 12-clue subgrids?
that may give a handle on how close the 41347 (which cover 36151486) are to being complete

I know what you are thinking, but I can shake the feeling that we still don't know how many 17 can come out of the 36151486 subgrids. I am sure that if you did a "5on" on all those grids (would take forever, i know) would produce a lot more 17 than we already have. And since we are not aware of the "completeness" of our 17 collection in relation to the 12-subgrids, I am not sure a number of "total possible 12 subgrids" would be able to tell us much?

basically some back of the envelope thoughts
take it to extremes
suppose there were exactly 36151486 12 clue subgrids
then we would know that there is only 1 {-5+5} 17 clue equivalence class
now in the continuum of estimates for the number of 12 clue subgrids
at some point, maybe some order of magnitude times 36151486, we might be swayed to think there were more than one {-5+5} equivalence class

so I'm just grabbing for any hints on how to direct the 17 search
gsf
2014 Supporter

Posts: 7306
Joined: 21 September 2005
Location: NJ USA

gsf wrote:using two different algorithms I just confirmed that G, |G|=41347, is closed under {-5+5}

So unknown 17s have Hamming distance d > 10 from all members of (code) set G. That would seem to be a rather significant result. Are there any pairs of set members with d = 34?
ronk
2012 Supporter

Posts: 4764
Joined: 02 November 2005
Location: Southeastern USA

ronk wrote:
gsf wrote:using two different algorithms I just confirmed that G, |G|=41347, is closed under {-5+5}

So unknown 17s have Hamming distance d > 10 from all members of (code) set G ...

... if/when we are able to compute the {-5+5} closure of G
ronk wrote:Are there any pairs of set members with d = 34?

this computes hamming distance in my solver:
Code: Select all
`sudoku -C p1.dat p2.dat`

I believe its correct, but its fairly compute intensive, so computing all 40K*40K will take too long
I checked the distance between the {4}* outliers and the max is
edit a subsequent post noted a problem with the distance algorithm -- it has been fixed and resultes corrected here
Code: Select all
`distance 16......6.4....81.........7...5.6..8...4...7......2.....2.7....1.1...3.......5.......3...........1...8....97.....6..8...4.....5..1.......2.7...91....4..6.....5.....`

the largest known distance in G (a feeble search, only 2299658 puzzle pairs)
edit feeble search data elided because it was based on a flawed distance measure
Last edited by gsf on Thu Aug 09, 2007 11:19 am, edited 1 time in total.
gsf
2014 Supporter

Posts: 7306
Joined: 21 September 2005
Location: NJ USA

gsf wrote:
ronk wrote:
gsf wrote:using two different algorithms I just confirmed that G, |G|=41347, is closed under {-5+5}

So unknown 17s have Hamming distance d > 10 from all members of (code) set G ...

... if/when we are able to compute the {-5+5} closure of G

Color me confused, as I thought that's what your first statement above said.
ronk
2012 Supporter

Posts: 4764
Joined: 02 November 2005
Location: Southeastern USA

PreviousNext