The New Sudoku Players' Forum

by **gsf** » Tue Aug 07, 2007 6:13 am

well the update to the equivalence partitioning has taken longer than expected
the data sizes require a bit more than brute force

it turns out that determining equivalence classes is equivalent to finding connected components in a graph,
and that is linear time complexity

what took so long is that for |G| = 41324 the {4}* ({-4+4} closure) defines a graph with 98,351,120 edges,
with a graph data representation (using at&t's graphviz tools) file size of 8.1Gi
linear or not, that size input can bog down a connected components implementation

with the right massaging and a home brew algorithm I was able to compute the closure in 1 min
40min to compute the {4}* data using my solver, 16min to massage, 1min to compute connected components
graphviz ccomps took 72min on the massaged data

you can generate the initial data with my solver version 2007-08-01

Code: Select all: sudoku -goce{-4} -Fc'%05#in %#kc' g.dat > g.edges

(41324 input 17's produces 8.1Gi edge data -- change the 4 to 1,2,3 for different on/off edge data)

the graphviz ccomps command and my home brew agree for {1}*, but it would be good to get
independent corroboration on the equivalence class counts and sizes

here are the #classes/class-size for {1}* {2}* {3}* and {4}*:

Code: Select all: === 41324 {1}* === 18950 1 4306 2 1095 3 645 4 259 5 170 6 97 7 59 8 47 9 30 10 26 11 18 12 16 13 8 14 3 15 9 16 5 17 5 18 4 19 4 20 1 21 3 22 1 23 5 24 5 25 2 27 1 28 1 30 2 31 2 32 1 33 1 37 1 38 2 41 1 42 1 50 1 55 1 87 1 173 1 560 1 616 === 41324 {2}* === 7734 1 2214 2 724 3 424 4 213 5 135 6 70 7 65 8 39 9 29 10 17 11 17 12 12 13 7 14 5 15 12 16 5 17 4 18 7 19 2 20 1 21 3 22 6 23 4 24 1 25 1 26 1 28 2 30 1 31 1 32 1 33 3 35 1 39 1 43 1 44 1 56 1 58 1 64 1 65 1 70 1 80 2 82 1 19182 === 41324 {3}* === 1205 1 240 2 62 3 38 4 23 5 16 6 5 7 1 8 2 9 1 10 2 11 1 12 1 13 1 22 1 38950 === 41324 {4}* === 12 1 3 2 1 41306

e.g., for {4}* there are 12 classes of size 1, 3 classes of size 2, and one class of size 41306

so what does this mean
first of all, the data posted above is empirical, based on the 17 catalog G with |G|=41324
the {4}* data shows that the catalog as it stands looks like it may be closed under {-4+4}
i.e., every puzzle in G may be within {-4+4} of every other puzzle

this could be proven by example by selecting a puzzle from the size 41306 class and
doing a {-4+4} closure and seeing if the other classes are absorbed (would that we could compute that expediently)

on the tractable side, it might mean that starting with puzzles in the other 15 classes might produce one or
more new 17's that bridge to the pangaea class

here are the puzzle indices and equivalence class label (lowest puzzle index in the class)
for the {4}* non-pangaea classes:

Code: Select all: 00506 00506 01467 01467 01835 01835 12703 12703 14037 14037 20069 20069 21276 21276 22575 22575 30563 30563 30564 30563 32043 32043 40609 40609 40610 40609 40611 40611 40734 40734 41077 41077 41214 41077 41100 41100

e.g., { 40609 40610 } are in the 40609 {4}* class and 41100 is in a singleton class

by **gsf** » Wed Aug 08, 2007 6:08 pm

an update on the order of the closure for G
a colleague introduced me to a linear time (in number of edges) algorithm for connected components
and another related algorithm for disjoint set union (the basic algorithms behind the closure computations)

using two different algorithms I just confirmed that G, |G|=41347, is closed under {-5+5}
the {5}* data took 1h44m to generate, the size is 21Gi
eliminating dup records took another 53m and brought it down to 1.5Gi:
104557112 edges, 36151486 inequivalent 12-clue subgrids (treated as vertices), and 41347 inequivalent grids (also treated as vertices)
both algorithms took 1m13s to compute the closure, resulting in 1 equivalence class
remarkable that two different algorithms, one linear, and one (possibly) quadratic in the worst case, matched runtimes so closely

this doesn't necessirily mean that the set of all 17 sudoku is closed under {-5+5}
there could be another 17 or group of 17's, yet to be discovered, in a separate partition

do we have an idea of the number of inequivalent 12-clue subgrids?
that may give a handle on how close the 41347 (which cover 36151486) are to being complete

by **Havard** » Wed Aug 08, 2007 8:38 pm

Great work gsf, very impressive!

gsf wrote:do we have an idea of the number of inequivalent 12-clue subgrids?
that may give a handle on how close the 41347 (which cover 36151486) are to being complete

I know what you are thinking, but I can shake the feeling that we still don't know how many 17 can come out of the 36151486 subgrids. I am sure that if you did a "5on" on all those grids (would take forever, i know) would produce a lot more 17 than we already have. And since we are not aware of the "completeness" of our 17 collection in relation to the 12-subgrids, I am not sure a number of "total possible 12 subgrids" would be able to tell us much?

The same calculation done for 15 subgrids would be much more accurate, since we are pretty sure that we have all 17 that can come out of them. And calcultating the total number of 15 subgrids could even be easier then 12, since a lot more restrictions in constructing it will apply?

But these are just rantings...

Great work again!

Havard

by **gsf** » Wed Aug 08, 2007 9:57 pm

Havard wrote:
gsf wrote:do we have an idea of the number of inequivalent 12-clue subgrids?
that may give a handle on how close the 41347 (which cover 36151486) are to being complete

I know what you are thinking, but I can shake the feeling that we still don't know how many 17 can come out of the 36151486 subgrids. I am sure that if you did a "5on" on all those grids (would take forever, i know) would produce a lot more 17 than we already have. And since we are not aware of the "completeness" of our 17 collection in relation to the 12-subgrids, I am not sure a number of "total possible 12 subgrids" would be able to tell us much?

basically some back of the envelope thoughts
take it to extremes
suppose there were exactly 36151486 12 clue subgrids
then we would know that there is only 1 {-5+5} 17 clue equivalence class
now in the continuum of estimates for the number of 12 clue subgrids
at some point, maybe some order of magnitude times 36151486, we might be swayed to think there were more than one {-5+5} equivalence class

so I'm just grabbing for any hints on how to direct the 17 search

by **ronk** » Wed Aug 08, 2007 11:36 pm

gsf wrote:using two different algorithms I just confirmed that G, |G|=41347, is closed under {-5+5}

So unknown 17s have Hamming distance d > 10 from all members of (code) set G. That would seem to be a rather significant result. Are there any pairs of set members with d = 34?

by **gsf** » Thu Aug 09, 2007 2:03 am

ronk wrote:
gsf wrote:using two different algorithms I just confirmed that G, |G|=41347, is closed under {-5+5}

So unknown 17s have Hamming distance d > 10 from all members of (code) set G ...

... if/when we are able to compute the {-5+5} closure of G

ronk wrote:Are there any pairs of set members with d = 34?

this computes hamming distance in my solver:

Code: Select all: sudoku -C p1.dat p2.dat

I believe its correct, but its fairly compute intensive, so computing all 40K*40K will take too long
I checked the distance between the {4}* outliers and the max is
edit a subsequent post noted a problem with the distance algorithm -- it has been fixed and resultes corrected here

Code: Select all: distance 16 ......6.4....81.........7...5.6..8...4...7......2.....2.7....1.1...3.......5..... ..3...........1...8....97.....6..8...4.....5..1.......2.7...91....4..6.....5.....

the largest known distance in G (a feeble search, only 2299658 puzzle pairs)
edit feeble search data elided because it was based on a flawed distance measure

by **ronk** » Thu Aug 09, 2007 2:54 am

gsf wrote:
ronk wrote:
gsf wrote:using two different algorithms I just confirmed that G, |G|=41347, is closed under {-5+5}

So unknown 17s have Hamming distance d > 10 from all members of (code) set G ...

... if/when we are able to compute the {-5+5} closure of G

Color me confused, as I thought that's what your first statement above said.

by **gsf** » Thu Aug 09, 2007 3:12 am

ronk wrote:
gsf wrote:
ronk wrote:
gsf wrote:using two different algorithms I just confirmed that G, |G|=41347, is closed under {-5+5}

So unknown 17s have Hamming distance d > 10 from all members of (code) set G ...

... if/when we are able to compute the {-5+5} closure of G

Color me confused, as I thought that's what your first statement above said.

aha
I had to rewrite a few times before sending -- and your confusion confirms I didn't get it quite right
there's too many closures in scope

the computation I did was to take G (the known 17s), do a {-5} (5-off) to derive all 12 clue subgrids,
and then did a closure based on the subgrids using a connected components and disjoint set union algorithms
this particular closure does not generate new 17sbut instead determines paths from any known 17 to any other known 17 via the 12s
this closure yielded one equivalence class
which means that all of the known 17s are in the {-5+5} closure
I think it was coloin that mentioned the possibility of this computation a few posts back

its neat that we can assert this without computing the entire {-5+5} closure

so my previous reply was based on this interpretation of unknown:
since we haven't confirmed the {-5+5} closure of G, some unknowns could be in that closure
there could also be other unknowns outside the closure or the {-5+5} closure could contain all 17s

by **gfroyle** » Thu Aug 09, 2007 6:09 am

gsf wrote:I had to rewrite a few times before sending -- and your confusion confirms I didn't get it quite right
there's too many closures in scope

Indeed...

Basically the class is CLOSED if it contains *all* 17s that can be obtained from any of the others by a "move" (in your case {-5+5} but the concept is general).

The class is CONNECTED if for any two 17s you can move from one to the other via an entire *sequence* of moves using other 17s as intermediate steps.

What we know from Glenn's work is that this graph is connected, even though we may not know the entire closure... basically at the moment we can view our existing collection of 17s as the "visible" portion of a possibly much larger "submerged" graph with new vertices waiting to be found. Doing a full {-5+5} on any vertex (impossible I know) is equivalent to completely hauling THAT VERTEX and all of its neighbours out of the water, thus possibly revealing new vertices.

What *I* want to know is more detail about the structure of the graph on the 17s... what are the degrees of the vertices (i.e. the number of neighbours that each puzzle/vertex has), which puzzles have the highest degree etc. (Could it even be that this graph is a scale-free graph in the Barabasi et al sense?)

I can see that at some stage I will have to compute this darn graph myself and just confirm your numbers....

Cheers

Gordon

by **ravel** » Thu Aug 09, 2007 8:27 am

gsf wrote:
Code: Select all
distance 20 .......1.4.........2...........5.4.7..8...3....1.9....3..4..2...5.1........8.6... ...2...1.4...3.6...............5.3....1....7.6...4....5.....4...8.7........1...2.

Can you explain ? I thought it would be 15:

Code: Select all: .......1.4.........2...........5.4.7..8...3....1.9....3..4..2...5.1........8.6... ....7...54..2..3...............5...78.....4....1.9....3..4......5......9...8..2..

by **gfroyle** » Thu Aug 09, 2007 9:10 am

ravel wrote:Can you explain ? I thought it would be 15:
Code: Select all
.......1.4.........2...........5.4.7..8...3....1.9....3..4..2...5.1........8.6... ....7...54..2..3...............5...78.....4....1.9....3..4......5......9...8..2..

Ravel is correct.. the Hamming distance is (at most) 15...

If I recall correctly, when gsf posted his original description of his Hamming distance implementation it was a two-step process of choosing a n equivalent puzzle in such a way as to:

- maximise overlap of positions, then
- maximise similarity of symbols in overlapping positions

What Ravel's example shows us is that a smaller overall distance can be obtained WITHOUT MAXIMUM OVERLAP provided you can make up for it with greater symbol similarity... noticed that the first representative overlaps in 11 positions, while the 2nd overlaps in only 10, but almost all of those 10 are "same symbol".

Great example...

Cheers

Gordon

by **gsf** » Thu Aug 09, 2007 2:04 pm

gfroyle wrote:What *I* want to know is more detail about the structure of the graph on the 17s... what are the degrees of the vertices (i.e. the number of neighbours that each puzzle/vertex has), which puzzles have the highest degree etc. (Could it even be that this graph is a scale-free graph in the Barabasi et al sense?)

thanks for the closure clarifications

could you provide a ref with a description of scale-free graph
there's a bunch of refs to scale-free, and its not clear which of those actually defines it

I can see that at some stage I will have to compute this darn graph myself and just confirm your numbers....

you can use my solver as a starting point
for the {-5} (5-off) initial data:

Code: Select all: sudoku -goce{-5} -Fc'%05#in %#kc' g.dat > g-5-1

where g.dat contains the 17s
g-5-1 contains lines with two fields <g-ordinal> <12-clue-minlex-subgrid>
where 12=17-5
this takes ~1h45 min to generate 3.2Ghz

g-5-1 may contain dups (typically too much data for my solver to uniq in core), so

Code: Select all: sort -u g-5-1 > g-5-2

I do one more pass on the data to convert the <12-clue-minlex-subgrid> fields to ordinals
and then use the connected-components or disjoint-set-union algs
g-5-2 is 1.5Gi, a bit to large to post

by **gsf** » Thu Aug 09, 2007 2:08 pm

gfroyle wrote:
ravel wrote:Can you explain ? I thought it would be 15:
Code: Select all
.......1.4.........2...........5.4.7..8...3....1.9....3..4..2...5.1........8.6... ....7...54..2..3...............5...78.....4....1.9....3..4......5......9...8..2..

Ravel is correct.. the Hamming distance is (at most) 15...

If I recall correctly, when gsf posted his original description of his Hamming distance implementation it was a two-step process of choosing a n equivalent puzzle in such a way as to:

- maximise overlap of positions, then
- maximise similarity of symbols in overlapping positions

you recall better than I do -- any of my code >6months old looks like someone else's
(why did they do it that way --- oops, they was me)
the code, or at least the docs, will be corrected
thanks

by **gfroyle** » Fri Aug 10, 2007 12:36 am

gsf wrote:you recall better than I do -- any of my code >6months old looks like someone else's

All that I remembered was a vague comment - when I searched for it, I found the following from a post you made on 25 May.

gsf wrote:also added a hamming-style distance based on the sudoku space thread discussions
the distance is biased towards minimizing clue position differences first, then minimizing cell value differences

Cheers

g

by **gsf** » Fri Aug 10, 2007 2:11 am

gfroyle wrote:
All that I remembered was a vague comment - when I searched for it, I found the following from a post you made on 25 May.

gsf wrote:also added a hamming-style distance based on the sudoku space thread discussions
the distance is biased towards minimizing clue position differences first, then minimizing cell value differences

after looking at the code again I know why I did it that way
matching positions is much easier than matching clue values
I'm hoping there's a better way than running p! combinations on p matching positions,
but I don't see it right now

The New Sudoku Players' Forum

17-clue and 18-clue Sudoku update