coloin wrote:With canicalization gsf can get ...

what kind of strange animal can he get ?

JPF

gfroyle wrote:gsf wrote:so the equivalence classes are easy to derive

but this does not guarantee that all {-2+2} members are represented in each class

that would take the longer computation of generating each class by applying {-2+2}* to one member of each class

Seems too fast to me....

Let me clarify my understanding of what is happening

(1) Given a puzzle P, you can generate the list of 136 15-clue pseudo-puzzles - call that L(P)

(2) Two puzzles P and Q are DIRECTLY {-2+2} related if and only if L(P) and L(Q) contain at least one pseudo-puzzle in common

But now I cannot see how you can quickly compute equivalence classes from this information ... essentially we need to know all the DIRECT relationships (which seems to require checking L(P_i) against L(P_j) for all i \not= j)....

I'm sure I have missed something, probably obvious...

thanks for the back of the envelope skepticism

I had taken too simple an approach and did not reach closure

I added a check pass after the computation loop to verify your suspicions

basically one pass over the data gave a distance 1 approximation to the true closure

by making the first pass iterative the approximation from each pass is refined

15 rounds of refinement finally reached closure for |G| = 41176 puzzles in

real 1h21m50.23s

user 1h21m34.99s

sys 0m14.77s

this was done by a ksh script -- each round does lookups from scratch so it could be improved

the equivalence class sizes confirm some suspicions I have about some of the alternate

background processes I have running (which have all been stuck, apparently on the same class,

attempting to generate the closure of that one class)

the 1h21m computation shows why

here is a table of the #classes #members-in-class for the first 41176 puzzles in G

- Code: Select all
`8507 1`

2665 2

808 3

490 4

232 5

155 6

82 7

78 8

42 9

39 10

13 11

23 12

13 13

9 14

10 15

14 16

5 17

5 18

5 19

2 20

3 21

3 22

3 23

2 24

4 25

3 26

2 27

1 28

2 30

1 31

1 32

1 33

1 35

1 41

2 44

2 46

1 47

1 61

1 73

1 77

1 81

1 16244

that last whopper, 1 class containing 16244 puzzles, is at the root of the direct closure stalls

looks like we have Pangaea and a bunch of micro islands, most of which have equivalence class size 1

this is for the {-2+2} data

I'll see if crunching the {-3+3} data is tractable

initial guess is it should take 10x longer

- gsf
- 2014 Supporter
**Posts:**7306**Joined:**21 September 2005**Location:**NJ USA

Indeed, well done.

These results are of course on the known 17 puzzles.

Which of the classes have been demonstrated to be closed ?

I am suspicious of some of the 17s #35000-#40000 which i am sure have not been subjected to a full{-2+2}. These will come out in the wash in due course !

However, I am still trying to understand why there should be 81 17-puzzles with over 10000 15-subpuzzles.....and not one is the same as the 2000000 different 15-subpuzzles in our Pangaea [P] class.

C

These results are of course on the known 17 puzzles.

Which of the classes have been demonstrated to be closed ?

I am suspicious of some of the 17s #35000-#40000 which i am sure have not been subjected to a full{-2+2}. These will come out in the wash in due course !

However, I am still trying to understand why there should be 81 17-puzzles with over 10000 15-subpuzzles.....and not one is the same as the 2000000 different 15-subpuzzles in our Pangaea [P] class.

C

- coloin
**Posts:**1638**Joined:**05 May 2005

JPF wrote:I'd love to see one of this equivalence class ; the one with 81 puzzles ?

sorry for the delay

I had composed the reply and apparently forgot to submit

the {1}* ({-1+1} closure) and {2}* results for G index 1 through 41176 are posted at G.1c.gz and G.2c.gz

the line format is

- Code: Select all
`puzzle G-index class-index`

where G-index is the index assigned by Gordon and class-index is the min G-index of all members in the class

the files are sorted by < class-index G-index >

this unix pipeline lists the size of each class

- Code: Select all
`cut -d' ' -f3 G.2c | sort | uniq -c | sort -n`

which shows that the class with 81 members is 06892

this unix command lists that class

- Code: Select all
`grep ' 06892$' G.2c`

kjellfp wrote:How about {1}* ? Does it also contain a Pangaea, or is it split into several continents?

here are the #classes and class-size for {1}* -- mostly isolated

- Code: Select all
`19924 1`

5013 2

1078 3

622 4

236 5

133 6

75 7

50 8

32 9

23 10

21 11

19 12

9 13

7 14

1 15

7 16

2 17

2 18

3 19

4 20

1 21

2 22

2 23

2 24

1 26

3 27

1 28

1 29

2 30

1 33

1 37

1 45

1 53

3 60

1 87

1 257

computed by

- Code: Select all
`cut -d' ' -f3 G.1c | sort | uniq -c | sort -n | sed -e 's/^ *//' -e 's/ .*//' | uniq -c`

- gsf
- 2014 Supporter
**Posts:**7306**Joined:**21 September 2005**Location:**NJ USA

a few pm's and double/triple checking show that the equivalence classes I posted were

based on data with some missing connections

which resulted in too many classes, i.e., some classes I labelled as distinct are in reality one class

(some inappropriate {-n+m} pruning slippped into the {-n} subgrid generation)

I believe that the raw closure code in my solver is correct

its just data derived from that that has the problem

I'll post data corrections by tomorrow

and also the methodolgy so the results can be replicated -- or refuted

this is an interesting offshoot because data size for {-3+3} and up is large enough to require

external store or other space saving algorithm adjustments

based on data with some missing connections

which resulted in too many classes, i.e., some classes I labelled as distinct are in reality one class

(some inappropriate {-n+m} pruning slippped into the {-n} subgrid generation)

I believe that the raw closure code in my solver is correct

its just data derived from that that has the problem

I'll post data corrections by tomorrow

and also the methodolgy so the results can be replicated -- or refuted

this is an interesting offshoot because data size for {-3+3} and up is large enough to require

external store or other space saving algorithm adjustments

- gsf
- 2014 Supporter
**Posts:**7306**Joined:**21 September 2005**Location:**NJ USA

well the update to the equivalence partitioning has taken longer than expected

the data sizes require a bit more than brute force

it turns out that determining equivalence classes is equivalent to finding connected components in a graph,

and that is linear time complexity

what took so long is that for |G| = 41324 the {4}* ({-4+4} closure) defines a graph with 98,351,120 edges,

with a graph data representation (using at&t's graphviz tools) file size of 8.1Gi

linear or not, that size input can bog down a connected components implementation

with the right massaging and a home brew algorithm I was able to compute the closure in 1 min

40min to compute the {4}* data using my solver, 16min to massage, 1min to compute connected components

graphviz ccomps took 72min on the massaged data

you can generate the initial data with my solver version 2007-08-01

(41324 input 17's produces 8.1Gi edge data -- change the 4 to 1,2,3 for different on/off edge data)

the graphviz ccomps command and my home brew agree for {1}*, but it would be good to get

independent corroboration on the equivalence class counts and sizes

here are the #classes/class-size for {1}* {2}* {3}* and {4}*:

e.g., for {4}* there are 12 classes of size 1, 3 classes of size 2, and one class of size 41306

so what does this mean

first of all, the data posted above is empirical, based on the 17 catalog G with |G|=41324

the {4}* data shows that the catalog as it stands looks like it may be closed under {-4+4}

i.e., every puzzle in G may be within {-4+4} of every other puzzle

this could be proven by example by selecting a puzzle from the size 41306 class and

doing a {-4+4} closure and seeing if the other classes are absorbed (would that we could compute that expediently)

on the tractable side, it might mean that starting with puzzles in the other 15 classes might produce one or

more new 17's that bridge to the pangaea class

here are the puzzle indices and equivalence class label (lowest puzzle index in the class)

for the {4}* non-pangaea classes:

e.g., { 40609 40610 } are in the 40609 {4}* class and 41100 is in a singleton class

the data sizes require a bit more than brute force

it turns out that determining equivalence classes is equivalent to finding connected components in a graph,

and that is linear time complexity

what took so long is that for |G| = 41324 the {4}* ({-4+4} closure) defines a graph with 98,351,120 edges,

with a graph data representation (using at&t's graphviz tools) file size of 8.1Gi

linear or not, that size input can bog down a connected components implementation

with the right massaging and a home brew algorithm I was able to compute the closure in 1 min

40min to compute the {4}* data using my solver, 16min to massage, 1min to compute connected components

graphviz ccomps took 72min on the massaged data

you can generate the initial data with my solver version 2007-08-01

- Code: Select all
`sudoku -goce{-4} -Fc'%05#in %#kc' g.dat > g.edges`

(41324 input 17's produces 8.1Gi edge data -- change the 4 to 1,2,3 for different on/off edge data)

the graphviz ccomps command and my home brew agree for {1}*, but it would be good to get

independent corroboration on the equivalence class counts and sizes

here are the #classes/class-size for {1}* {2}* {3}* and {4}*:

- Code: Select all
`=== 41324 {1}* ===`

18950 1

4306 2

1095 3

645 4

259 5

170 6

97 7

59 8

47 9

30 10

26 11

18 12

16 13

8 14

3 15

9 16

5 17

5 18

4 19

4 20

1 21

3 22

1 23

5 24

5 25

2 27

1 28

1 30

2 31

2 32

1 33

1 37

1 38

2 41

1 42

1 50

1 55

1 87

1 173

1 560

1 616

=== 41324 {2}* ===

7734 1

2214 2

724 3

424 4

213 5

135 6

70 7

65 8

39 9

29 10

17 11

17 12

12 13

7 14

5 15

12 16

5 17

4 18

7 19

2 20

1 21

3 22

6 23

4 24

1 25

1 26

1 28

2 30

1 31

1 32

1 33

3 35

1 39

1 43

1 44

1 56

1 58

1 64

1 65

1 70

1 80

2 82

1 19182

=== 41324 {3}* ===

1205 1

240 2

62 3

38 4

23 5

16 6

5 7

1 8

2 9

1 10

2 11

1 12

1 13

1 22

1 38950

=== 41324 {4}* ===

12 1

3 2

1 41306

e.g., for {4}* there are 12 classes of size 1, 3 classes of size 2, and one class of size 41306

so what does this mean

first of all, the data posted above is empirical, based on the 17 catalog G with |G|=41324

the {4}* data shows that the catalog as it stands looks like it may be closed under {-4+4}

i.e., every puzzle in G may be within {-4+4} of every other puzzle

this could be proven by example by selecting a puzzle from the size 41306 class and

doing a {-4+4} closure and seeing if the other classes are absorbed (would that we could compute that expediently)

on the tractable side, it might mean that starting with puzzles in the other 15 classes might produce one or

more new 17's that bridge to the pangaea class

here are the puzzle indices and equivalence class label (lowest puzzle index in the class)

for the {4}* non-pangaea classes:

- Code: Select all
`00506 00506`

01467 01467

01835 01835

12703 12703

14037 14037

20069 20069

21276 21276

22575 22575

30563 30563

30564 30563

32043 32043

40609 40609

40610 40609

40611 40611

40734 40734

41077 41077

41214 41077

41100 41100

e.g., { 40609 40610 } are in the 40609 {4}* class and 41100 is in a singleton class

- gsf
- 2014 Supporter
**Posts:**7306**Joined:**21 September 2005**Location:**NJ USA

an update on the order of the closure for G

a colleague introduced me to a linear time (in number of edges) algorithm for connected components

and another related algorithm for disjoint set union (the basic algorithms behind the closure computations)

using two different algorithms I just confirmed that G, |G|=41347, is closed under {-5+5}

the {5}* data took 1h44m to generate, the size is 21Gi

eliminating dup records took another 53m and brought it down to 1.5Gi:

104557112 edges, 36151486 inequivalent 12-clue subgrids (treated as vertices), and 41347 inequivalent grids (also treated as vertices)

both algorithms took 1m13s to compute the closure, resulting in 1 equivalence class

remarkable that two different algorithms, one linear, and one (possibly) quadratic in the worst case, matched runtimes so closely

this doesn't necessirily mean that the set of all 17 sudoku is closed under {-5+5}

there could be another 17 or group of 17's, yet to be discovered, in a separate partition

do we have an idea of the number of inequivalent 12-clue subgrids?

that may give a handle on how close the 41347 (which cover 36151486) are to being complete

a colleague introduced me to a linear time (in number of edges) algorithm for connected components

and another related algorithm for disjoint set union (the basic algorithms behind the closure computations)

using two different algorithms I just confirmed that G, |G|=41347, is closed under {-5+5}

the {5}* data took 1h44m to generate, the size is 21Gi

eliminating dup records took another 53m and brought it down to 1.5Gi:

104557112 edges, 36151486 inequivalent 12-clue subgrids (treated as vertices), and 41347 inequivalent grids (also treated as vertices)

both algorithms took 1m13s to compute the closure, resulting in 1 equivalence class

remarkable that two different algorithms, one linear, and one (possibly) quadratic in the worst case, matched runtimes so closely

this doesn't necessirily mean that the set of all 17 sudoku is closed under {-5+5}

there could be another 17 or group of 17's, yet to be discovered, in a separate partition

do we have an idea of the number of inequivalent 12-clue subgrids?

that may give a handle on how close the 41347 (which cover 36151486) are to being complete

- gsf
- 2014 Supporter
**Posts:**7306**Joined:**21 September 2005**Location:**NJ USA

Great work gsf, very impressive!

I know what you are thinking, but I can shake the feeling that we still don't know how many 17 can come out of the 36151486 subgrids. I am sure that if you did a "5on" on all those grids (would take forever, i know) would produce a lot more 17 than we already have. And since we are not aware of the "completeness" of our 17 collection in relation to the 12-subgrids, I am not sure a number of "total possible 12 subgrids" would be able to tell us much?

The same calculation done for 15 subgrids would be much more accurate, since we are pretty sure that we have all 17 that can come out of them. And calcultating the total number of 15 subgrids could even be easier then 12, since a lot more restrictions in constructing it will apply?

But these are just rantings... Great work again!

Havard

gsf wrote:do we have an idea of the number of inequivalent 12-clue subgrids?

that may give a handle on how close the 41347 (which cover 36151486) are to being complete

I know what you are thinking, but I can shake the feeling that we still don't know how many 17 can come out of the 36151486 subgrids. I am sure that if you did a "5on" on all those grids (would take forever, i know) would produce a lot more 17 than we already have. And since we are not aware of the "completeness" of our 17 collection in relation to the 12-subgrids, I am not sure a number of "total possible 12 subgrids" would be able to tell us much?

The same calculation done for 15 subgrids would be much more accurate, since we are pretty sure that we have all 17 that can come out of them. And calcultating the total number of 15 subgrids could even be easier then 12, since a lot more restrictions in constructing it will apply?

But these are just rantings... Great work again!

Havard

- Havard
**Posts:**377**Joined:**25 December 2005

Havard wrote:gsf wrote:do we have an idea of the number of inequivalent 12-clue subgrids?

that may give a handle on how close the 41347 (which cover 36151486) are to being complete

I know what you are thinking, but I can shake the feeling that we still don't know how many 17 can come out of the 36151486 subgrids. I am sure that if you did a "5on" on all those grids (would take forever, i know) would produce a lot more 17 than we already have. And since we are not aware of the "completeness" of our 17 collection in relation to the 12-subgrids, I am not sure a number of "total possible 12 subgrids" would be able to tell us much?

basically some back of the envelope thoughts

take it to extremes

suppose there were exactly 36151486 12 clue subgrids

then we would know that there is only 1 {-5+5} 17 clue equivalence class

now in the continuum of estimates for the number of 12 clue subgrids

at some point, maybe some order of magnitude times 36151486, we might be swayed to think there were more than one {-5+5} equivalence class

so I'm just grabbing for any hints on how to direct the 17 search

- gsf
- 2014 Supporter
**Posts:**7306**Joined:**21 September 2005**Location:**NJ USA

gsf wrote:using two different algorithms I just confirmed that G, |G|=41347, is closed under {-5+5}

So unknown 17s have Hamming distance d > 10 from all members of (code) set G. That would seem to be a rather significant result. Are there any pairs of set members with d = 34?

- ronk
- 2012 Supporter
**Posts:**4764**Joined:**02 November 2005**Location:**Southeastern USA

ronk wrote:gsf wrote:using two different algorithms I just confirmed that G, |G|=41347, is closed under {-5+5}

So unknown 17s have Hamming distance d > 10 from all members of (code) set G ...

... if/when we are able to compute the {-5+5} closure of G

ronk wrote:Are there any pairs of set members with d = 34?

this computes hamming distance in my solver:

- Code: Select all
`sudoku -C p1.dat p2.dat`

I believe its correct, but its fairly compute intensive, so computing all 40K*40K will take too long

I checked the distance between the {4}* outliers and the max is

edit a subsequent post noted a problem with the distance algorithm -- it has been fixed and resultes corrected here

- Code: Select all
`distance 16`

......6.4....81.........7...5.6..8...4...7......2.....2.7....1.1...3.......5.....

..3...........1...8....97.....6..8...4.....5..1.......2.7...91....4..6.....5.....

the largest known distance in G (a feeble search, only 2299658 puzzle pairs)

edit feeble search data elided because it was based on a flawed distance measure

Last edited by gsf on Thu Aug 09, 2007 11:19 am, edited 1 time in total.

- gsf
- 2014 Supporter
**Posts:**7306**Joined:**21 September 2005**Location:**NJ USA

gsf wrote:ronk wrote:gsf wrote:using two different algorithms I just confirmed that G, |G|=41347, is closed under {-5+5}

So unknown 17s have Hamming distance d > 10 from all members of (code) set G ...

... if/when we are able to compute the {-5+5} closure of G

Color me confused, as I thought that's what your first statement above said.

- ronk
- 2012 Supporter
**Posts:**4764**Joined:**02 November 2005**Location:**Southeastern USA