## 17-clue and 18-clue Sudoku update

Everything about Sudoku that doesn't fit in one of the other sections
ronk wrote:
gsf wrote:
ronk wrote:
gsf wrote:using two different algorithms I just confirmed that G, |G|=41347, is closed under {-5+5}

So unknown 17s have Hamming distance d > 10 from all members of (code) set G ...

... if/when we are able to compute the {-5+5} closure of G

Color me confused, as I thought that's what your first statement above said.

aha
I had to rewrite a few times before sending -- and your confusion confirms I didn't get it quite right
there's too many closures in scope

the computation I did was to take G (the known 17s), do a {-5} (5-off) to derive all 12 clue subgrids,
and then did a closure based on the subgrids using a connected components and disjoint set union algorithms
this particular closure does not generate new 17sbut instead determines paths from any known 17 to any other known 17 via the 12s
this closure yielded one equivalence class
which means that all of the known 17s are in the {-5+5} closure
I think it was coloin that mentioned the possibility of this computation a few posts back

its neat that we can assert this without computing the entire {-5+5} closure

so my previous reply was based on this interpretation of unknown:
since we haven't confirmed the {-5+5} closure of G, some unknowns could be in that closure
there could also be other unknowns outside the closure or the {-5+5} closure could contain all 17s
gsf
2014 Supporter

Posts: 7306
Joined: 21 September 2005
Location: NJ USA

gsf wrote:I had to rewrite a few times before sending -- and your confusion confirms I didn't get it quite right
there's too many closures in scope

Indeed...

Basically the class is CLOSED if it contains *all* 17s that can be obtained from any of the others by a "move" (in your case {-5+5} but the concept is general).

The class is CONNECTED if for any two 17s you can move from one to the other via an entire *sequence* of moves using other 17s as intermediate steps.

What we know from Glenn's work is that this graph is connected, even though we may not know the entire closure... basically at the moment we can view our existing collection of 17s as the "visible" portion of a possibly much larger "submerged" graph with new vertices waiting to be found. Doing a full {-5+5} on any vertex (impossible I know) is equivalent to completely hauling THAT VERTEX and all of its neighbours out of the water, thus possibly revealing new vertices.

What *I* want to know is more detail about the structure of the graph on the 17s... what are the degrees of the vertices (i.e. the number of neighbours that each puzzle/vertex has), which puzzles have the highest degree etc. (Could it even be that this graph is a scale-free graph in the Barabasi et al sense?)

I can see that at some stage I will have to compute this darn graph myself and just confirm your numbers....

Cheers

Gordon
gfroyle

Posts: 214
Joined: 21 June 2005

gsf wrote:
Code: Select all
`distance 20.......1.4.........2...........5.4.7..8...3....1.9....3..4..2...5.1........8.6......2...1.4...3.6...............5.3....1....7.6...4....5.....4...8.7........1...2.`
Can you explain ? I thought it would be 15:
Code: Select all
`.......1.4.........2...........5.4.7..8...3....1.9....3..4..2...5.1........8.6.......7...54..2..3...............5...78.....4....1.9....3..4......5......9...8..2..`
ravel

Posts: 998
Joined: 21 February 2006

ravel wrote:Can you explain ? I thought it would be 15:
Code: Select all
`.......1.4.........2...........5.4.7..8...3....1.9....3..4..2...5.1........8.6.......7...54..2..3...............5...78.....4....1.9....3..4......5......9...8..2..`

Ravel is correct.. the Hamming distance is (at most) 15...

If I recall correctly, when gsf posted his original description of his Hamming distance implementation it was a two-step process of choosing a n equivalent puzzle in such a way as to:

- maximise overlap of positions, then
- maximise similarity of symbols in overlapping positions

What Ravel's example shows us is that a smaller overall distance can be obtained WITHOUT MAXIMUM OVERLAP provided you can make up for it with greater symbol similarity... noticed that the first representative overlaps in 11 positions, while the 2nd overlaps in only 10, but almost all of those 10 are "same symbol".

Great example...

Cheers

Gordon
gfroyle

Posts: 214
Joined: 21 June 2005

gfroyle wrote:What *I* want to know is more detail about the structure of the graph on the 17s... what are the degrees of the vertices (i.e. the number of neighbours that each puzzle/vertex has), which puzzles have the highest degree etc. (Could it even be that this graph is a scale-free graph in the Barabasi et al sense?)

thanks for the closure clarifications

could you provide a ref with a description of scale-free graph
there's a bunch of refs to scale-free, and its not clear which of those actually defines it
I can see that at some stage I will have to compute this darn graph myself and just confirm your numbers....

you can use my solver as a starting point
for the {-5} (5-off) initial data:
Code: Select all
`sudoku -goce{-5} -Fc'%05#in %#kc' g.dat > g-5-1`

where g.dat contains the 17s
g-5-1 contains lines with two fields <g-ordinal> <12-clue-minlex-subgrid>
where 12=17-5
this takes ~1h45 min to generate 3.2Ghz

g-5-1 may contain dups (typically too much data for my solver to uniq in core), so
Code: Select all
`sort -u g-5-1 > g-5-2`

I do one more pass on the data to convert the <12-clue-minlex-subgrid> fields to ordinals
and then use the connected-components or disjoint-set-union algs
g-5-2 is 1.5Gi, a bit to large to post
gsf
2014 Supporter

Posts: 7306
Joined: 21 September 2005
Location: NJ USA

gfroyle wrote:
ravel wrote:Can you explain ? I thought it would be 15:
Code: Select all
`.......1.4.........2...........5.4.7..8...3....1.9....3..4..2...5.1........8.6.......7...54..2..3...............5...78.....4....1.9....3..4......5......9...8..2..`

Ravel is correct.. the Hamming distance is (at most) 15...

If I recall correctly, when gsf posted his original description of his Hamming distance implementation it was a two-step process of choosing a n equivalent puzzle in such a way as to:

- maximise overlap of positions, then
- maximise similarity of symbols in overlapping positions

you recall better than I do -- any of my code >6months old looks like someone else's
(why did they do it that way --- oops, they was me)
the code, or at least the docs, will be corrected
thanks
gsf
2014 Supporter

Posts: 7306
Joined: 21 September 2005
Location: NJ USA

gsf wrote:you recall better than I do -- any of my code >6months old looks like someone else's

All that I remembered was a vague comment - when I searched for it, I found the following from a post you made on 25 May.

gsf wrote:also added a hamming-style distance based on the sudoku space thread discussions
the distance is biased towards minimizing clue position differences first, then minimizing cell value differences

Cheers

g
gfroyle

Posts: 214
Joined: 21 June 2005

gfroyle wrote:
All that I remembered was a vague comment - when I searched for it, I found the following from a post you made on 25 May.

gsf wrote:also added a hamming-style distance based on the sudoku space thread discussions
the distance is biased towards minimizing clue position differences first, then minimizing cell value differences

after looking at the code again I know why I did it that way
matching positions is much easier than matching clue values
I'm hoping there's a better way than running p! combinations on p matching positions,
but I don't see it right now
gsf
2014 Supporter

Posts: 7306
Joined: 21 September 2005
Location: NJ USA

My program needs about the double time than yours.
I go through all (2*6^8) equivalent patterns of one puzzle and - if the number of common cells is small enough for allowing a smaller distance than the best one so far - i try to map as much numbers as possible.

For your other sample i get distance 15:
Code: Select all
`......6.4....81.........7...5.6..8...4...7......2.....2.7....1.1...3.......5...........7.4.6..8.....3.9......516...9..4....2...........2.7....1.............5.4...`
ravel

Posts: 998
Joined: 21 February 2006

ravel wrote:My program needs about the double time than yours.
I go through all (2*6^8) equivalent patterns of one puzzle and - if the number of common cells is small enough for allowing a smaller distance than the best one so far - i try to map as much numbers as possible.

right, the first part looks like a canonicalization loop
but what does i try to map as much numbers as possible look like?
if there are p common positions between the two, how many clue assignment combinations must be tested?
gsf
2014 Supporter

Posts: 7306
Joined: 21 September 2005
Location: NJ USA

I only try placements, that definitely increase the number of common clues. In the above sample this were between [edit:] 208 and 957 single number placements leading to 96-416 mappings (with posssibly less than 9 numbers, then the others can be filled in arbitrarily), when i checked 518 puzzles with 9-11 common cells.

Of course this is much slower for high clue puzzles. When comparing these 38's
Code: Select all
`.........9..26...44.6.8...3....5..1.14..985.25.9.124382.5..6.416.4.213.531.5....6.7.......9.1.87.4.4.6.91.871......2..4.9.2.5.5...1.4..2.5.7...4614.29.7578.15.26.`
i got a distance of 30:
Code: Select all
`.3...5..99..21..3.1.......5.......6.3...685.28.6.524.3..3..69.16.9.2135441.5.9.86`
after checking 67162 puzzles with between 102 and 14261 single placements and 0 to 7342 possible mappings then.
Last edited by ravel on Fri Aug 10, 2007 9:39 am, edited 1 time in total.
ravel

Posts: 998
Joined: 21 February 2006

thanks for the details
gsf
2014 Supporter

Posts: 7306
Joined: 21 September 2005
Location: NJ USA

I tweaked my sudoku17 query script to include per-discoverer counts and rates based on the
time of first and last post
the overall rate since #40095 is 1 new entry every 1h22m
Code: Select all
`          Håvard Graff   354  1h57m          Gordon Royle   339  5h14m                   JPF   161  5h21m                   gsf   132  3h49m         Kohei Noshita    83 17h38m  Lars Petter Endresen    67  3h56m                coloin    46 14h05m          Kevin Burley    44 10h21m            Jim Wilson    29 14h53m                    RW    17  9h52m          Glenn Fowler    11  4d21h             Anonymous     9  1w00d       Mauricio Chacón     8  1d18h        Colin Colville     4  1d18h                           3  1w06d                 ravel     1      -            jim wilson     1      -       Helene Foliguet     1      -                 total  1310  1h22m`
gsf
2014 Supporter

Posts: 7306
Joined: 21 September 2005
Location: NJ USA

Scale-free just means that the distribution of vertex degrees roughly follows a "power law" distribution..

So, lots of vertices of degree 1, fewer of degree 2, and a teeny number of high degree vertices.

Basically try to fit a function of the form

N(k)= A k^-B

to describe the number of vertices of degree k.
gfroyle

Posts: 214
Joined: 21 June 2005

ravel wrote:When comparing these 38's
Code: Select all
`.........9..26...44.6.8...3....5..1.14..985.25.9.124382.5..6.416.4.213.531.5....6.7.......9.1.87.4.4.6.91.871......2..4.9.2.5.5...1.4..2.5.7...4614.29.7578.15.26.`
i got a distance of 30

I think that similarity is a more interesting measure than distance. Now wait! That's not just a "subtract from 81"-type comment. The two measures are quite different creatures (perhaps this has already been mentioned). Let ...
E = nr cells that are empty in both puzzles
S = nr occupied cells that are the same in both puzzles
D = nr occupied cells that are different in both puzzles
T = nr cells that have different type (occupied/empty) in the two puzzles
... so E + S + D + T = 81. You've been working with distance = D + T. Now I can see that that's interesting from an edit-distance / neighbourhood tour point of view, but I say that you should also be interested in the measure similarity = S.

Here's why. We can regard a puzzle as a graph with each vertex = a value in a cell and each edge = relationship between them (e.g. same value, same band; or different value, same row, different stack). Viewed that way, two puzzles are isomorphic iff their graphs are equal (modulo labelling) -- in fact this is how Gordon said that he checks whether or not a 17 is new. If two puzzles are not isomorphic then we should be interested in the size of the largest subgraph (= sub-puzzle) in common to both. That's "similarity" as defined above.

For example, comparing ravel's two 38s:
Code: Select all
`       Target: .........9..26...44.6.8...3....5..1.14..985.25.9.124382.5..6.416.4.213.531.5....6 Permed (new): .3...5..99..21..3.1.......5.......6.3...685.28.6.524.3..3..69.16.9.2135441.5.9.86    distance = 30 ***  similarity = 16       Target: .........9..26...44.6.8...3....5..1.14..985.25.9.124382.5..6.416.4.213.531.5....6 Permed (new): ....6..9.94.2...8...6.8.2...6835.91...5.98..24.9612.585.3.46.2.6.4.21.35...5.....    distance = 31  similarity = 20 ***`
The distance measure in this case says that you can get from one puzzle to (an isomorph of) the other by editing 30 cells; whereas the similarity measure says that, up to (a different) isomorphism, the largest sub-puzzle in common has 20 clues.

Both measures are interesting: we shouldn't just limit ourselves to "distance".

Now a challenge:
• The biggest distance I've seen between a pair of 17s is 16.
• The smallest similarity I've seen between a pair of 17s is 8.
Can anyone do better? (I only checked a smallish random sample.)

~~~

PS: regarding this ...
gsf wrote:I'm hoping there's a better way than running p! combinations on p matching positions,
but I don't see it right now
There is a much better way: it's called the Hungarian (or Munkres') Assignment Algorithm.
Red Ed

Posts: 633
Joined: 06 June 2005

PreviousNext