The New Sudoku Players' Forum

by **tdillon** » Sun Jan 26, 2020 7:20 am

I've been curious, for invalid multi-solution puzzles, whether certain solution counts are more likely than others. For example, motivated by the example of simple deadly patterns like unique rectangles, it's tempting to guess that the solution count is more likely to be even than odd, or maybe that it's more likely to be a power of two.

I haven't found prior discussion of this question, so to explore it I ran some experiments with random puzzle generation and one-clue relaxation. I was interested to find that the resulting solution counts did not exhibit any of the sort of patterns hypothesized above. Instead they closely follow a simple parametric distribution.

The procedure was to generate about 15 million examples using the process shown in the pseudo-code below:

Code: Select all: GenerateValidNotNecessarilyMinimalSudoku() { clues := {} consequences := {} for literal in RandomPermutation(Range(729)) { if literal not in consequences { switch (CountSolutions(clues U {literal}, limit=2)) { case 0: continue case 1: return clues U {literal} case 2: clues := clues U {literal} consequences := UnitPropagate(clues) } } } } samples = [] for _ in Range(NumSamples) { clues := GenerateValidNotNecessarilyMinimalSudoku() relaxed := clues - PickRandomElements(clues, NumToDrop) samples += [CountSolutions(relaxed, limit=10000)] }

With these results we can model the empirical distribution of solution counts, and it turns that they fit a lognormal distribution quite well (where we just interpret the lognormal density taken at an integer as a probability mass for that count). Here is a plot showing solution count vs. number of relaxed puzzles with that solution count for the empirical data (black) and the fitted lognormal model (green):

Kind of remarkable I think. The close fit suggests that it's reasonable to think of solution counts as arising from multiplicative contributions of many independent random factors. Actually, maybe that's not so surprising.

Of course, the data above are a mixture of 1-clue relaxations of puzzles with a range of different initial clue counts. If we condition on the initial clue count, then we get widely varying distributions of solution counts under relaxation. As you might expect, relaxed low-clue puzzles tend to have more solutions than relaxed high clue puzzles. But each of these conditional distributions is still approximately lognormal, just with different parameters. Here's how the fitted parameters vary with initial clue count:

These imply that randomly-generated-and-not-necessarily-minimal 23 clue puzzles have approximately mean=148 and median=47 solutions after 1 random relaxation, while similarly generated 31-clue puzzles have approximately mean=2, median=1 solutions after 1 random relaxation (most random puzzles with this many clues have a lot of redundancy).

by **coloin** » Sun Jan 26, 2020 1:58 pm

Interesting ... and timely as coincidently I have been looking at this (n-1) operation on 18C [n=18] puzzles.
Its important to diferentiate the number of solutions of a sub-puzzle with the number of ways another clue can be added to give a different puzzle [more likely minimal the smaller the n]

Most (n-1) or (relaxed) puzzles are a bit lob sided , they have a few low sol. count clues and tend to have more high sol. clues.

Extreme Hard puzzles tend not to have this lob-sidedness, and their n-1 sub puzzles tend not to have many clues insertable by logic, therefore the solution counts are higher.

Remote puzzles - eg an 18C with no other puzzles within {-1+1} also tend to not have this lob sidedness and they tend not to have a low sol count for any of the clues.

Code: Select all: +---+---+---+ |1..|...|...| |...|2..|...| |...|...|35.| +---+---+---+ |.42|...|...| |...|.5.|.8.| |.9.|.1.|.6.| +---+---+---+ |..7|..4|2..| |3..|..8|...| |...|...|4.9| +---+---+---+

This remote puzzle none within {-1+1}, albeit an easy puzzle, and only 12 puzzles within {-2+2}, removing the 7 at r7c3 has 557 sol [minimum for this puzzle]

Code: Select all: +---+---+---+ |..3|...|...| |45.|...|...| |...|12.|.6.| +---+---+---+ |2.1|.4.|.7.| |.7.|..5|.9.| |...|...|...| +---+---+---+ |...|2..|..4| |.9.|8..|...| |...|...|2..| +---+---+---+

This random 18C , removing the 7 at r5c2 gives 28 sol. a {1+2} on this sub puzzle gives 50 puzzles, a full {-2+2} on the complete puzzle gives 79 puzzles ....

by **tdillon** » Mon Jan 27, 2020 3:32 am

Interesting observations, coloin. If you select puzzles with interesting properties (difficulty, remoteness, etc.) I'm sure the clue count distributions for the relaxed sub-puzzles can be affected significantly. The highly regular behavior I'm observing is surely related to the simple way I'm generating puzzles from a uniformly drawn permutation of literals (admittedly not the same as uniformly drawing from possible puzzles, but at least a simple and repeatable generative process).

BTW, here are the same charts after randomly dropping 2, 3, or 4 clues:

by **dobrichev** » Mon Jan 27, 2020 8:36 am

Hi Tom,

I see no reason to mix the results for different puzzle sizes. Drawing separate function for each size perhaps will reduce the noise by 1 - 2 orders of magnitude.

Inclusion of non-minimal start puzzles adds noise too.

I am almost sure that your puzzle generator is biased to low-clue puzzles. You can try restarting the puzzle from scratch when at predefined number of givens the puzzle is still non-unique, and compare the results.

What about using logarithmic scale on X axis too? And log(CaseCount/AllCount) on Y? Or even log(sum(CaseCount for xx <= X) / AllCount) for Y axis - a value that starts from 0% and asymptotically grows to 100%?

An interesting result, although different from what you are doing, would be comparison of the uniqueness constraint to minimality constraint for different number of clues. For such task maybe the number of restarts of the generator should be counted, and the generated puzzle (with predefined size) to be counted once as unique/non-unique, and once as minimal/non-minimal.

by **m_b_metcalf** » Mon Jan 27, 2020 8:54 am

tdillon wrote:
Code: Select all
samples += [CountSolutions(relaxed, limit=10000)]

Tom, is your limit of 10,000 arbitrary? Here are some ancient posts on megaclues.

Regards,

Mike

14 years ago coloin wrote: ....taking your first grid and removing the 1 at r5c5
Code: Select all
+---+---+---+ |...|...|13.| |...|.8.|..5| |42.|...|...| +---+---+---+ |6..|...|.2.| |..5|...|...| |...|...|4..| +---+---+---+ |.7.|4.2|...| |...|6..|2..| |...|...|..1| +---+---+---+ 1197093 sol. I didnt expect that !

14 years ago Red Ed wrote:That's a nice twist on the theme, coloin.

The search through the 17s for a megaclue is being done by gsf and myself. gsf's search will complete before mine, it seems, and I think he's collecting extra stats besides just the max grid, so you can look forward to an interesting post from him in a few days. In the mean time, here's a taster of what's out there in seventeenland: try removing r8c4 from this ...
Code: Select all
.1.|5..|... ...|.7.|.8. ...|...|..9 ---+---+--- 7..|.2.|4.. 2..|...|5.1 ...|...|... ---+---+--- 6.2|...|.4. ...|1..|3.. ...|9..|...
... you get 11339281 solutions!

10 years ago dobrichev wrote:
This is the list of 16-clue pseudo-puzzles with > 10 000 000 solutions obtained by single clue removal from Gordon's list of 49151 17s

Code: Select all
.................1..2.34..........5....6......7.....34...7...4...6.1....83....2.. 10432440 ..............1..2..3.4..5........6..2...7....7..........6..3...8......49...85... 10789663 ..............1..2..3....4.......1.5......6...26.4.........6......3.....5..78..9. 11077028 ..............1..2..3....4.........5....3.....34.6........7..8..279......5....1.. 11339281 ........1.....2.....3.4..5......61...3......67...5.........73...4.......5..8..... 12040692 ..............1..2..3..4.5........3....26.....6.............7...8.5..4..41......9 12602377 ........1.....2.....3.4..5.........6.27.8.....3....7..........8.8.1.....4....6... 13008797

by **dobrichev** » Mon Jan 27, 2020 9:24 am

Then for completeness, here are some minimal multiple-solution 35s which are expandable to at least one 36+ unique minimal puzzle and have large number of completions. From here.

Code: Select all: ...........1.23.45.42.5..31.........1.46..35.6.753.418.........2.63.518441.26.5.3 35 23268 ...........1.23.45.42.5..31.........1.46..35.6..53.417.........2.63.517441.26.583 35 23268 ...........1.23.45.42.5..31.........1..63.45.6.457.318.........2.63.51.441.26.583 35 23268 ...........1.23.45.42.5..31.........1..63.45.6.45..317.........2.63.518441.26.573 35 23268 .............12.34.34.56712...........32.814.4.8..132...........8512.46334..85271 35 21784 ...........1.23.45.42.5..31.........1.46..35.6.753.41..........2.63.518441.26.573 35 19896 ...........1.23.45.42.5..31.........1..63.45.6.457.31..........2.63.517441.26.583 35 19896 ...........1..2.34.35.46721...........348.21.1.8.2.34...........862.417331..68452 35 19296 .............12.34.34..5612...........32.714.4.7.5132...........7812.45334..78261 35 19284

by **tdillon** » Mon Jan 27, 2020 10:22 am

dobrichev wrote:I see no reason to mix the results for different puzzle sizes. Drawing separate function for each size perhaps will reduce the noise.

Hi Mladen, I agree that if we aspire to some practical utility, like the ability to predict how many solutions to expect when dropping clue from a given puzzle, then distributions conditioned on clue count are certainly much better. But here I was just exploring the fit, and interested to observe that the lognormal behavior obtains for various mixtures, and continues to fit well across several orders of magnitude of solution counts.

dobrichev wrote:Inclusion of non-minimal start puzzles adds noise too.

Starting from only minimial puzzles will actually lessen the lognormal fit since there will be no samples with solution counts of 1. It's kind of interesting that the lognormal model predicts so well the fraction of puzzles that are minimal given the way I've sampled them.

dobrichev wrote:I am almost sure that your puzzle generator is biased to low-clue puzzles.

I think you're right about the low-clue bias. I was aware that uniformly sampling permutations is not the same as uniformly sampling possible puzzles, but I hadn't thought until now about exactly how these things are different. While each permutation maps to exactly one puzzle, a given puzzle can arise from many different permutations. It's difficult to enumerate exactly, but roughly speaking for an N-clue minimal puzzle you can pick any of the N! permutations of the clues and any of the (729-N)! permutations of the other literals, and you can produce a total order by interleaving them in any way consistent with the constraints that the first literal is the first clue and every non-clue comes after the clues that imply it or negate it. The interleaving part is hard to evaluate, and maybe a wash, but the first part definitely favors low clue puzzles. e.g., ignoring the interleaving part there are 40x as many ways to pick pairs of clue/non-clue permutations for 17-clue puzzles as there are for 18-clue puzzles:

Code: Select all: > f <- function(n) exp(lgamma(n+1)+lgamma(729-n+1)-lgamma((n+1)+1)-lgamma(729-(n+1)+1)) > f(17:24) [1] 39.55556 37.42105 35.50000 33.76190 32.18182 30.73913 29.41667 28.20000

dobrichev wrote:What about using logarithmic scale on X axis too? And log(CaseCount/AllCount) on Y? Or even log(sum(CaseCount for xx <= X) / AllCount) for Y axis - a value that starts from 0% and asymptotically grows to 100%?

I initially looked at the log/log scale when exploring power-law distributions. But the power-law fit is not as good. It might be worth exploring other things, but I stopped at the lognormal because the fit is so good and because the generative model is plausible.

by **tdillon** » Mon Jan 27, 2020 10:25 am

m_b_metcalf wrote:Tom, is your limit of 10,000 arbitrary? Here are some ancient posts on megaclues.

Wow! The megaclue puzzles are impressive!

Yes, the limit was arbitrary. I was trying to get lots of samples without too much runtime, so I didn't want to get bogged down spending a lot of time on the slow ones, especially when the distribution will be too undersampled at really high solution counts to be useful in fitting the model.

by **tdillon** » Mon Jan 27, 2020 10:29 am

Looking back I realize I didn't quite describe my generation process correctly. As described above it will add a literal as a clue as long as it isn't inconsistent with what's been added before, which would tend to produce lots of redundancy, but in practice I don't add clues that the solver already recognizes as consequences of previous ones. [edited pseudocode above to reflect this]

by **coloin** » Wed Jan 29, 2020 10:03 am

Indeed true random solution grid generation and true random puzzle generation has taxed many learned members over the years.

puzzles for amusement !

However, minimality and size of puzzle has a very much more direct effect on your results whether removing one , or more, clues, as mentioned.

Remoteness of puzzles, is probably only an observable phenomenen in 21 clue puzzles and below. We know these puzzles are on the edges of the "sudoku puzzle space universe", but at least it is finite !!

Looking at the 17 puzzles - paradoxically many of them are easy and/because they tend to be lobsided [ cf the megacclue] . In many of them their 14 clue subpuzzles have a surprisingly low solution count and have those infered [ singles] clues which partially complete the puzzle. As they have easy clues inserted early on ... these puzzle tend to be easier to solve later on.
I would imagine your data would show differences in megaclue puzzles, remote puzzles, hard and easy puzzles - when compared to all 17C puzzles.

by **tdillon** » Wed Jan 29, 2020 5:51 pm

Hi coloin, thanks for the link. Lots of interesting things to absorb there!

The New Sudoku Players' Forum

Solution counts after random generation, 1 clue relaxation

Solution counts after random generation, 1 clue relaxation

Re: Solution counts after random generation, 1 clue relaxati

Re: Solution counts after random generation, 1 clue relaxati

Re: Solution counts after random generation, 1 clue relaxati

Re: Solution counts after random generation, 1 clue relaxati

Re: Solution counts after random generation, 1 clue relaxati

Re: Solution counts after random generation, 1 clue relaxati

Re: Solution counts after random generation, 1 clue relaxati

Re: Solution counts after random generation, 1 clue relaxati

Re: Solution counts after random generation, 1 clue relaxati

Re: Solution counts after random generation, 1 clue relaxati