I haven't found prior discussion of this question, so to explore it I ran some experiments with random puzzle generation and one-clue relaxation. I was interested to find that the resulting solution counts did not exhibit any of the sort of patterns hypothesized above. Instead they closely follow a simple parametric distribution.
The procedure was to generate about 15 million examples using the process shown in the pseudo-code below:
- Code: Select all
GenerateValidNotNecessarilyMinimalSudoku() {
clues := {}
consequences := {}
for literal in RandomPermutation(Range(729)) {
if literal not in consequences {
switch (CountSolutions(clues U {literal}, limit=2)) {
case 0:
continue
case 1:
return clues U {literal}
case 2:
clues := clues U {literal}
consequences := UnitPropagate(clues)
}
}
}
}
samples = []
for _ in Range(NumSamples) {
clues := GenerateValidNotNecessarilyMinimalSudoku()
relaxed := clues - PickRandomElements(clues, NumToDrop)
samples += [CountSolutions(relaxed, limit=10000)]
}
With these results we can model the empirical distribution of solution counts, and it turns that they fit a lognormal distribution quite well (where we just interpret the lognormal density taken at an integer as a probability mass for that count). Here is a plot showing solution count vs. number of relaxed puzzles with that solution count for the empirical data (black) and the fitted lognormal model (green):
Kind of remarkable I think. The close fit suggests that it's reasonable to think of solution counts as arising from multiplicative contributions of many independent random factors. Actually, maybe that's not so surprising.
Of course, the data above are a mixture of 1-clue relaxations of puzzles with a range of different initial clue counts. If we condition on the initial clue count, then we get widely varying distributions of solution counts under relaxation. As you might expect, relaxed low-clue puzzles tend to have more solutions than relaxed high clue puzzles. But each of these conditional distributions is still approximately lognormal, just with different parameters. Here's how the fitted parameters vary with initial clue count:
These imply that randomly-generated-and-not-necessarily-minimal 23 clue puzzles have approximately mean=148 and median=47 solutions after 1 random relaxation, while similarly generated 31-clue puzzles have approximately mean=2, median=1 solutions after 1 random relaxation (most random puzzles with this many clues have a lot of redundancy).