## Need help in some probability calculations

Everything about Sudoku that doesn't fit in one of the other sections

### Need help in some probability calculations

I am trying to code a faster Grid Checker, similar to the one at http://www.math.ie/checker.html
The goal is finding all valid puzzles with a given number of clues within a given solution grid.

Here is the algorithm:
1) Find several thousands of unavoidable sets (UA).
2) Select a leading sequence of cells based on disjoint UA. This is [Issue 1].
3) Remap the grid to the selected sequence.
4) Perform a full enumeration of puzzles using some optimizations by UA.
5) Check for uniqueness all the puzzles which passed the optimization test.

Details.

Enumeartion is performed by placing all the givens on the rightmost positions, then moving them (traversing trough all possible combinations) to the leftmost positions.

Basic (naive) approach example for 5 givens

Code: Select all
`Possible positions for the leftmost clue.............................................................................0000The rightmost 5 - 1 = 4 cells are reserved for the rest (less significant) clues.Initial position of the leftmost clue00000000000000000000000000000000000000000000000000000000000000000000000000001xxxxFinal position of the leftmost clue1xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxThe positions marked with "x" are available for the rest of the clues.Start of enumeration000000000000000000000000000000000000000000000000000000000000000000000000000011111000000000000000000000000000000000000000000000000000000000000000000000000000101111000000000000000000000000000000000000000000000000000000000000000000000000000110111...111101000000000000000000000000000000000000000000000000000000000000000000000000000111110000000000000000000000000000000000000000000000000000000000000000000000000000End of enumeration`

Optimization 1
Order the UA by the rightmost cell position and move each clue to hit the rightmost cell of the first not hit UA set.

Example after Optimization 1
Code: Select all
`Ordered list of the UA not hit000001000100010001000001000001000000000000000000000000000000000000000000000000000000100100010001000100000100110001000000000000000000000000000000000000000000000000...Leftmost clue is placed at the rightmost position of the first UA.000001000100010001000001000001000000000000000000000000000000000000000000000000000 #topmost UA00000000000000000000000000000000000000000000000000000000000000000000000000001xxxx #non-optimized000000000000000000000000000001xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx #optimized`

Optimization 2
Since the order of the enumeration is independent of the order of the cells in the grid, it is possible to rearrange (map) the cell indexes in some more convenient order.
Currently the cells are ordered by the first of the Maximal Disjoint Sets having minimum combinations.
In other words, a list of Maximal Disjoint Sets is generated (Maximal Cliques). If there is more than one such set, the number of combinations of each set is calculated (product of the number of cells in each UA) and one with minimum combinations is chosen. This is not neccesarily the best choice [Issue 2].
The cells of the smaller UA are placed at right, respectively the largest UA is placed at the leftmost.
A maximal position for each of the clues is calculated, so that at right at this position there are enough clues to hit the disjoined UA sets.

Example after Optimization 2
Let maximal disjoint set consist ot U4+U4+U6+U40, MCN=4, 54 cells occupied.
Remapping is done so the remapped cel positions become in this order:
Code: Select all
`<...rest of the cells...><U40 cells><U6 cells><U4 cells><U4 cells>`

For clue 1 (rightmost one) the limit is 3 (the valid position range is [0,1,2,3].
For clue 2 the leftmost possible position is 3 + 4 = 7.
For clue 3 the position limit is 7 + 6 = 13.
For clue 4 the limit is 13 + 40 = 53.
For clue 5 there is no limit (= 80).

TheÑ€e are "degrees of freedom" currently not used in this optimization.
It is obvious that for maximum benefit from Optimization 2, as small as possible and as much as possible UA should be placed at right.
Is it better to choose the Maximal Disjoint Set for the mapping, or some non-maximal but with more small sized UA? [Issue 1]
If there are more than one similar Disjoint Set candidates, how to weight them and choose the best one? [Issue 2]
How the UA of equal size to be ordered while determining the mapping? [Issue 3]
How the cells within each UA to be ordered? [Issue 4]
How the remaining cells (these that are not members of the chosen Disjoint Set) to be ordered? [Issue 5]

One way to resolve the issues above is to collect the possible candidates and to rate them by the estimated benefit.

For example it is easy to calculate the benefit of upper limit for each optimized step.
Each skipped due to the optimization 2 position of the rightmost clue is weighted by 1.
The second clue steps are weighted by size of the rightmost (smallest) UA, in our case by 4.
The third's weight is a product of the second and next UA size, in our case 4 * 4 = 16.
Next is 16 * 6 = 96, next is 96 * 40 = ...

I am at a loss in estimating the frequency of occurence of each of these cases.
The frequency of Optimization 2 (left limit reached) depends on Optimization 1 (right limit), which depends on reordering (Optimization 2) and threfore must be averaged.

So, any help in calculating the probabilities how many times in average each of Optimization 2 cases will occur is welcome.
Let we have a list of Disjoint Sets. My question is how to estimate the optimization benefit for each of the members, and choose the best one?

I tried to collect some statistics empirically, and conclude that
a) Optimized steps distribution depends of the grid and the number of the clues.
b) Scanning for less clues is faster but not representative (couldn't be simply scaled for more clues).
c) First few minutes of scanning are not representative of the whole scan.
d) Scanning itself is slow.

I hope the method in general is not stillborn. It gives good results for some of the grids. For example, on a 2.8GHz PC:

Code: Select all
`439286157176593284258147396381754629795632841624918735542371968867429513913865472 #MCN14 with 17Checked for 17, generated 274 puzzles, found 1 valid.Iterations done in 256.454 seconds123456789456789123798132546237915468864273915915648237342567891581394672679821354 #MCN10 with 9 17sChecked for 17, generated 921280 puzzles, found 9 valid.Iterations done in 94514.610 seconds (=26h15')123456789456789123789123456231564897564897231897231564312645978645978312978312645 #Most CanonicalChecked for 17, generated 0 puzzles, found 0 valid.Iterations done in 1.640 seconds (+ ~15 minutes in enumerating the 495396 maximal cliques)123456789457893612986217354274538196531964827698721435342685971715349268869172543 #Max Minlex GridChecked for 17, generated 0 puzzles, found 0 valid.Iterations done in 182.563 seconds123456789456789123789123465238591647617348952945267318374815296591632874862974531 #MCN 12 with 39Checked for 17, generated 1910 puzzles, found 0 valid.Iterations done in 12359.031 seconds (=3h26')123456789457189326689327154231645897745891632896732541318264975574918263962573418 #Pt MCN15Checked for 20, generated 20234 puzzles, found 6 valid.Iterations done in 500.563 seconds594231786786945132123768954965173248378492561241856397432619875619587423857324619 #MCN12 with 20 17sChecked for 17, generated 11784 puzzles, found 20 valid.Iterations done in 1420.218 seconds (another 2 attempts with different UA collections gave 1003" and 1825")`

Thanks,
MD
dobrichev
2016 Supporter

Posts: 1777
Joined: 24 May 2010

### Re: Need help in some probability calculations

How about one of the methods from http://www.cse.unsw.edu.au/~tw/comic-2006-004.pdf ?
Red Ed

Posts: 633
Joined: 06 June 2005

### Re: Need help in some probability calculations

Red Ed wrote:How about one of the methods from http://www.cse.unsw.edu.au/~tw/comic-2006-004.pdf ?

Thank you for the article.
IMO it is too complex to adapt these methods to the algorithm, or the algorithm to the methods. A simple if...then...else is easy to implement, but our case is min(pos_by_left_clue, pos_by_dj_set). This truncates larger and more complicated branch of the search space which additionally have to be scaled.
But yet reading the paper i got some new idea. I counted the skipped (optimized) steps for each clue, but should count the proportion between skipped and executed steps.
Dynamic switching between 2 mapping strategies is inappropriate - the once remapped space is searched consecutively.
Probing into the search space at random positions is feasible, except when there are too much mapping choices.
Having no criteria to identify the cells within an UA set makes impossible knowledge collected from one grid to be applied to another. So such ordering (if it matters at all) should be done by T&E for each grid [Issues 2 to 5].
I suppose [Issue 1] should have a general solution applicable to all grids.
dobrichev
2016 Supporter

Posts: 1777
Joined: 24 May 2010

### Re: Need help in some probability calculations

Excellent work in attempting to and succeeding in improving on "checker"

I have often wondered how and if "checker" could be improved on - certainly your programming skills are what at least be required.

My thoughts have been - and apologies if these are already considered

1. using the solver of btturner
2. the consideration of two or more disjoint maximal sets of unavoidables [simultaneously]
3. using the information that "certain patterns dont have puzzles" - eg there must be a least one clue in the first two rows of a minlex puzzle. [a u18]. Also many other patterns.....
4.Taking this furthur - in some u18s in certain bands 2 clues are always required in the first 2 rows.
Code: Select all
`+---+---+---+|12.|...|...||...|...|...||987|213|564|+---+---+---+|251|734|698||374|689|215||698|125|347|+---+---+---+|415|867|932||769|342|851||832|591|476|+---+---+---+ `
none of the possible solution grids can be solved with 1 clue in the first 2 rows. There is always 2 unavoidable sets in this instance.
For a repeating minirow band of 27 clues - in the MC grid needs at least 3 clues for each of the 3 u18s.
5. Most bands [of 27clues] needs at least 3 clues in each band [u27] - actually at least 6 in the mc grid. [Actually only 1 27-clue band can be solved with 2 clues]

Summerising 345
we have 36 u18s in a grid which need 1,2,or rarely 3 clues.
we have 6 u27s in a grid which need commonly at least 3 clues - occaisionally require only 2, but may need 4,5,6

6. ? others - we never understood why checker was quick in some grids with the same MCN
7. need 8 clues, one of each digit

C
Last edited by coloin on Thu Jun 17, 2010 12:34 pm, edited 1 time in total.
coloin

Posts: 1864
Joined: 05 May 2005

### Re: Need help in some probability calculations

coloin wrote:...
1. using the solver of btturner
2. the consideration of two or more disjoint maximal sets of unavoidables [simultaneously]
3. using the information that "certain patterns dont have puzzles" - eg there must be a clue in the first two rows of a minlex puzzle. [a u18]. Also many other patterns.....
4. A canonical [repeating miniow] band [of 27 clues] needs at least 2 clues in each of its 3 u18s
5. ? others - we never understood why checker was quick in some grids with the same MCN
6. need 8 clue of each digit

1. The solver is consuming ~0.06% of CPU time. On my machine it works faster than Brian's. Additionally I am exploiting the fact that one of the solutions is known (the original grid), guessing in the right way, which gives extremely fast result for first few solutions.
2. It is still unexplored area. One easy approach is to intersect the disjoint maximal sets and somehow to obtain non-trivial information (beside the fact that shorter UA participate in more DJ).
3. Some of them are UA sets. Others cause insignificant narrowing of the search space.
4. Such structures are too rare. Even in most symmetrical grids some of them are useful, but most not.
Code: Select all
`123456789456789123789123456231564897564897231897231564312645978645978312978312645 #Most CanonicalGroup of 9 DJ sets, each requiring exactly 2 clues = 18 clues, useful{73,76,79,83,86,89,93,96,99} (2,9){72,75,78,82,85,88,92,95,98} (2,9){71,74,77,81,84,87,91,94,97} (2,9){43,46,49,53,56,59,63,66,69} (2,9){42,45,48,52,55,58,62,65,68} (2,9){41,44,47,51,54,57,61,64,67} (2,9){13,16,19,23,26,29,33,36,39} (2,9){12,15,18,22,25,28,32,35,38} (2,9){11,14,17,21,24,27,31,34,37} (2,9)Plenty of DJ sets requiring exactly 2 clues. Large and therefore useless.{46,49,56,57,58,64,65,69,75,78,85,87,89,94,96,98} (16){45,47,49,54,56,58,65,68,74,78,79,85,86,87,94,97} (16){45,46,47,54,57,64,68,69,74,75,79,86,89,96,97,98} (16){45,46,47,49,54,56,57,58,64,65,68,69,74,75,78,79,85,86,87,89,94,96,97,98} (size=24){46,47,48,53,54,55,61,62,69,75,77,79,82,84,86,91,93,98} (18){14,18,19,21,25,26,32,33,37,75,77,79,82,84,86,91,93,98} (18){14,18,19,21,25,26,32,33,37,46,47,48,53,54,55,61,62,69} (18){14,18,19,21,25,26,32,33,37,46,47,48,53,54,55,61,62,69,75,77,79,82,84,86,91,93,98} (size=27)123456789457893612986217354274538196531964827698721435342685971715349268869172543 #Max MinlexGroup requiring 15 clues{73,76,77,83,86,87} (1,6){72,75,79,82,85,89} (1,6){71,74,78,81,84,88} (1,6){43,46,47,53,56,57} (1,6){42,45,49,52,55,59} (1,6){41,44,48,51,54,58} (1,6){13,16,17,23,26,27} (1,6){12,15,19,22,25,29} (1,6){11,14,18,21,24,28} (1,6){37,38,39,67,68,69,97,98,99} (2,9){34,35,36,64,65,66,94,95,96} (2,9){31,32,33,61,62,63,91,92,93} (2,9){16,17,23,27,33,36} (6){13,17,23,26,36,37} (6){13,16,26,27,33,37} (6){13,16,17,23,26,27,33,36,37} (size=9) <- 18 of them{74,75,78,79,84,85,88,89,94,95,98,99} (12){71,72,78,79,81,82,88,89,91,92,98,99} (12){71,72,74,75,81,82,84,85,91,92,94,95} (12){71,72,74,75,78,79,81,82,84,85,88,89,91,92,94,95,98,99} (size=18) <- 36 of them{46,48,49,54,55,57,74,75,77,86,88,89} (12){45,49,55,59,64,68,74,78,84,88,95,99} (12){45,46,48,54,57,59,64,68,75,77,78,84,86,89,95,99} (16){45,46,48,49,54,55,57,59,64,68,74,75,77,78,84,86,88,89,95,99} (size=20) <- 9 of them`

5. Checker actually doesn't benefit from MCN. Its major optimization is that clues are always placed to hit at least one additional UA, which is not the case in my algorithm. I hope the remapping solves partially the MCN vs Time linearity.
I am interested in statistics for Checker. Just 2 columns - grid and time spent.
Others thoughts? Yes, that is what I am asking for.
6. Assuming you meant 8-th clue is always redundand - yes, but like [3] narrowing looks unsignificant. Actually didn't check it.

Thank you!

MD
dobrichev
2016 Supporter

Posts: 1777
Joined: 24 May 2010

### Re: Need help in some probability calculations

indeed you seem to have incorporated many of those points.
thinking about it some more - if you have 3 disjoint u4s in a band - they are going to be already accounted for -
i cant seem to make a band with more than 3 disjoint u4s
Code: Select all
`+---+---+---+|157|243|896||238|196|457||946|578|312|+---+---+---+ this band [139/416] needs 5 clues+---+---+---+|1..|2..|...||238|1.6|45.| |.46|5.8|31.|+---+---+---+  and the 4 u4s are not disjoint`

Code: Select all
`+---+---+---+|237|158|946||156|439|827||489|276|513|+---+---+---+ this band  136/416 needs 6 clues +---+---+---+|2..|1..|.4.||1..|4..|.2.||4..|2..|.1.|+---+---+---+ one of the u9s that needs 2 clues - as you mentioned`

the bands are classified with index416.exe
There might even be a band that needs 7 clues dukuso must have done that some where !
From 2005, not quite what i am asking but he mentions the [9,2]
dukuso wrote:here are the bands with more than 4 4-unavoidable-sets
read vertically by minicolumns, so the first row is always 123456789

5,146258379413582697725831964
5,146258379413587692725831964
5,148256379412583697724831965
5,148256379413582697725831964
5,149256378412587693721834965
5,149258376412587693725834961
5,149256378417583692724865931
5,146258379412583697731825964
5,148259376412583697734821965
5,146258379412583697735864921
5,149256378413587692735861924
6,148256379413587692724831965
6,149256378413587692725831964
6,149256378413587692724865931
6,146258379412587693735821964
6,146258379413582697731825964
6,146259378413587692731825964
6,148256379412583697735821964
6,148256379412597683734865921
7,148256379417583692734821965

interesting the other (9,2) unavoidable set .......

The band with 7 u4s - it does not have 4 disjoint u4s.......
Code: Select all
`+---+---+---+|123|456|789||457|189|326||869|732|415|+---+---+---+  which needs at least 6 clues.......and it is band 145/416`

I suppose any grid which has 3 horizontal or vertical bands which all require 6 clues - doesn't have a 17 !!!

C
coloin

Posts: 1864
Joined: 05 May 2005

### Re: Need help in some probability calculations

coloin wrote:indeed you seem to have incorporated many of those points.

Nope! This was a research which prompted this is not the right way.

coloin wrote:I suppose any grid which has 3 horizontal or vertical bands which all require 6 clues - doesn't have a 17 !!!

Although this is off-topic, could we count the grids consisting of bands requiring (7 + 6 + 5) or (6 + 6 + 6) clues?
Are the listed bands S-classes (from the gang of 416)?
The Minlex catalogue is ordered by minlex isomorph of the S-class of the top band. Is it sufficient to walk trough the topmost bands-of-interest, then traverse the tree normalizing 2-nd and 3-rd band? How fast such normalization (hope incorporated in index416.exe) is?
dobrichev
2016 Supporter

Posts: 1777
Joined: 24 May 2010

### Re: Need help in some probability calculations

I have done the min clues needed for each of the 416 bands [from memory there is no 7 but there is one 2]
There looks to be around 10 bands which always need 6 clues, and numerous 5s - will append when i have results to hand.

The structures thread might be the best place.

The {6+6+6} grids are likely to be easily searchable anyway in terms of computation time !!!!!!

Index416 is lightning quick. it would be easy to do a search with a B123 and B147 cross. But getting grids with lower starting bands than the one in question might require a degree of reprogramming.
coloin

Posts: 1864
Joined: 05 May 2005

### Re: Need help in some probability calculations

dobrichev wrote:Although this is off-topic, could we count the grids consisting of bands requiring (7 + 6 + 5) or (6 + 6 + 6) clues?
Are the listed bands S-classes (from the gang of 416)?
The Minlex catalogue is ordered by minlex isomorph of the S-class of the top band. Is it sufficient to walk trough the topmost bands-of-interest, then traverse the tree normalizing 2-nd and 3-rd band? How fast such normalization (hope incorporated in index416.exe) is?

these options to my solver list the sorted band signature for each input grid
Code: Select all
`-q- -f%#bc`

~30,000 puz/sec/Ghz
the compressed catalog of all S-grids in minlex order, on a 8Gib microSD, is in the mail, delivery expected sometime next week
if you haven't generated it yet ask dukuso to send it to you next
gsf
2014 Supporter

Posts: 7306
Joined: 21 September 2005
Location: NJ USA

### Re: Need help in some probability calculations

coloin wrote:would be easy to do a search with a B123 and B147 cross.

I wrote something to do that (more or less) for the 2x4 min clues problem a few weeks back. Looked like with some code optimisation it could have proved/disproved that there's no 13-clue puzzle; but I didn't want to run it for the necessary couple of months only to have a single-line output "sorry, guv" as I'd never have been convinced that it had really worked.
Red Ed

Posts: 633
Joined: 06 June 2005

### Re: Need help in some probability calculations

gsf wrote:
dobrichev wrote:Although this is off-topic, could we count the grids consisting of bands requiring (7 + 6 + 5) or (6 + 6 + 6) clues?
Are the listed bands S-classes (from the gang of 416)?
The Minlex catalogue is ordered by minlex isomorph of the S-class of the top band. Is it sufficient to walk trough the topmost bands-of-interest, then traverse the tree normalizing 2-nd and 3-rd band? How fast such normalization (hope incorporated in index416.exe) is?

these options to my solver list the sorted band signature for each input grid
Code: Select all
`-q- -f%#bc`

~30,000 puz/sec/Ghz
the compressed catalog of all S-grids in minlex order, on a 8Gib microSD, is in the mail, delivery expected sometime next week
if you haven't generated it yet ask dukuso to send it to you next

Thank you!
I have generated the catalog.
I simply forgot the vertical bands ... too much calculations.
dobrichev
2016 Supporter

Posts: 1777
Joined: 24 May 2010

### Re: Need help in some probability calculations

Here is the breakdown of the 416 bands and the minimum clues for each band.
Code: Select all
`                                                                        2-   1                                                                        3-  99                                                                        4- 258                                                                        5-  52                                                                        6-   6     `

band 29 can be completed in 2 clues

The bands which require minimum of 6 clues, and the number of ways to complete it
Code: Select all
`                                                                        band    2   3   4   5      6                                                  132     0   0   0   0    162                                                  145     0   0   0   0    606                                                  1       0   0   0   0    729                                                  413     0   0   0   0    729                                                  168     0   0   0   0   2133                                                  202     0   0   0   0   2133 `

The bands which require minimum of 5 clues, and the number of ways to complete it
Code: Select all
`                band    2   3   4   5   35      0   0   0   63  127     0   0   0   63  53      0   0   0   78  148     0   0   0   90  34      0   0   0   90  150     0   0   0   90  164     0   0   0   108 188     0   0   0   156 54      0   0   0   168 138     0   0   0   177 271     0   0   0   213 119     0   0   0   244 4       0   0   0   252 415     0   0   0   252 116     0   0   0   258 139     0   0   0   300 131     0   0   0   301 70      0   0   0   309 74      0   0   0   312 2       0   0   0   324 407     0   0   0   324 414     0   0   0   324 81      0   0   0   326 146     0   0   0   331 160     0   0   0   374 38      0   0   0   378 149     0   0   0   410 67      0   0   0   421 193     0   0   0   485 52      0   0   0   490 55      0   0   0   499 51      0   0   0   527 380     0   0   0   529 135     0   0   0   539 147     0   0   0   540 173     0   0   0   543 63      0   0   0   546 297     0   0   0   594 170     0   0   0   692 292     0   0   0   777 167     0   0   0   852 232     0   0   0   852 166     0   0   0   864 400     0   0   0   864 16      0   0   0   972 8       0   0   0   1107190     0   0   0   1131240     0   0   0   1131212     0   0   0   1275224     0   0   0   1275201     0   0   0   1626378     0   0   0   1626`

It remains to be seen if this can be usefully employed
coloin

Posts: 1864
Joined: 05 May 2005

### Re: Need help in some probability calculations

same calculation for
2 bands = 6*9 sudokus (~150M estimated classes)
6*6 sudokus (B5689)
B12347 sudokus
3-rookeries
4-rookeries
...

can it be done ? had it been done ?
dukuso

Posts: 479
Joined: 25 June 2005

### Re: Need help in some probability calculations

Done a few weeks back for 2Ã—4 sudoku bands and stacks as part of the program that disproves(? - if I left it running for a couple of months) the existence of a 13-clue 2Ã—4 sudoku puzzle. But I've not done similar for any 3Ã—3 problems.
Red Ed

Posts: 633
Joined: 06 June 2005

### Re: Need help in some probability calculations

dukuso wrote:same calculation for
2 bands = 6*9 sudokus (~150M estimated classes)
? had it been done ?

Well - i cant prove it however but there is to my knowledge no 17-puzzle with 10 clues in band one, 4 clues in band 2 and 3 clues in band 3. So the min is [3,5,9 or 4,4,9] goes some way to confirm that minimum number of clues for a 3*3 is indeed greater than 12 !!! [although we "sort of" know its 17].

Plus 10 clues might be a reasonable minimum [quess]to cover the internal unavoidables in the 54 clues in many of the 150M estim. 2 bands.[The vast majority of grids do have a 19 - so the vast majority must have 12 clues or less somewhere] A plus 10 over 54 clues does take a long time - but not if dobrichev could implement it. Not sure if knowing the figure for double bands helps chose which unavoidables to employ........

There is also a B12347 which solves in 2 clues [ from a long time ago]. plus 4 clues might be the normal minimum. This is even less helpful.

C
Last edited by coloin on Tue Oct 26, 2010 6:42 pm, edited 1 time in total.
coloin

Posts: 1864
Joined: 05 May 2005

Next