Min-templates Canonical Form and Catalog

Everything about Sudoku that doesn't fit in one of the other sections

Min-templates Canonical Form and Catalog

Postby dobrichev » Thu May 09, 2024 11:13 pm

For Sudoku solution grid, consider canonical form following these rules
  • Choose an ordering rule of all 46656 1-digit templates. My choise is descending lexicographical order when template is represented by 1 and . on 81-char line.
  • First row has values 1 - 9, in this order.
  • Value 1 always fits the predefined "minimal" template.
  • Values 2 - 8 fit respective templates, so that the ordered list of indexes of templates is minimal.

This is similar to Anchor-5 canonicalization proposed by David P Bird which uses other anchored template, fixed box 5 instead of row 1, and finally row-min-lex.

This is the common part
Code: Select all
123456789
...1.....
......1..
.1.......
....1....
.......1.
..1......
.....1...
........1


In this form the grid can be represented as ordered sequence of 7 template indexes (for value 1 template is fixed, and value 9 takes the cells left unoccupied by other templates).

Canonicalization

Chosing the first template fixes rows-in-a-band permutations as well as columns-in-a-stack ones.
Templates for (9) values are initial candidates for the fixed first template, and they must be checked in their (6) band permutations, (6) stack permutations, and (2) transpose option. 9 * 6 * 6 * 2 = 648, the well known number.

Lookup tables
  • 46656 ordered teplates for binary search. Knowing positions of particular value to find the template index.
  • 46656 x 72 transformations - all 72 transformations placing the respective template to the minimal position for value 1.
  • Optional 7 tables mapping for column 2..8 raw template index to reduced one.
  • Optional 7 tables with template cell positions for reduced indexes.

Steps are
Code: Select all
for each source_value 1..9 {
   transform grid in all 72 ways so that this source_value fits the minimal template for target_value 1
}
for each 648 transformed grids {
   for each r1_cell in 2..8 {
      find template index for this r1_cell
      discard transformed grid except the index is minimal
   }
}
all survived grids are equal and you can pick first one.
If you need indexes, remap the 7 template indexes to reduced ones.
If you need grid, either rebuld it from indexes or remap values so that firs row has 1..9.


Canonicalization troughput is in order of 20 000 - 50 000 grids/sec. Edit: Actually 2 000 to 5 000 grids/sec which is 10 times slower than ordinary minlexing and much more than band cycles minlexing.

For building grid from indexes just start from grid with predefined values set and all rest populated with 9, then loop over the templlates seting values for rows 2..9 from lookup table.
Troughput is several minutes for the whole catalog.

Catalog

As said above, any grid can be represented by 7 template indexes.

The values of indexes can be reduced. All templates starting from r1cX form separate range of 46656 / 9 = 5184 templates.
Further, all templates for c2..c8 must be disjoint to the fixed template for c1. This reduces the options respectively to 2097 2097 2097 2396 2396 2097 2396.
Further, not all templates are in use for a real canonicalized grid. Similarly to the fact that not all bands can be start band in the row-min-lex canonical form. Used templates for c2..c8 then become 59 1664 1951 2284 2336 1643 1899.
Further reductions depend on already fixed templates. For example, taking into account that the template for any new column must be disjoint to previously fixed templates (which can be computed on the fly), the maximal values in indexes is reduced to 59 854 416 156 60 72 8.

For catalog of indexes in range 59 1664 1951 2284 2336 1643 1899 the record can be 13 bytes - one byte for c2 and 2 bytes for c3..8. This gives 66 GB file. Compression with gzip reduces it to 18 GB, and zpaq reduces it to 5.9 GB (9.2 bits/grid). Zpaq is very slow, even in decompression.
The same indexes can be represented in 72-bit record (9 bytes instead of 13) by combining 1664 1951 2284 1643 into a 44-bit field and 59 2336 1899 into a 28-bit field. This gives 46 GB file which is poorly compressed further.

Catalog of indexes in range 59 854 416 156 60 72 8 initially stored using 13 bytes per record is compressed by zpaq to 2.8 GB, more precisely to 4.32381941 bits/grid. This is a new record, better than gsf's catalog reportedly compressed to 8.38 bits/grid. As in the gsf's case, the decompression is slow.
Converting indexes to grid from raw file isn't tested yet, but is expected to be slow and useful either for entire catalog extraction (expectations are about 8 hours) or for very small amount of single grids where the overhead related to dynamic reindexing will be large.

If catalog is split to windows, additional hints for non-used templates could be included.
For c2 all 59 windows can be hardcoded in the application.
For c2 + c3 there are 24353 windows.

Similarly to the row-min-lex catalog, some c2 templates closer to the beginning have huge number grids. At the end few grids start by the same c2 tempate.

The indexes in 59 1664 1951 2284 2336 1643 1899 for the top and bottom of the catalog are
Hidden Text: Show
Code: Select all
0   157   736   1561   2261   69   734
0   157   736   1561   2261   69   905
0   157   736   1561   2261   71   729
0   157   736   1561   2261   71   911
0   157   736   1561   2261   71   1204
0   157   736   1561   2261   71   1385
0   157   736   1561   2261   78   899
...
54   322   477   1912   1594   1640   714
54   322   1931   1416   610   1486   845
54   339   1653   1568   601   1630   845
54   339   1931   1416   600   1468   845
54   404   1931   1416   726   1410   506
54   407   1931   1416   724   1408   506
54   420   1931   1416   749   1393   506
54   423   1931   1416   747   1391   506
54   433   1931   1416   610   1486   506
54   444   1931   1416   600   1468   506
54   616   1887   1385   749   1403   431
54   644   1379   1878   652   503   57
54   648   1753   1450   652   503   57
54   764   303   2272   1060   150   1501
55   431   1946   47   1451   428   1399
56   62    984   1436   2103   886   202
56   506   1409   79   2008   866   905
56   728   1041   28   1933   1026   1727
57   37    823   2107   1678   39   1355
57   91    823   2107   1706   39   1257
57   462   211   2060   1745   103   1303
57   1137   984   109   1875   1424   521
57   1293   290   1633   1154   1209   194
58   20    823   2105   1686   39   1374
58   78    823   2105   1729   39   1247

Note that c2=55 has a single grid.

The same indexes in 59 854 416 156 60 72 8 notation (dynamic reduction) are
Hidden Text: Show
Code: Select all
0   0   0   0   0   0   1
0   0   0   0   0   1   0
0   0   0   0   0   1   1
0   0   0   0   0   1   2
0   0   0   0   0   1   3
0   0   0   0   0   3   1
0   0   0   0   0   3   2
0   0   0   0   0   3   3
0   0   0   0   0   7   1
0   0   0   0   0   8   1
...
54   103   36   42   2   7   0
54   103   189   37   14   5   1
54   115   144   54   13   4   0
54   115   174   38   14   2   1
54   146   284   36   19   11   0
54   148   286   34   20   9   0
54   154   281   33   20   6   0
54   156   283   31   20   7   0
54   159   268   32   15   4   0
54   167   255   28   12   3   0
54   258   269   40   13   6   0
54   275   160   42   14   4   0
54   278   264   65   14   4   0
54   350   28   64   8   0   0
55   146   273   2   0   2   0
56   45   55   7   3   9   0
56   196   150   5   8   3   0
56   317   114   1   0   0   1
57   20   26   42   1   0   1
57   57   22   43   3   1   1
57   179   21   40   20   0   1
57   564   127   6   0   4   0
57   647   53   77   24   4   0
58   12   19   60   1   0   1
58   52   31   40   2   3   1

Later I will provide some examples with grids in this canonical form.
Last edited by dobrichev on Tue May 14, 2024 2:23 pm, edited 1 time in total.
dobrichev
2016 Supporter
 
Posts: 1859
Joined: 24 May 2010

Re: Min-templates Canonical Form and Catalog

Postby champagne » Fri May 10, 2024 7:31 am

Hi mladen,


dobrichev wrote:
Canonicalization troughput is in order of 20 000 - 50 000 grids/sec.

For building grid from indexes just start from grid with predefined values set and all rest populated with 9, then loop over the templlates seting values for rows 2..9 from lookup table.
Troughput is several minutes for the whole catalog.


If I read it in the proper way, this means that the whole catalog could be "built" in less than one hour. I expect around one core day on my old i7-4770
This could be an alternative to what I did, but the problem remains the agreement of the community on this canonical form.

One remark:
We have anyway 5472730538 ED solution grids. The simplest short for a solution grid remains it's rank 0- 5472730537 or 1-5472730538
and then, you don't need the file itself, but as I did, a simple way to go from the rank to the canonical solution grid. ;) ;)

And I have to find time to study your index structure :roll:
champagne
2017 Supporter
 
Posts: 7409
Joined: 02 August 2007
Location: France Brittany

Re: Min-templates Canonical Form and Catalog

Postby dobrichev » Fri May 10, 2024 1:39 pm

Hi Gerard,

I am not considering this canonical form as popular replacement of the row-min-lex form. It is just an alternative which could be in use if / when it performs some tasks easier or significantly faster.

The whole catalog can't be built from scratch in less than a hour.
Using 13 bytes/grid file (66 GB) or 9 bytes/grid file (46 GB), all grids can be converted to 81-byte textual form within 10-20 minutes. Similarly the 10 bytes/grid minlex catalog (51 GB) can output all grids in textual form for 20-25 minutes.

Transforming minlex grids to min-templates took me about 2 CPU weeks. Meanwhile I did some improvements and now the timings are uncertain but expected to be between 20 and 50 K grids/second.

Building a catalog from scratch so far is out of the scope. I'm not competing with you in building a catalog from scratch :)

Template Indexes

Each grid can be represented as ordered sequence of 9 template indexes where first index corresponds to values for 1, second for 2, and so on. A template is one of 46656 valid single-digit placements.
46656^9 gives a upper limit for all possible grids, but we know most of them are invalid and therefore the list of real grids has huge gaps in this representation.

We want to reduce the indexes.

We can order templates so that those 46656/9 = 5184 having value in r1c1 are at the top of the list, followed by next 5184 that occupy r1c2, and so on.

Assuming grid representation re-labeled so that first row is 123456789, reducing the indexes for each position to 5184.

Next, obviously, the last template occupies the remaining cells after placing the first 8 templates, so we don't need it's index. At this point we need 8 indexes, each in range 0-5183.

Next, we can fix first template to a predefined one by permuting rows/cols/bands/stacks and transpose. This fixes template for r1c1 to constant and we don't need to store it. At this point we need 7 indexes, each in range 0-5183.

Since there is more than one possibility for transforming grid so that particular template becomes the predefined one for r1c1, we can add constaint that within all possible transformations, the one that results in minimal sequence of indexes is the chosen grid representation. Minimal means the first index of best representation <= first index of other representation, then same for second, third ... indexes.

Next we can reduce the indexes taking into account that any template must be disjoint to the fixed template for r1c1. At this point we need 7 indexes with range 2097 2097 2097 2396 2396 2097 2396 respectively. There are 2097 templates for r1c2 disjoint to predefined template for rc1, etc.

Next, after composing the catalog of all essentially different grids, we observe that not all templates are used for real canonical grid. It happens that for the chosen template ordering the possibilities for r1c2, r1c3, etc. are reduced respectively to 59 1664 1951 2284 2336 1643 1899.

Up to this point the index reduction is easy for hardcoding.
Record for each grid can be stored as one byte for first index plus 6x2 bytes for next 6 indexes = 13 bytes/record = 66 GB file.
Alternatively by encoding indexes into 2 integers and merging them we can store the same information in a 9 bytes record = 46 GB file. Since the bits within the record are rearranged the neighbour records aren't similar and the file can't be compressed further.
Both representations are good for practical purposes. Each record is self-consistent (catalog is not windowed, we can compose a catalog of any subset of the grids), binary search is possible (record comparison is fast and record size is fixed).

On fixed template for c2 we can further reduce the index range for c3, c4 and so on, taking into account that each next template has to be disjoint from the previous ones.
This increases the processing time both for encoding and decoding. Transforming from 13 bytes/record took about 8 CPU hours.
The global ranges after such reduction are 59 854 416 156 60 72 8.
59 for the c2 is untouched af course.
854 for the c3 is the maximal index within the whole catalog. For at least one of the 59 c2 templates, 854 valid options for c3 exist.

My understanding is that any further index reduction should discard, in particular context, the impossible templates, where impossibility is due to
  • resulting invalid grid due to non-empy intersection of current template with already fixed templates. This can be calculated on the fly.
  • invalid grid due to impossibility to place futher templates. We know such exotic cases exist after fixing 4 templates.
  • valid grid which has duplicate with lesser indexes,
Such reduction can be represented as bitmap for the particular index marking valid items in the parent context.
c2 consists of 6 bitmaps reducing indexes for c3...c8.
c3 within the context of each c2 consists of 5 shorter bitmaps reducing indexes for c4..c8.

Ideally, after such reduction, the record itsefl loses sence because each remaining valid index combination represents a valid grid. (Hmmm, isn't this building a catalog from scratch?)

Whether the bitmaps should account for disjoints (more space but possibly faster) is open issue.
Whether these bitmaps should be compressed, or at least reduced to the latest set bit, is unknown.
How to serialize such structure in a file isn't clear too.

I doubt this will reach any final even of academic interest. So far I am satisfied by providing a catalog requiring 4.32381941 bits/grid having capacity to be uncompressed to performant catalog within, say, 1 day.
dobrichev
2016 Supporter
 
Posts: 1859
Joined: 24 May 2010

Re: Min-templates Canonical Form and Catalog

Postby champagne » Sun May 12, 2024 7:24 am

Hi mladen,
I did not by far cover all your logic, but I think that this could be an alternative way in my process "after the row 4". Building on the fly still valid templates for the last 5 rows seems a promising idea against the classical cell by cell expansion.
Cheers
Gérard
champagne
2017 Supporter
 
Posts: 7409
Joined: 02 August 2007
Location: France Brittany

Re: Min-templates Canonical Form and Catalog

Postby coloin » Sun May 12, 2024 11:29 am

Before you embark on big stuff... a thought
Consider that there are only 5 ED ways to have clues in this pattern
Code: Select all
+---+---+---+
|123|456|789|
|4..|...|...|
|5..|...|...|
+---+---+---+
|6..|...|...|
|7..|...|...|
|8..|...|...|
+---+---+---+
|9..|...|...|
|2..|...|...|
|3..|...|...|
+---+---+---+

There are 81 instances of this pattern in every grid
but only 5 ED !! -here are their incidence in 100000 grids .. one per grid
Code: Select all
........1........2........3........4........5........6........7........8123478569   [#1]    4363
........1........2........3........4........5........6........7........8124378569   [#2]   13315
........1........2........3........4........5........6........7........8124578369   [#3]   48617
........1........2........3........4........5........6........7........8127458369   [#4]   11140
........1........2........3........4........5........6........7........8147258369   [#5]   22565
                                                                                          100000
........1........2........3........4........5........6........7........81....8.69         constant clues

Taking the most common of these there will be only a few grids which dont have this in one of their 81 :idea:

The number of ways to add the first template [r1c1 clue] is 5184 ways but reduced by equivalencies
Successive templates will be much less too

here are some and their solution counts, if the clue at r1c1 is the template clue the count is x10 less !
Code: Select all
1234567894..1.....5.....1..7....1...81.......9.......12........3...1....6......1.   24   154412155
1234567894..1.....5.....1..7....1...81.......9.......12...1....3......1.6........   24   157946797
1234567894..2.....5.....2..7....2...8.2......9......2.2........3.......26...2....   24   1304115226


im not sure how you would process the "valid grid which has duplicate with lesser indexes" pruning though.

Edit - constant clues added
Edit - 1000/1000 random grids had the #3 occurance of clues, will try with 10000 !
Last edited by coloin on Tue May 14, 2024 8:27 pm, edited 2 times in total.
coloin
 
Posts: 2449
Joined: 05 May 2005
Location: Devon

Re: Min-templates Canonical Form and Catalog

Postby champagne » Sun May 12, 2024 2:36 pm

Hi coloin,

The idea is not to change the key option, a process building the catalog in the min lexical order, but to see if the template view after implementation of rows 1-4 can speed up the end of the process. As noticed mladen, applying all constraints, and mainly the necessity to have disjoints template downstream reduces drastically the number of valid patterns. It seems to me that this is similar to the concept of remaining disjoints UAs recently applied with success in the 17 clues scan search of valid bands 1+2.

But from the idea to something feasible, there is a lot of work :roll: :roll:
champagne
2017 Supporter
 
Posts: 7409
Joined: 02 August 2007
Location: France Brittany

Re: Min-templates Canonical Form and Catalog

Postby coloin » Sun May 12, 2024 3:15 pm

champagne wrote: but to see if the template view after implementation of rows 1-4 can speed up the end of the process..

no ...as I understand it Mladen is just applying successive templates and not in row minlex order... more like template minlex order .. if that is a thing.
coloin
 
Posts: 2449
Joined: 05 May 2005
Location: Devon

Re: Min-templates Canonical Form and Catalog

Postby champagne » Sun May 12, 2024 9:56 pm

coloin wrote:
champagne wrote: but to see if the template view after implementation of rows 1-4 can speed up the end of the process..

no ...as I understand it Mladen is just applying successive templates and not in row minlex order... more like template minlex order .. if that is a thing.

this is what mladen did, some hours later, I am more thinking to see if the template view can help for the last band where the first column is locked.
This gives at most 2x3 possible disjoint 3templates locked for other digits. More may be to-morrow
champagne
2017 Supporter
 
Posts: 7409
Joined: 02 August 2007
Location: France Brittany

Re: Min-templates Canonical Form and Catalog

Postby coloin » Mon May 13, 2024 5:10 pm

Good if you can make it work for you , if it speeds up things even more.
I think Mladen's process is good, he has harnessed the reductions with a fixed 1 template, and there are only 7 [ as opposed to 8 ] ways for the remaining templates.
This 1 template clues are not obligate clues in minlex solution grids.... and adding more [constant] digits negates the advantage of having ony one option for template 1
But "taking into account that each next template has to be disjoint from the previous ones." representing dynamic reduction is excellent. [59 854 416 156 60 72 8]
coloin
 
Posts: 2449
Joined: 05 May 2005
Location: Devon

Re: Min-templates Canonical Form and Catalog

Postby dobrichev » Tue May 14, 2024 2:32 pm

Bad news is that w/o additional large lookups the canonicalization is 10 times slower than my first measuring. I edited initial post, changing the value to 2-5 K grids/sec.
Using Gerard's approach for catalog generation / extraction / indexing would take forever in its brute force part.

Now I am estimating the data volume required for providing additional hints.
dobrichev
2016 Supporter
 
Posts: 1859
Joined: 24 May 2010

Re: Min-templates Canonical Form and Catalog

Postby champagne » Thu May 16, 2024 6:34 am

As I wrote in another thread, I am interested to know if this has been applied to each gangster to know what ED solution bands are valid for each of them.
If not, I intend to open a separate thread to study this
champagne
2017 Supporter
 
Posts: 7409
Joined: 02 August 2007
Location: France Brittany


Return to General