- Choose an ordering rule of all 46656 1-digit templates. My choise is descending lexicographical order when template is represented by 1 and . on 81-char line.
- First row has values 1 - 9, in this order.
- Value 1 always fits the predefined "minimal" template.
- Values 2 - 8 fit respective templates, so that the ordered list of indexes of templates is minimal.
This is similar to Anchor-5 canonicalization proposed by David P Bird which uses other anchored template, fixed box 5 instead of row 1, and finally row-min-lex.
This is the common part
- Code: Select all
123456789
...1.....
......1..
.1.......
....1....
.......1.
..1......
.....1...
........1
In this form the grid can be represented as ordered sequence of 7 template indexes (for value 1 template is fixed, and value 9 takes the cells left unoccupied by other templates).
Canonicalization
Chosing the first template fixes rows-in-a-band permutations as well as columns-in-a-stack ones.
Templates for (9) values are initial candidates for the fixed first template, and they must be checked in their (6) band permutations, (6) stack permutations, and (2) transpose option. 9 * 6 * 6 * 2 = 648, the well known number.
Lookup tables
- 46656 ordered teplates for binary search. Knowing positions of particular value to find the template index.
- 46656 x 72 transformations - all 72 transformations placing the respective template to the minimal position for value 1.
- Optional 7 tables mapping for column 2..8 raw template index to reduced one.
- Optional 7 tables with template cell positions for reduced indexes.
Steps are
- Code: Select all
for each source_value 1..9 {
transform grid in all 72 ways so that this source_value fits the minimal template for target_value 1
}
for each 648 transformed grids {
for each r1_cell in 2..8 {
find template index for this r1_cell
discard transformed grid except the index is minimal
}
}
all survived grids are equal and you can pick first one.
If you need indexes, remap the 7 template indexes to reduced ones.
If you need grid, either rebuld it from indexes or remap values so that firs row has 1..9.
Canonicalization troughput is in order of
For building grid from indexes just start from grid with predefined values set and all rest populated with 9, then loop over the templlates seting values for rows 2..9 from lookup table.
Troughput is several minutes for the whole catalog.
Catalog
As said above, any grid can be represented by 7 template indexes.
The values of indexes can be reduced. All templates starting from r1cX form separate range of 46656 / 9 = 5184 templates.
Further, all templates for c2..c8 must be disjoint to the fixed template for c1. This reduces the options respectively to 2097 2097 2097 2396 2396 2097 2396.
Further, not all templates are in use for a real canonicalized grid. Similarly to the fact that not all bands can be start band in the row-min-lex canonical form. Used templates for c2..c8 then become 59 1664 1951 2284 2336 1643 1899.
Further reductions depend on already fixed templates. For example, taking into account that the template for any new column must be disjoint to previously fixed templates (which can be computed on the fly), the maximal values in indexes is reduced to 59 854 416 156 60 72 8.
For catalog of indexes in range 59 1664 1951 2284 2336 1643 1899 the record can be 13 bytes - one byte for c2 and 2 bytes for c3..8. This gives 66 GB file. Compression with gzip reduces it to 18 GB, and zpaq reduces it to 5.9 GB (9.2 bits/grid). Zpaq is very slow, even in decompression.
The same indexes can be represented in 72-bit record (9 bytes instead of 13) by combining 1664 1951 2284 1643 into a 44-bit field and 59 2336 1899 into a 28-bit field. This gives 46 GB file which is poorly compressed further.
Catalog of indexes in range 59 854 416 156 60 72 8 initially stored using 13 bytes per record is compressed by zpaq to 2.8 GB, more precisely to 4.32381941 bits/grid. This is a new record, better than gsf's catalog reportedly compressed to 8.38 bits/grid. As in the gsf's case, the decompression is slow.
Converting indexes to grid from raw file isn't tested yet, but is expected to be slow and useful either for entire catalog extraction (expectations are about 8 hours) or for very small amount of single grids where the overhead related to dynamic reindexing will be large.
If catalog is split to windows, additional hints for non-used templates could be included.
For c2 all 59 windows can be hardcoded in the application.
For c2 + c3 there are 24353 windows.
Similarly to the row-min-lex catalog, some c2 templates closer to the beginning have huge number grids. At the end few grids start by the same c2 tempate.
The indexes in 59 1664 1951 2284 2336 1643 1899 for the top and bottom of the catalog are
Hidden Text: Show
The same indexes in 59 854 416 156 60 72 8 notation (dynamic reduction) are
Hidden Text: Show
Later I will provide some examples with grids in this canonical form.