The New Sudoku Players' Forum

by **dobrichev** » Thu May 09, 2024 11:13 pm

For Sudoku solution grid, consider canonical form following these rules

Choose an ordering rule of all 46656 1-digit templates. My choise is descending lexicographical order when template is represented by 1 and . on 81-char line.
First row has values 1 - 9, in this order.
Value 1 always fits the predefined "minimal" template.
Values 2 - 8 fit respective templates, so that the ordered list of indexes of templates is minimal.

This is similar to Anchor-5 canonicalization proposed by David P Bird which uses other anchored template, fixed box 5 instead of row 1, and finally row-min-lex.

This is the common part

Code: Select all: 123456789 ...1..... ......1.. .1....... ....1.... .......1. ..1...... .....1... ........1

In this form the grid can be represented as ordered sequence of 7 template indexes (for value 1 template is fixed, and value 9 takes the cells left unoccupied by other templates).

Canonicalization

Chosing the first template fixes rows-in-a-band permutations as well as columns-in-a-stack ones.
Templates for (9) values are initial candidates for the fixed first template, and they must be checked in their (6) band permutations, (6) stack permutations, and (2) transpose option. 9 * 6 * 6 * 2 = 648, the well known number.

Lookup tables

46656 ordered teplates for binary search. Knowing positions of particular value to find the template index.
46656 x 72 transformations - all 72 transformations placing the respective template to the minimal position for value 1.
Optional 7 tables mapping for column 2..8 raw template index to reduced one.
Optional 7 tables with template cell positions for reduced indexes.

Steps are

Code: Select all: for each source_value 1..9 { transform grid in all 72 ways so that this source_value fits the minimal template for target_value 1 } for each 648 transformed grids { for each r1_cell in 2..8 { find template index for this r1_cell discard transformed grid except the index is minimal } } all survived grids are equal and you can pick first one. If you need indexes, remap the 7 template indexes to reduced ones. If you need grid, either rebuld it from indexes or remap values so that firs row has 1..9.

Canonicalization troughput is in order of ~~20 000 - 50 000 grids/sec~~. Edit: Actually 2 000 to 5 000 grids/sec which is 10 times slower than ordinary minlexing and much more than band cycles minlexing.

For building grid from indexes just start from grid with predefined values set and all rest populated with 9, then loop over the templlates seting values for rows 2..9 from lookup table.
Troughput is several minutes for the whole catalog.

Catalog

As said above, any grid can be represented by 7 template indexes.

The values of indexes can be reduced. All templates starting from r1cX form separate range of 46656 / 9 = 5184 templates.
Further, all templates for c2..c8 must be disjoint to the fixed template for c1. This reduces the options respectively to 2097 2097 2097 2396 2396 2097 2396.
Further, not all templates are in use for a real canonicalized grid. Similarly to the fact that not all bands can be start band in the row-min-lex canonical form. Used templates for c2..c8 then become 59 1664 1951 2284 2336 1643 1899.
Further reductions depend on already fixed templates. For example, taking into account that the template for any new column must be disjoint to previously fixed templates (which can be computed on the fly), the maximal values in indexes is reduced to 59 854 416 156 60 72 8.

For catalog of indexes in range 59 1664 1951 2284 2336 1643 1899 the record can be 13 bytes - one byte for c2 and 2 bytes for c3..8. This gives 66 GB file. Compression with gzip reduces it to 18 GB, and zpaq reduces it to 5.9 GB (9.2 bits/grid). Zpaq is very slow, even in decompression.
The same indexes can be represented in 72-bit record (9 bytes instead of 13) by combining 1664 1951 2284 1643 into a 44-bit field and 59 2336 1899 into a 28-bit field. This gives 46 GB file which is poorly compressed further.

Catalog of indexes in range 59 854 416 156 60 72 8 initially stored using 13 bytes per record is compressed by zpaq to 2.8 GB, more precisely to 4.32381941 bits/grid. This is a new record, better than gsf's catalog reportedly compressed to 8.38 bits/grid. As in the gsf's case, the decompression is slow.
Converting indexes to grid from raw file isn't tested yet, but is expected to be slow and useful either for entire catalog extraction (expectations are about 8 hours) or for very small amount of single grids where the overhead related to dynamic reindexing will be large.

If catalog is split to windows, additional hints for non-used templates could be included.
For c2 all 59 windows can be hardcoded in the application.
For c2 + c3 there are 24353 windows.

Similarly to the row-min-lex catalog, some c2 templates closer to the beginning have huge number grids. At the end few grids start by the same c2 tempate.

The indexes in 59 1664 1951 2284 2336 1643 1899 for the top and bottom of the catalog are

Hidden Text: Show

The same indexes in 59 854 416 156 60 72 8 notation (dynamic reduction) are

Hidden Text: Show

Later I will provide some examples with grids in this canonical form.

by **champagne** » Fri May 10, 2024 7:31 am

Hi mladen,

dobrichev wrote:
Canonicalization troughput is in order of 20 000 - 50 000 grids/sec.

For building grid from indexes just start from grid with predefined values set and all rest populated with 9, then loop over the templlates seting values for rows 2..9 from lookup table.
Troughput is several minutes for the whole catalog.

If I read it in the proper way, this means that the whole catalog could be "built" in less than one hour. I expect around one core day on my old i7-4770
This could be an alternative to what I did, but the problem remains the agreement of the community on this canonical form.

One remark:
We have anyway 5472730538 ED solution grids. The simplest short for a solution grid remains it's rank 0- 5472730537 or 1-5472730538
and then, you don't need the file itself, but as I did, a simple way to go from the rank to the canonical solution grid.

And I have to find time to study your index structure :roll:

by **dobrichev** » Fri May 10, 2024 1:39 pm

Hi Gerard,

I am not considering this canonical form as popular replacement of the row-min-lex form. It is just an alternative which could be in use if / when it performs some tasks easier or significantly faster.

The whole catalog can't be built from scratch in less than a hour.
Using 13 bytes/grid file (66 GB) or 9 bytes/grid file (46 GB), all grids can be converted to 81-byte textual form within 10-20 minutes. Similarly the 10 bytes/grid minlex catalog (51 GB) can output all grids in textual form for 20-25 minutes.

Transforming minlex grids to min-templates took me about 2 CPU weeks. Meanwhile I did some improvements and now the timings are uncertain but expected to be between 20 and 50 K grids/second.

Building a catalog from scratch so far is out of the scope. I'm not competing with you in building a catalog from scratch

Template Indexes

Each grid can be represented as ordered sequence of 9 template indexes where first index corresponds to values for 1, second for 2, and so on. A template is one of 46656 valid single-digit placements.
46656^9 gives a upper limit for all possible grids, but we know most of them are invalid and therefore the list of real grids has huge gaps in this representation.

We want to reduce the indexes.

We can order templates so that those 46656/9 = 5184 having value in r1c1 are at the top of the list, followed by next 5184 that occupy r1c2, and so on.

Assuming grid representation re-labeled so that first row is 123456789, reducing the indexes for each position to 5184.

Next, obviously, the last template occupies the remaining cells after placing the first 8 templates, so we don't need it's index. At this point we need 8 indexes, each in range 0-5183.

Next, we can fix first template to a predefined one by permuting rows/cols/bands/stacks and transpose. This fixes template for r1c1 to constant and we don't need to store it. At this point we need 7 indexes, each in range 0-5183.

Since there is more than one possibility for transforming grid so that particular template becomes the predefined one for r1c1, we can add constaint that within all possible transformations, the one that results in minimal sequence of indexes is the chosen grid representation. Minimal means the first index of best representation <= first index of other representation, then same for second, third ... indexes.

Next we can reduce the indexes taking into account that any template must be disjoint to the fixed template for r1c1. At this point we need 7 indexes with range 2097 2097 2097 2396 2396 2097 2396 respectively. There are 2097 templates for r1c2 disjoint to predefined template for rc1, etc.

Next, after composing the catalog of all essentially different grids, we observe that not all templates are used for real canonical grid. It happens that for the chosen template ordering the possibilities for r1c2, r1c3, etc. are reduced respectively to 59 1664 1951 2284 2336 1643 1899.

Up to this point the index reduction is easy for hardcoding.
Record for each grid can be stored as one byte for first index plus 6x2 bytes for next 6 indexes = 13 bytes/record = 66 GB file.
Alternatively by encoding indexes into 2 integers and merging them we can store the same information in a 9 bytes record = 46 GB file. Since the bits within the record are rearranged the neighbour records aren't similar and the file can't be compressed further.
Both representations are good for practical purposes. Each record is self-consistent (catalog is not windowed, we can compose a catalog of any subset of the grids), binary search is possible (record comparison is fast and record size is fixed).

On fixed template for c2 we can further reduce the index range for c3, c4 and so on, taking into account that each next template has to be disjoint from the previous ones.
This increases the processing time both for encoding and decoding. Transforming from 13 bytes/record took about 8 CPU hours.
The global ranges after such reduction are 59 854 416 156 60 72 8.
59 for the c2 is untouched af course.
854 for the c3 is the maximal index within the whole catalog. For at least one of the 59 c2 templates, 854 valid options for c3 exist.

My understanding is that any further index reduction should discard, in particular context, the impossible templates, where impossibility is due to

resulting invalid grid due to non-empy intersection of current template with already fixed templates. This can be calculated on the fly.
invalid grid due to impossibility to place futher templates. We know such exotic cases exist after fixing 4 templates.
valid grid which has duplicate with lesser indexes,

Such reduction can be represented as bitmap for the particular index marking valid items in the parent context.
c2 consists of 6 bitmaps reducing indexes for c3...c8.
c3 within the context of each c2 consists of 5 shorter bitmaps reducing indexes for c4..c8.

Ideally, after such reduction, the record itsefl loses sence because each remaining valid index combination represents a valid grid. (Hmmm, isn't this building a catalog from scratch?)

Whether the bitmaps should account for disjoints (more space but possibly faster) is open issue.
Whether these bitmaps should be compressed, or at least reduced to the latest set bit, is unknown.
How to serialize such structure in a file isn't clear too.

I doubt this will reach any final even of academic interest. So far I am satisfied by providing a catalog requiring 4.32381941 bits/grid having capacity to be uncompressed to performant catalog within, say, 1 day.

by **champagne** » Sun May 12, 2024 7:24 am

Hi mladen,
I did not by far cover all your logic, but I think that this could be an alternative way in my process "after the row 4". Building on the fly still valid templates for the last 5 rows seems a promising idea against the classical cell by cell expansion.
Cheers
Gérard

by **coloin** » Sun May 12, 2024 11:29 am

Before you embark on big stuff... a thought
Consider that there are only 5 ED ways to have clues in this pattern

Code: Select all: +---+---+---+ |123|456|789| |4..|...|...| |5..|...|...| +---+---+---+ |6..|...|...| |7..|...|...| |8..|...|...| +---+---+---+ |9..|...|...| |2..|...|...| |3..|...|...| +---+---+---+

There are 81 instances of this pattern in every grid
but only 5 ED !! -here are their incidence in 100000 grids .. one per grid

Code: Select all: ........1........2........3........4........5........6........7........8123478569 [#1] 4363 ........1........2........3........4........5........6........7........8124378569 [#2] 13315 ........1........2........3........4........5........6........7........8124578369 [#3] 48617 ........1........2........3........4........5........6........7........8127458369 [#4] 11140 ........1........2........3........4........5........6........7........8147258369 [#5] 22565 100000 ........1........2........3........4........5........6........7........81....8.69 constant clues

Taking the most common of these there will be only a few grids which dont have this in one of their 81 :idea:

The number of ways to add the first template [r1c1 clue] is 5184 ways but reduced by equivalencies
Successive templates will be much less too

here are some and their solution counts, if the clue at r1c1 is the template clue the count is x10 less !

Code: Select all: 1234567894..1.....5.....1..7....1...81.......9.......12........3...1....6......1. 24 154412155 1234567894..1.....5.....1..7....1...81.......9.......12...1....3......1.6........ 24 157946797 1234567894..2.....5.....2..7....2...8.2......9......2.2........3.......26...2.... 24 1304115226

im not sure how you would process the "valid grid which has duplicate with lesser indexes" pruning though.

Edit - constant clues added
Edit - 1000/1000 random grids had the #3 occurance of clues, will try with 10000 !

by **champagne** » Sun May 12, 2024 2:36 pm

Hi coloin,

The idea is not to change the key option, a process building the catalog in the min lexical order, but to see if the template view after implementation of rows 1-4 can speed up the end of the process. As noticed mladen, applying all constraints, and mainly the necessity to have disjoints template downstream reduces drastically the number of valid patterns. It seems to me that this is similar to the concept of remaining disjoints UAs recently applied with success in the 17 clues scan search of valid bands 1+2.

But from the idea to something feasible, there is a lot of work :roll:

by **coloin** » Sun May 12, 2024 3:15 pm

champagne wrote: but to see if the template view after implementation of rows 1-4 can speed up the end of the process..

no ...as I understand it Mladen is just applying successive templates and not in row minlex order... more like template minlex order .. if that is a thing.

by **champagne** » Sun May 12, 2024 9:56 pm

coloin wrote:
champagne wrote: but to see if the template view after implementation of rows 1-4 can speed up the end of the process..

no ...as I understand it Mladen is just applying successive templates and not in row minlex order... more like template minlex order .. if that is a thing.

this is what mladen did, some hours later, I am more thinking to see if the template view can help for the last band where the first column is locked.
This gives at most 2x3 possible disjoint 3templates locked for other digits. More may be to-morrow

by **coloin** » Mon May 13, 2024 5:10 pm

Good if you can make it work for you , if it speeds up things even more.
I think Mladen's process is good, he has harnessed the reductions with a fixed 1 template, and there are only 7 [ as opposed to 8 ] ways for the remaining templates.
This 1 template clues are not obligate clues in minlex solution grids.... and adding more [constant] digits negates the advantage of having ony one option for template 1
But "taking into account that each next template has to be disjoint from the previous ones." representing dynamic reduction is excellent. [59 854 416 156 60 72 8]

by **dobrichev** » Tue May 14, 2024 2:32 pm

Bad news is that w/o additional large lookups the canonicalization is 10 times slower than my first measuring. I edited initial post, changing the value to 2-5 K grids/sec.
Using Gerard's approach for catalog generation / extraction / indexing would take forever in its brute force part.

Now I am estimating the data volume required for providing additional hints.

by **champagne** » Thu May 16, 2024 6:34 am

As I wrote in another thread, I am interested to know if this has been applied to each gangster to know what ED solution bands are valid for each of them.
If not, I intend to open a separate thread to study this

The New Sudoku Players' Forum

Min-templates Canonical Form and Catalog

Min-templates Canonical Form and Catalog

Re: Min-templates Canonical Form and Catalog

Re: Min-templates Canonical Form and Catalog

Re: Min-templates Canonical Form and Catalog

Re: Min-templates Canonical Form and Catalog

Re: Min-templates Canonical Form and Catalog

Re: Min-templates Canonical Form and Catalog

Re: Min-templates Canonical Form and Catalog

Re: Min-templates Canonical Form and Catalog

Re: Min-templates Canonical Form and Catalog

Re: Min-templates Canonical Form and Catalog