High density files for solution grids and 18 clues puzzles

Programs which generate, solve, and analyze Sudoku puzzles


Postby champagne » Wed Mar 27, 2024 1:58 pm

We know that we have 5 472 730 538 ED solution grids, and for the ED 18-clue puzzles, blue wrote long ago:
... my estimate of 1,934,000,000 18C puzzles


Exhaustive files in text mode for all or part of these two sets lead to huge files, even with standard high-ratio compression.
JPF recently proposed a compressed file of 102 million puzzles, about 1 GB in size. This is about 5% of the expected ED file for the 18C puzzles.

As far as I remember, several other ways to describe these huge sets have been discussed. Mathimagics surely had a specific approach, flagging ED solution grids with the minimum number of clues needed.

Is there a description of such approaches somewhere?
champagne
2017 Supporter
 
Posts: 7455
Joined: 02 August 2007
Location: France Brittany

Re: High density files for solution grids and 18 clues puzzl

Postby eleven » Wed Mar 27, 2024 11:26 pm

Hm, I only found that Mathimagics had a look at gsf's compression (about one byte per grid), see e.g. here.
eleven
 
Posts: 3151
Joined: 10 February 2008

Re: High density files for solution grids and 18 clues puzzl

Postby champagne » Thu Mar 28, 2024 8:27 am

eleven wrote:Hm, I only found that Mathimagics had a look at gsf's compression (about one byte per grid), see e.g. here.

Hi eleven,

Thanks for this first link.
I read something long ago about this gsf compression, but also that some of our friends had built a kind of "direct access" to each piece of the catalog.

Unfortunately, gsf's web pages are lost, but this is likely a compression mode optimized for the catalog.
Regarding the 18-clue puzzles, the file given by JPF has a high compression rate using the standard .rar compression. The .zip compression is already highly efficient (from 8.275 to 1.453). This is likely enough to share partial files.

On my side, I had a look at stack 1 in a catalog written in the classical minimal lexical mode.

Box 1 is one of
Code: Select all
123
45a
bcd 


where abcd can be one of these 10 values: 6789 6798 6879 6897 7689 7698 7869 7968 7896 7986.
The possible stacks are defined in "gangster mode" as soon as the patterns of mini-columns 1 and 2 in box 2 are known.

There are 56 possibilities in the general case, but as we have r4c1=2, this is reduced here to 28.
And for each of the 28:
. column 1 has one solution;
. each of mini-columns 2 and 3 in boxes 2 and 3 has 6 possibilities.

This gives, for each of the 10 valid patterns in box 1, 28 x 6**4 = 28 x 1296 = 36288 possibilities, easy to spot with a short piece of code.
And in total, this gives 416 x 36288 = 15 095 808 possible entries for a direct-access organization.
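The counting above is easy to check with a couple of lines; the layout facts (28 gangsters once r4c1=2 is fixed, 6 choices per free mini-column, 416 ED bands) are taken from the post, not re-derived here:

```python
# Counts quoted in the post, taken as given (not re-derived here).
GANGSTERS = 28        # box-2 mini-column patterns once r4c1 = 2 is fixed
MINICOL_CHOICES = 6   # permutations of each remaining mini-column
ED_BANDS = 416        # essentially different minlex bands

per_band = GANGSTERS * MINICOL_CHOICES ** 4   # four free mini-columns
print(per_band)                               # -> 36288
print(ED_BANDS * per_band)                    # -> 15095808
```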

For sure, not all these entries are valid. For example, the stack ID must be >= the first band ID; if not, we have a lower morph of the solution grid using stacks instead of bands.

This is likely not new, but a promising way to build a direct access.
champagne
2017 Supporter
 
Posts: 7455
Joined: 02 August 2007
Location: France Brittany

Re: High density files for solution grids and 18 clues puzzl

Postby JPF » Thu Mar 28, 2024 7:08 pm

FWIW, see gsf's post here.

I have no knowledge of file compression optimization techniques but one can improve the compression of my 102M 18C file by using the MinLex representation for each puzzle.
Here are the results in bytes:
Code: Select all
initial file            8.47 G
compressed rar file     1.16 G

MinLex File             8.47 G
compressed rar file     0.62 G

Note that my MinLex file was determined by the gsf/MD program and is therefore not completely "minlex" due to the well-known bug.

JPF
JPF
2017 Supporter
 
Posts: 6139
Joined: 06 December 2005
Location: Paris, France

Re: High density files for solution grids and 18 clues puzzl

Postby dobrichev » Thu Mar 28, 2024 7:27 pm

Hi Champagne,

A 10 bytes/grid implementation is here.
You need a 50GB lookup file if you want to extract a grid by its number. gridchecker also has functionality to get a range of grids by index.

There is one free bit in the grid catalogue which can be used to store one bit of the 81-bit uncompressed bitmask of the givens. In total 20 bytes/puzzle, w/o the 50GB catalogue, and each record is self-consistent.
Using the grid catalogue, you can store the grid index (33 bits) plus a somehow-compressed givens bitmap.

On the other hand you can ignore the solution and represent the 18 givens in base-9 format (< 56 bits = 7 bytes assuming min-lex relabelling; < 10 bytes for the solution), again plus the pattern.
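One concrete way to realize the relabelled base-9 figure is a mixed-radix packing: after min-lex relabelling, the value in slot i can be at most i+1 (and at most 9), so a fixed radix of min(9, i+1) per slot fits all 18 values in 47 bits, comfortably inside the 7-byte budget. This construction is my sketch, not dobrichev's code:

```python
from math import prod

def relabel(values):
    """Min-lex relabelling: digits renamed 1,2,3,... in order of first appearance."""
    seen = {}
    return [seen.setdefault(v, len(seen) + 1) for v in values]

# After relabelling, the value in slot i is at most min(9, i + 1),
# so a fixed mixed radix covers every possible 18-value sequence.
RADIX = [min(9, i + 1) for i in range(18)]

def pack(vals):
    """Fold the 18 relabelled values into one integer (fits in 47 bits)."""
    code = 0
    for r, v in zip(RADIX, vals):
        code = code * r + (v - 1)
    return code

def unpack(code):
    """Inverse of pack: peel the digits off in reverse radix order."""
    out = []
    for r in reversed(RADIX):
        code, d = divmod(code, r)
        out.append(d + 1)
    return out[::-1]
```

The capacity is prod(RADIX) = 9! x 9^9, about 1.4e14, which is just below 2^47.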

The pattern could be compressed by storing, for each given, the distance to the previous one, encoded in a variable number of bits chosen from the statistics. If the distance can be encoded in less than 4.5 bits on average (81 bits uncompressed / 18 = 4.5), then pattern compression wins.
Assuming the pattern is min-lex, several positions at the beginning are guaranteed empty and could be excluded from the encoding, both uncompressed and compressed.
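As an illustration of this distance idea, here is a sketch using Elias gamma codes for the gaps; gamma is my arbitrary stand-in for the statistics-tuned code the post suggests:

```python
# Gap encoding of an 18-given pattern: store each given's distance to the
# previous given in a variable-length (Elias gamma) code.

def gamma_encode(n):
    """Elias gamma code of a positive integer, as a '0'/'1' string."""
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits, pos):
    """Decode one gamma-coded integer at `pos`; return (value, next position)."""
    z = 0
    while bits[pos + z] == "0":
        z += 1
    return int(bits[pos + z:pos + 2 * z + 1], 2), pos + 2 * z + 1

def encode_pattern(cells):
    """cells: increasing 0-based positions of the 18 givens among 81 cells."""
    prev, out = -1, []
    for c in cells:
        out.append(gamma_encode(c - prev))   # gaps are >= 1, so gamma applies
        prev = c
    return "".join(out)

def decode_pattern(bits):
    cells, pos, prev = [], 0, -1
    while pos < len(bits):
        gap, pos = gamma_decode(bits, pos)
        prev += gap
        cells.append(prev)
    return cells
```

With an average gap of 81/18 = 4.5, plain gamma costs about 5 bits per gap, slightly above the 4.5-bit break-even mentioned above; a code fitted to the real gap statistics would be needed to win.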

Note that applying such acrobatics would result in a file that isn't well compressible by standard utilities like ZIP.
dobrichev
2016 Supporter
 
Posts: 1863
Joined: 24 May 2010

Re: High density files for solution grids and 18 clues puzzl

Postby champagne » Fri Mar 29, 2024 7:47 am

JPF wrote:I have no knowledge of file compression optimization techniques but one can improve the compression of my 102M 18C file by using the MinLex representation for each puzzle.
Here are the results in bytes:
Code: Select all
initial file            8.47 G
compressed rar file     1.16 G

MinLex File             8.47 G
compressed rar file     0.62 G


JPF


Amazing compression ratio here; the standard tools would be very hard to beat.
champagne
2017 Supporter
 
Posts: 7455
Joined: 02 August 2007
Location: France Brittany

Re: High density files for solution grids and 18 clues puzzl

Postby champagne » Fri Mar 29, 2024 8:22 am

dobrichev wrote:Hi Champagne,

A 10 bytes/grid implementation is here.
You need a 50GB lookup file if you want to extract a grid by its number.
In total 20 bytes/puzzle, w/o 50GB catalogue, and each record is self-consistent.


Hi mladen,
Thanks for this example.

I made a quick variant of your work in line with the band1/stack1 start described above.

My best process in this case seems to be:

See which of the band1/stack1 patterns it is (15 095 808 patterns), then:

Code: Select all
fill row 4 (6 remaining cells)
fill column 4 (5 remaining cells)
fill box 5 (4 remaining cells)
in any order for the 3 remaining cells, fill rows 4,5,6 and columns 4,5,6
fill row 7 (3 remaining cells)
fill column 7 (2 remaining cells)
fill row 8 (2 remaining cells)
last cell in row 9


If I am right, this is 2 321 901 158 400 theoretical patterns.

If we use 2 x 64-bit numbers to qualify the pattern in this way, we need 34+42 = 76 bits.
I did not check, but your compression is likely around 76 bits if you replace the small patterns by a bigger one.

If the catalog is made of 2 files, one index file plus one binary file for the rest, the bit size of the index can be skipped in the binary file.

One option could be to use an index made of the band 1 index plus the stack1 box2 gangster (416 * 28 = 11648 index items).
Then the remaining pattern is one of 3 009 183 901 286 400, needing 52 bits in binary mode (one 64-bit number).

The files can be produced using a catalog builder, likely the best way to do it.
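The binary widths of the counts in this post can be checked with Python's int.bit_length(); note that an index below 15 095 808 already fits in 24 bits:

```python
# Bits needed to index each of the counts quoted above.
for label, count in [
    ("band1/stack1 entries", 15_095_808),
    ("theoretical patterns", 2_321_901_158_400),
    ("patterns behind the 11648-item index", 3_009_183_901_286_400),
]:
    print(label, (count - 1).bit_length())
# -> 24, 42 and 52 bits respectively
```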
champagne
2017 Supporter
 
Posts: 7455
Joined: 02 August 2007
Location: France Brittany

Re: High density files for solution grids and 18 clues puzzl

Postby m_b_metcalf » Sun Mar 31, 2024 9:20 am

champagne wrote:Unhappily, the web pages of gsf are lost,

champagne,
If you're referring to the long-defunct Programmers' Forum, a reminder that it was archived here. Unfortunately, it is non-searchable, so some patience is required to find a given thread, but gsf's posts should be there somewhere.

Regards,

Mike
m_b_metcalf
2017 Supporter
 
Posts: 13624
Joined: 15 May 2006
Location: Berlin

Re: High density files for solution grids and 18 clues puzzl

Postby champagne » Sun Mar 31, 2024 2:06 pm

m_b_metcalf wrote:
champagne wrote:Unhappily, the web pages of gsf are lost,

champagne,
If you're referring to the long-defunct Programmers' Forum, a reminder that it was archived here. Unfortunately, it is non-searchable, so some patience is required to find a given thread, but gsf's posts should be there somewhere.

Regards,

Mike


Hi Mike,

In fact I tried, following Mathimagics' post, to reach this web page:

http://gsf.cococlyde.org/download/sudoku
champagne
2017 Supporter
 
Posts: 7455
Joined: 02 August 2007
Location: France Brittany

Re: High density files for solution grids and 18 clues puzzl

Postby JPF » Sun Mar 31, 2024 4:53 pm

In addition to the link I provided previously, it may be interesting to read this comment provided in the user manual of the gsf program.
gsf wrote:COMPRESSED GRID FORMAT
The sudz format efficiently compresses catalogs of row order minlex
canonical solution grids. Grids are organized by the top band (top 3
rows). There are 416 essentially different minlex bands and 5472730538
essentially different grids. A byproduct of minlex ordering is that
earlier bands tend to account for more grids than later bands. For
example, band 001 contains 1007170 grids, band 006 (the largest)
contains 96229042 grids, and bands 395,397,398,400,402,403,404,406,
408,409,410,412,413,414,415 contain no grids.

The sudz format is a sequence of windows. Each window contains the number
of grids and initial band index. Each grid has the band index (if
different from the previous grid), the number of automorphisms (if > 1),
the number of cells that differ from the previous grid, and the list of
differing cell values encoded using a basic singles constraint solver.
The windows are compressed using the Burrows-Wheeler bzip library.

The entire catalog of 5472730538 essentially different grids, in minlex
order, compresses to 5.70GiB, an average 8.96 bits/grid. Uncompress rate
is ~100K grids/sec/Ghz, or ~5 hours minimum to stream through the entire
catalog on a 2.8Ghz processor.

Note that this only deals with compressing the grid catalog and does not directly concern a puzzle package.
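The 8.96 bits/grid average in the quoted text can be roughly reproduced from the two figures it gives; the small gap is presumably rounding of the 5.70GiB figure:

```python
# Average bits per grid implied by the two figures in the quoted manual text.
GRIDS = 5_472_730_538           # essentially different solution grids
SIZE_BITS = 5.70 * 2**30 * 8    # 5.70 GiB expressed in bits

print(round(SIZE_BITS / GRIDS, 2))   # -> 8.95, vs the quoted 8.96
```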

JPF
JPF
2017 Supporter
 
Posts: 6139
Joined: 06 December 2005
Location: Paris, France

Re: High density files for solution grids and 18 clues puzzl

Postby rjamil » Sun Mar 31, 2024 6:41 pm

Hi champagne,

champagne wrote:In fact I tried, following Mathimagics' post to reach this web page.

http://gsf.cococlyde.org/download/sudoku

Download the gsf's software from here.

R. Jamil
rjamil
 
Posts: 774
Joined: 15 October 2014
Location: Karachi, Pakistan

Re: High density files for solution grids and 18 clues puzzl

Postby coloin » Mon Apr 01, 2024 4:20 pm

JPF wrote:Note that this only deals with compressing the grid catalog and does not directly concern a puzzle package.


Yes, there are indeed two issues:

minlexing the puzzles [rapidly, by software here]
This works on a clean text file, doesn't have the bug, and is able to reorder on output.

Generating the grid catalogue is done admirably by gsf, but if we are only concerned with the whole unordered file, we could possibly compress it further.
One idea might be to use the fact that the minlex grid solutions have clues in common:
Code: Select all
123456789456.89123789123465215937648634218597897645231378564912562391874941872356    random minlex grid solution                     
12345678945................2.....................................................    common clues in minlex solution grid             
12345678945................2....7.4.....1......7...23......4.1....3..8....187.356    clues removed if possible moving along the string   
................................7.4.....1......7...23......4.1....3..8....187.356    16C 

So each grid can be reduced to around 16 clues beyond the common ones.
And maybe more ordering reductions can be achieved without the need for an initial band tag.
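The "remove clues if possible, moving along the string" step can be sketched as below. I check removability with a full uniqueness count (a backtracking solver counting up to two solutions), which may differ from the exact test used above, and the common minlex clues are not treated specially, so this is an approximation of the scheme:

```python
# Greedy clue removal: blank each clue in string order and keep it blank only
# while the puzzle still has a unique solution.

def _peers(i):
    r, c = divmod(i, 9)
    br, bc = r - r % 3, c - c % 3
    ps = {r * 9 + k for k in range(9)} | {k * 9 + c for k in range(9)}
    ps |= {(br + dr) * 9 + bc + dc for dr in range(3) for dc in range(3)}
    ps.discard(i)
    return sorted(ps)

PEERS = [_peers(i) for i in range(81)]

def count_solutions(puzzle, limit=2):
    """Number of completions of an 81-char puzzle ('.' = empty), capped at limit."""
    cells = list(puzzle)

    def search():
        best_i = best_cands = None
        for i in range(81):
            if cells[i] != ".":
                continue
            used = {cells[j] for j in PEERS[i]}
            cands = [d for d in "123456789" if d not in used]
            if not cands:
                return 0                      # dead end
            if best_cands is None or len(cands) < len(best_cands):
                best_i, best_cands = i, cands
                if len(cands) == 1:
                    break                     # forced cell: place it at once
        if best_i is None:
            return 1                          # grid complete
        n = 0
        for d in best_cands:
            cells[best_i] = d
            n += search()
            cells[best_i] = "."
            if n >= limit:
                break
        return n

    return search()

def reduce_clues(solution):
    """Blank clues left to right while the solution stays unique."""
    puzzle = list(solution)
    for i in range(81):
        puzzle[i] = "."
        if count_solutions("".join(puzzle)) != 1:
            puzzle[i] = solution[i]           # clue was necessary: restore it
    return "".join(puzzle)
```

The result is minimal (every surviving clue necessary): a clue kept because its removal broke uniqueness stays necessary as further clues are removed.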
coloin
 
Posts: 2494
Joined: 05 May 2005
Location: Devon

Re: High density files for solution grids and 18 clues puzzl

Postby champagne » Tue Apr 02, 2024 6:09 am

coloin wrote:
JPF wrote:Note that this only deals with compressing the grid catalog and does not directly concern a puzzle package.


Yes, there are indeed two issues:

minlexing the puzzles [rapidly, by software here]
This works on a clean text file, doesn't have the bug, and is able to reorder on output.

Generating the grid catalogue is done admirably by gsf, but if we are only concerned with the whole unordered file, we could possibly compress it further.
One idea might be to use the fact that the minlex grid solutions have clues in common:
Code: Select all
123456789456.89123789123465215937648634218597897645231378564912562391874941872356    random minlex grid solution                     
12345678945................2.....................................................    common clues in minlex solution grid             
12345678945................2....7.4.....1......7...23......4.1....3..8....187.356    clues removed if possible moving along the string   
................................7.4.....1......7...23......4.1....3..8....187.356    16C 

So each grid can be reduced to around 16 clues beyond the common ones.
And maybe more ordering reductions can be achieved without the need for an initial band tag.


Hi coloin,

7r2c4 is missing in your minlex grid

I did not catch how you build your 16-clue list; can you tell more?
The band ID 1-416 immediately forces 28 clues, and likely more for each of the 416 starts. Is that your view?

Some stats from an existing version of the catalog (I have different ones in the 17-clue scan):

A catalog has around 600 million pairs for the first 2 bands.
This gives about 9 solution grids on average per (band 1, band 2) pair.

But:
most of the solution grids are attached to the first part of the 416;
the number of bands 3 attached to a pair (band1, band2) goes down sharply as the 1-416 index grows;
the number of bands 3 attached can exceed 256, so the mean can be somewhat misleading.

Surely, for a high 1-416 index, the number of common clues in the small number of solution grids attached can be high.
An interesting point to consider anyway, but I first need a catalog builder producing solution grids strictly in minlex order to see what this can give.

champagne
2017 Supporter
 
Posts: 7455
Joined: 02 August 2007
Location: France Brittany

Re: High density files for solution grids and 18 clues puzzl

Postby coloin » Wed Apr 03, 2024 5:55 pm

I think the discussion has gone along the lines of this one on page one, and some of the comments are very relevant.

Interested, I decided to see if gsf's solver could generate the minlex grids, and it can quite easily with the command
Code: Select all
sudoku-64 -gb1 > band1.txt
though it seems
Code: Select all
sudoku-64 -gb0  > allbands.txt
would not be recommended, as it proceeds to generate them all!

Getting the program to compress the files, as Mathimagics found out, didn't work with the newer OS, though it did work before.

I removed clues by attempting to delete redundant ones until all the clues left were necessary to define the solution.
I don't think you can utilize any more than 28 clues with a band.
If you are relying on the constant 11 clues in band 1, some band 1s probably have U4s in r2 and r3, which will mean a clue is needed in band 1, and the string will be longer than the 54 digits in bands 2 and 3.

Maybe a better way to store the grids in smaller files could be postulated, as the band files generated above are too big.

JPF's pattern uses a 3-box corner B124 [56676 ED ways] lookup file, which would combine with another 3-box corner, B689. Only 24 clues of B689 need to be filed/compressed, advantageously!
eg
Code: Select all
+---+---+---+
|123|456|...|
|456|789|...|
|789|123|...|
+---+---+---+
|214|...|897|
|365|...|214|
|897|...|365|
+---+---+---+
|...|642|978|
|...|978|531|
|...|531|642|
+---+---+---+

The ordering of the first corner may well be the same as the band ordering as per gsf.
coloin
 
Posts: 2494
Joined: 05 May 2005
Location: Devon

Re: High density files for solution grids and 18 clues puzzl

Postby champagne » Thu Apr 04, 2024 7:32 am

Hi coloin,
coloin wrote:I think the discussion has gone along the lines of this one on page one and some of the comments very relevant.

Thanks for this old link. Here the goal is clear, and most of that discussion is out of scope, but similar ideas are applied, as in mladen's compressor.

Thinking more about the goal, I feel that the stack 1 pattern is not a good way. The "compression" should strictly follow the sorting sequence of the minimal lexical form of the solution grids.

And as you write, an uncompressed exhaustive file is a very big object, but using a catalog builder in the right sequence works well; I am revising my code to have such a builder with efficient code.

I am thinking of a primary index holding the first 4 rows. The maximum size for this index is 416 * 8! = 16 773 120, but with the total number of (band1, band2) pairs around 600 million, the number of valid 4-row prefixes must be much lower.

Linked to this primary index, we are left with a file describing rows 5 to 8. Without optimization, each row is one of 9! = 362880 possible permutations of the 9 digits, which gives a maximum of 24 digits in text mode (four 6-digit ranks) and 4 x 19 = 76 bits in binary mode (a rank below 362880 needs 19 bits, since 2^16 = 65536 is too small).
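Storing each of rows 5-8 as its rank among the 9! permutations can be done with a Lehmer code; a sketch of one such rank/unrank pair (my construction, offered as an illustration of the idea above):

```python
from math import factorial

DIGITS = "123456789"

def rank_row(row):
    """Lehmer-code rank (0 .. 9!-1) of a row that permutes '123456789'."""
    left = list(DIGITS)
    r = 0
    for ch in row:
        i = left.index(ch)
        r += i * factorial(len(left) - 1)   # i smaller digits still unused
        left.pop(i)
    return r

def unrank_row(r):
    """Inverse of rank_row: rebuild the row from its rank."""
    left = list(DIGITS)
    out = []
    for k in range(8, -1, -1):
        i, r = divmod(r, factorial(k))
        out.append(left.pop(i))
    return "".join(out)
```

For example, rank_row("123456789") is 0 and rank_row("987654321") is 362879, whose bit_length is 19, hence the 76 bits for four rows.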
champagne
2017 Supporter
 
Posts: 7455
Joined: 02 August 2007
Location: France Brittany
