High density files for solution grids and 18-clue puzzles

Programs which generate, solve, and analyze Sudoku puzzles


Postby champagne » Sun Apr 21, 2024 5:50 pm

coloin wrote:Pretty good work!
On the generating side [before the pruning] it's only necessary to use 44 gangster bands [rather than 416].
This of course doesn't reduce the grids to be pruned by minlex gridding.


Use of the gangsters is more in line with my opening statement to start the expansion using band1/stack1.
This is likely good to have the quickest catalog builder, but unless I am missing a point, it does not deliver a catalog directly in minlex order.

In fact, as producing the catalog is a one-shot problem, having the best code is not essential.

I am not sure that there are clients for a direct link
solution grid <==> 5e9 rank

but if there are, then we need either a huge lookup file, or an efficient catalog builder in the right order as basis.
champagne
2017 Supporter
 
Posts: 7454
Joined: 02 August 2007
Location: France Brittany


Postby champagne » Wed Apr 24, 2024 7:20 am

The builder of the catalog in the min lexical order produced the NED = 5 472 730 538 solution grids matching mladen's index.
The builder now acts as a virtual storage of the catalog.

The builder can start from any of these existing starts of the solution grids:

416 bands 1
9992 band 1 plus mini-row 1 in box 4
652408 band 1 plus row 4

the code (dirty and partial so far) is available here

The first test has been done using the band 1 table of starts. The next step will be to use band 1 plus row 4 as start, to see how long it takes to cover the entire field.
The first run suggests some hundreds of milliseconds per {band1, row4} to find the chunk of solution grids.

The final target remains to have quickly the link {solution grid <==> NED rank}.
In the end a DLL should be created, something I have never done.

The final size of this virtual storage of the catalog should be around 4-5 MB


Postby champagne » Thu May 09, 2024 8:40 am

The draft is now ready.

The core of the package is clearly the catalog builder in the min lexical order of the solution grids.

Several entries are available.

For direct access:
give the rank 1-5472730538, get the solution grid
give the solution grid, get the rank (with an option to say whether it is known to be minlex)
if the solution grid is not minlex, then the minlex morph is built

This is the part that should be proposed in a DLL. The response is expected to take on average some tens of milliseconds.

Also available are:
the possibility to print part or all of the table of the rows 4 with the index 0-415 of band 1
and some pieces of the catalog for a given range of the rows 4

Some tests and some cleaning are still to come, but the repository has the current status of the code.
Note: the code has not yet been checked with gcc compilation

Below is the bottom part of the rows catalog.
Fields: first 4 rows; band 1 index (0-415); row 4 start (0-652407); solution grid rank start; number of solution grids attached

123456789457389612896172354285793146;400;652390;5472730510;1
123456789457389612896271354281537946;404;652391;5472730511;1
123456789457389612896271354285137946;404;652392;5472730512;1
123456789457389612896271354285713946;404;652393;5472730513;1
123456789457389612896271354289564173;404;652394;5472730514;1
123456789457389612896721354231564978;406;652395;5472730515;4
123456789457389612896721354231578946;406;652396;5472730519;1
123456789457389612896721354231645978;406;652397;5472730520;7
123456789457389612896721354231978546;406;652398;5472730527;1
123456789457389612896721354234178965;406;652399;5472730528;1
123456789457389612896721354234817965;406;652400;5472730529;1
123456789457389612896721354235178946;406;652401;5472730530;1
123456789457389612896721354235817946;406;652402;5472730531;1
123456789457389612896721354235964178;406;652403;5472730532;1
123456789457389612896721354239564178;406;652404;5472730533;1
123456789457389621896217354268174593;410;652405;5472730534;2
123456789457389621896217354268741593;410;652406;5472730536;1
123456789457893612986217354274538196;415;652407;5472730537;1


and the bottom part of the catalog with the ranks (done in 86 ms using the rows 4 entries)

123456789457389612896172354285793146631524978749618523312945867568237491974861235;5472730511
123456789457389612896271354281537946645928173739164528312645897568792431974813265;5472730512
123456789457389612896271354285137946641928573739564128312645897568792431974813265;5472730513
123456789457389612896271354285713946641892573739645128312564897568927431974138265;5472730514
123456789457389612896271354289564173641937528735128946312645897568792431974813265;5472730515
123456789457389612896721354231564978649178523785932146312645897568297431974813265;5472730516
123456789457389612896721354231564978649178523785932146314297865562813497978645231;5472730517
123456789457389612896721354231564978649178523785932146314897265568213497972645831;5472730518
123456789457389612896721354231564978649178523785932146318645297562897431974213865;5472730519
123456789457389612896721354231578946645932178789164523312645897568297431974813265;5472730520
123456789457389612896721354231645978649817523785293146312564897568972431974138265;5472730521
123456789457389612896721354231645978649817523785293146312974865564138297978562431;5472730522
123456789457389612896721354231645978649817523785293146312978465568134297974562831;5472730523
123456789457389612896721354231645978649817523785293146314568297562974831978132465;5472730524
123456789457389612896721354231645978649817523785293146314972865562138497978564231;5472730525
123456789457389612896721354231645978649817523785293146314978265568132497972564831;5472730526
123456789457389612896721354231645978649817523785293146318564297562978431974132865;5472730527
123456789457389612896721354231978546649532178785164923312645897568297431974813265;5472730528
123456789457389612896721354234178965561932478789564123315297846642813597978645231;5472730529
123456789457389612896721354234817965561293478789645123315972846642138597978564231;5472730530
123456789457389612896721354235178946641932578789564123312645897568297431974813265;5472730531
123456789457389612896721354235817946641293578789645123312564897568972431974138265;5472730532
123456789457389612896721354235964178641578923789132546312645897568297431974813265;5472730533
123456789457389612896721354239564178641978523785132946312645897568297431974813265;5472730534
123456789457389621896217354268174593745938162931562847382641975574893216619725438;5472730535
123456789457389621896217354268174593745938162931562847384725916579641238612893475;5472730536
123456789457389621896217354268741593745893162931625847382164975574938216619572438;5472730537
123456789457893612986217354274538196531964827698721435342685971715349268869172543;5472730538


EDIT: the .exe size with Microsoft Visual Studio is 3.5 MB, so this is in effect a virtual catalog in min lexical order of that size.


Postby dobrichev » Thu May 09, 2024 10:48 am

Hi Champagne,

Congratulations on the achievement!
To my knowledge this will be the first independent verification of gsf's catalogue.

I am waiting for timings for the entire catalog generation.

Cheers,
MD
dobrichev
2016 Supporter
 
Posts: 1863
Joined: 24 May 2010


Postby champagne » Thu May 09, 2024 2:04 pm

Hi mladen,

dobrichev wrote:To my knowledge this will be the first independent verification of the gsf's catalogue.

This is not quite true, but the process was hidden in the search for the 17 clues.
In the last version of this scan, we have 3 builders of the catalog, each one with its own canonical form, not min lexical.
One basic test has been to check that we get the 5472730538 ED solution grids.


dobrichev wrote:I am waiting for timings for entire catalog generation.


I started the process on one of my workers. It will be an enumeration process, but that is not so important for the full generation. The process based on the rows 4 is expected to be very close to generation from scratch for the first bands, but much faster later, as shown in the example above.


Cheers
Gérard


Postby dobrichev » Thu May 09, 2024 7:27 pm

Regarding puzzle collection compression, I found a post from gsf from Nov 2006
gsf wrote:...

assume an 82 byte text puzzle form [0-9]*81 + newline for each puzzle
and allow related puzzles to appear near to each other
e.g., solvers that generate minimal puzzles from a given puzzle
typically list the puzzle in some order where only a few clues / empty cells
differ from line to line

a compressor that compresses windows of lines column-wise with move-to-front
run length encoding with a huffman back end compresses my year old
225M puzzle catalog from 18,453,487,706 bytes down to 252,266,884 bytes,
or 1.121 bytes per puzzle


Postby champagne » Fri May 10, 2024 9:27 pm

I made small changes in the code:

to bring the code in line with gcc requirements
to add an independent count check in the full generation using the rows 4

The full generation took a little more than 25 hours on one core of an i7-4770 @ 3.4 GHz, a relatively old computer (5 years??)
This gives an average of 60400 solution grids per second

================================

Regarding gsf's 1.121 bytes per puzzle: as explained in his comments, this is for a very special chunk of puzzles.
It remains that the standard compression tools perform very well on files of puzzles or files of solution grids.
Unfortunately, such files are not easy to use, which is why I opened this thread.

=================================

I still have many tests to run on this draft, but from here I see one way to build a database of puzzles with 18 clues, the other topic of this thread.

Given an 18-clue puzzle, it can be
first solved
then morphed, together with its solution grid, into the min lexical morph of the solution grid
then expressed as the rank of the solution grid plus an 81-bit field

With such a canonical form it is easy to find redundant puzzles, and the files can easily be packed, either as binary files or using the standard compression tools.


Postby dobrichev » Sat May 11, 2024 8:44 am

Gerard, building the catalog from scratch in one day is impressive!
Congratulations!
With your instrumentation, soon we will fly to the stars.


Postby coloin » Sat May 11, 2024 7:35 pm

champagne wrote:.......I see one way to build a data base of puzzles with 18 clues, the other topic for this thread.

wow ... I hadn't thought of that one ... indeed
well done on getting so far with the first task !!
coloin
 
Posts: 2494
Joined: 05 May 2005
Location: Devon


Postby dobrichev » Wed May 15, 2024 1:41 pm

To get the grid catalog index for any grid using a catalog file, I experimented with these few lines of code
Code:
#include <algorithm>  // std::lower_bound
#include <cstdio>
#include <cstdlib>
#include <sys/mman.h> // mmap, POSIX memory-mapped I/O

void getGridIndexes() {
   // Memory-mapped I/O  https://www.gnu.org/software/libc/manual/html_node/Memory_002dmapped-I_002fO.html
   FILE *cat_file = fopen("my_binary_catalog_file", "r"); // 325 grids/sec for 46300 grids, peak 9 GB RAM. 525 grids/sec for the second run, 4600 grids/sec for the third run = time for canonicalization
   const size_t numGrids = 5472730538ULL;
   // an array of catalog items, in this case a 9-byte structure encoding 7 template indexes
   colIndexes9* cat9 = (colIndexes9*)mmap(NULL, numGrids * sizeof(colIndexes9), PROT_READ, MAP_PRIVATE, fileno(cat_file), 0);
   fclose(cat_file); // the mapping remains valid after the descriptor is closed
   if(MAP_FAILED == cat9) {
      fprintf(stderr, "Error in memory mapping file\n");
      exit(1);
   }
   char buf[1000]; // source grid is read here
   int ind[7]; // the 7 template indexes in raw notation
   while(fgets(buf, sizeof(buf), stdin)) { // read grid text representation
      templCanon::templateMinLex(buf, ind); // canonicalize the grid, in this case storing the 7 template indexes in ind
      colIndexes ci(ind); // compact raw indexes into an intermediate 13-byte structure
      colIndexes9 ci9(ci); // compact further to a 9-byte structure which is the catalog record to search for
      auto res = std::lower_bound(&cat9[0], &cat9[numGrids], ci9); // perform binary search into the catalog; returns iterator to the record found
      printf("%ld\n", (long)(res - cat9)); // convert the iterator to a zero-based grid index, and output
   }
}

For the test case I used 46300 grids, represented and ordered in minlex form. For the min-templates form this is quite a random input.
For the first grids, the performance is on the order of 10-20 grids/sec; it later improves as cached record expansions accumulate.
For the whole set, about 9 GB of RAM was consumed (which should be automatically discarded when the OS detects a low-memory condition).
The reading and discarding are done by "pages", optimized by the OS.
The observed read rate is 20 MB/s from a mechanical hard disk that allows 80-100 MB/s sequential read, and 100 MB/s from an ancient SSD drive.
The test case execution times are:
4m28s first run from HDD, 170 grids/s
2m22s first run from SSD, 325 grids/s
1m28s second run from SSD, 525 grids/s
11s third and later runs from SSD, 4600 grids/s, which is de facto the canonicalization speed without the binary search.

For the binary search to work with this code, you have to provide a comparison function which returns whether the first of the two record items is less than the second.
For the min-templates catalog, the function expands the 9-byte records to 13-byte ones and compares the 7 column indexes.
In the worst case (for any catalog), this function can expand the catalog records to grids in text representation and compare the grids according to the respective catalog's rules.

As mentioned earlier, this approach makes sense for a small set of random grids.
For large sets, a more performant approach would be a full catalog scan after the grids are represented and ordered in their canonical form.

For puzzles, the canonicalization should be able to transform a puzzle into its canonical grid, taking into account the additional equivalence rules for automorphic grids.


Postby champagne » Thu May 16, 2024 6:29 am

Hi mladen,

I have no doubt that we have many alternative ways to get the {rank <==> solution grid} link on small chunks of the catalog. Using the row 4 indexes, we get a piece of catalog with fewer than 70000 solution grids sharing the same four rows.
A lookup on such a chunk is not too hard to build.
The code proposed here gives the {rank <==> solution grid} link with a virtual catalog, and this could be of interest for work requiring a quick direct link.

But your template approach seems very interesting for improving my process.
Each band 3 in the catalog building process is in one of the 44 gangster classes. On the fly, or using tables with a morphing process, it is possible to produce the potential bands 3 using the templates.

I am exploring this. I don't know whether the work has already been done. A separate thread would be worthwhile to register the results of the process applied to the 44 gangsters, given here in min lexical canonical morph:

Code:
"123456789123456789123456789","123456789123456789123457689",
"123456789123456789124357689","123456789123456789124378569",
"123456789123456789147258369","123456789123457689123458679",
"123456789123457689123468579","123456789123457689124356789",
"123456789123457689124358679","123456789123457689124367589",
"123456789123457689124368579","123456789123457689124389567",
"123456789123457689126345789","123456789123457689126347589",
"123456789123457689126348579","123456789123457689145267389",
"123456789123457689145268379","123456789123457689146258379",
"123456789123457689148259367","123456789124357689125348679",
"123456789124357689125367489","123456789124357689125368479",
"123456789124357689125378469","123456789124357689126358479",
"123456789124357689126378459","123456789124357689126389457",
"123456789124357689128345679","123456789124357689128356479",
"123456789124357689128359467","123456789124357689134258679",
"123456789124357689134268579","123456789124357689135268479",
"123456789124357689135278469","123456789124357689136258479",
"123456789124357689136278459","123456789124357689137268459",
"123456789124357689138259467","123456789124357689138269457",
"123456789124357689158267349","123456789124378569129356478",
"123456789124378569135279468","123456789124378569137245689",
"123456789124378569157268349","123456789147258369159267348",


My draft should be ready in 2 or 3 days. I am writing it in such a way that transposing it to the catalog building process should be relatively easy.
champagne
2017 Supporter
 
Posts: 7454
Joined: 02 August 2007
Location: France Brittany

Re: High density files for solution grids and 18 clues puzzl

Postby champagne » Thu Jul 04, 2024 1:49 pm

In the end, replacing the classical progressive fill of band 3 with a gangster mapping plus morphing of the valid fills per gangster pays off.

Details on the gangster identification, mapping, and valid fills can be seen here.

Applying this, the full catalog can be produced in less than 12 core-hours, about half the time of the previous code.
I also applied the idea brought by coloin to check the row 5 expansion for duplicated mini-rows, but the global effect is very small. In fact, we just save the time to fill rows 5 and 6 before the existing check that a filled band 2 does not have too low an index.

The code has now to be cleaned, and a final validation will take place.

Using Visual Studio, we get a virtual catalog of all solution grids in min lexical order of 3 MB, far beyond the best proposal seen up to now.
And the rank <=> solution grid transformation is done in micro- or milliseconds depending on band 1. The bad case is at the top of the catalog, where each row 4 can produce thousands of solution grids.

The repository has the current status of the code, before cleaning.


Postby champagne » Wed Jul 17, 2024 5:57 am

More comments after the last posts in another thread
http://forum.enjoysudoku.com/post347332.html#p347332

For the first issue, the "virtual catalog", the draft is closed; still missing are the cleaning and the switch to a DLL.
I am trying to design a better mapping of a band to its canonical morph, now a critical part of the code.

Some thoughts about the second issue: what should the design be for a catalog of the 18-clue puzzles?

We know that we have 5 472 730 538 ED solution grids, and for the 18-clue ED puzzles,

"blue" wrote long ago "... my estimate of 1,934,000,000 18C puzzles"
and "Mathimagics" was working on another estimate that only about 20% of the solution grids have 18s.

Let's start from the idea that an 18-clue puzzle is known just by its sequence number in a sequence produced in a first pass:
solution grid => 18s ranked in a canonical morph still to be defined.

Then, the virtual catalog can be used as a basis to build the virtual database of 18s.

Mathimagics's lost work was to establish the list of solution grids having an 18. This gives a one-bit-per-solution-grid property, easy to use as a filter in a virtual database.
I have a DLL from blue, but I don't know if the copy that I have is "public". Anyway, we need a builder of 18s for a solution grid; my code is not yet ready.

The first pass would be a very long process, but it is needed to get the start rank for the main index of the virtual catalog, the 652408 possible first 4 rows


Postby champagne » Sat Sep 07, 2024 8:21 pm

I am not far from having my best proposal for a “virtual catalog”.
With the first use of the new band mapping, I had a full catalog generation in around 10h 10mn.

I then applied the new mapping to the last steps of the process and tried to optimize the code in this critical area.
Meanwhile, I added the filter suggested by “coloin”, with a quick backtrack as soon as a mini column appears, when the first band used as start is over index 30 (123456789 457).

I reached a full generation in 3h 20m 57s 530ms on my old i7.

This is far below my expectations, but reaching the right count with a bug would be very surprising, so I take it as correct, subject to more investigation.

The repository has the corresponding version of the code.

The new band mapping is ok for the catalog, but still needs some code for the bands 1 having no min lex solution grid.
The min lex band mapping process is tailor-made for this use. The previous code (still there) remains so far the correct call for a mapping done in any other context.

I still have in mind an idea of “coloin” to use gangsters in band 1 to share part of the downstream process, but this seems hard to implement.

And there is a lot of cleaning to do to have a nice version. Moreover, a DLL is needed for wider use, something outside my skills.

Note: the code has been compiled with Microsoft C++. I still have to check the gcc compilation.

virtual catalog nearly ready

Postby champagne » Tue Oct 22, 2024 5:36 am

The virtual catalog DLL design is now locked and the draft is in test.

At the end, I have only three main functionalities

Code:
Rank => minlex solution grid
Solution grid => rank
Sequential processing on a virtual file



Rank => minlex solution grid
Here, nothing special: the DLL sends back the solution grid for the given rank.

Solution grid => rank

The solution grid can be minlex or not,
and the source can be chars 1-9 or integers 0-8.

This gives four different calls.
If the solution grid is not minlex, all data used in the minlexing are sent back.

Sequential processing on a virtual file

For anybody willing to work on the solution grids, we then have one way to do it:

Have an external loop on a slice of ranks,
Get the solution grid using the relevant call.

This is not a nice way to do it.
The DLL offers another way, a sequential access to a “virtual file”

The calling program must first open the virtual file, then get the “virtual records of solution grids”.

The file must process a chunk of partial solutions with the first four rows known, using internal tables.
In the first release, the call is for a chunk of “first bands”.

The reason for this is quite simple. In the process, for each row 4 (four rows) known to have solution grids, a catalog builder is run to produce the attached solution grids.
This can be 63000 attached solution grids, and this is the only point where the process can be cut.

So, the “virtual file” proposed is in fact:

Loop over the requested chunk of rows 4
Produce the attached solution grids
Deliver them in sequential mode.

With the GetNext() call, the process works in a slightly different way:

Take the next grid in the buffer of attached solution grids
If the last one has been delivered, take the next row 4 and refill the buffer
Do this until the last row 4.

In this mode, the average solution grid “get” is expected to take far less than a millisecond.


Control of the process and exchange file.


The user first has to say which of the three options is expected.
Also, an exchange structure pointer must be shared; it is used to send back more than just the rank or the solution grid.

Some details are still to be adjusted, but the ongoing test is nearly ready to be released.

I’ll do it in a separate thread with a new repository containing
the code,
the DLLs used,
the user .h files
and a sample code using the DLL
