OK, the details from your latest post are quickly converging on the essentials.
Partitioning the catalogue by band has advantages and disadvantages which I prefer to skip here. Let's assume we are working with the whole catalogue.
The most important thing in this data model (beyond the puzzle generation itself) is obtaining the grid index, right?
My point is that binary search isn't effective on huge files, but is very effective on chunks that fit in RAM.
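As a minimal illustration of the in-RAM case (the compact Grid key below is my assumption, not your actual storage format):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Grid {                        // hypothetical 128-bit key holding
    std::uint64_t hi, lo;            // the 80 significant bits of a grid
    bool operator<(const Grid& o) const {
        return hi != o.hi ? hi < o.hi : lo < o.lo;
    }
};

// Binary search within a sorted chunk: O(log n) comparisons, no disk seeks.
std::ptrdiff_t find_in_chunk(const std::vector<Grid>& chunk, const Grid& g) {
    auto it = std::lower_bound(chunk.begin(), chunk.end(), g);
    return (it != chunk.end() && !(g < *it)) ? (it - chunk.begin()) : -1;
}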
You can organize the process so that the data (grid + puzzle size + puzzle exemplar) are kept locally in a smaller ordered in-memory container, something like std::map<grid, std::pair<int, puzzle>>.
When the data size exceeds the available RAM, the container can be dumped to a file. At this point we still don't know the grid index.
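A rough sketch of this accumulate-and-dump phase (the Grid key repeats the previous snippet; the Puzzle record and the text line format are equally hypothetical):

#include <cstdint>
#include <cstdio>
#include <map>
#include <string>
#include <utility>

struct Grid {                          // same hypothetical key as above
    std::uint64_t hi, lo;
    bool operator<(const Grid& o) const {
        return hi != o.hi ? hi < o.hi : lo < o.lo;
    }
};
struct Puzzle { char givens[82]; };    // 81 chars + NUL, hypothetical

using Catalogue = std::map<Grid, std::pair<int, Puzzle>>;  // size + exemplar

// std::map iterates in key order, and the zero-padded hex keeps the text
// lines in the same order, so each dumped run comes out already sorted,
// ready for the file-merging steps described below.
void dump_run(const Catalogue& cat, const std::string& path) {
    std::FILE* f = std::fopen(path.c_str(), "w");
    if (!f) return;
    for (const auto& [grid, rec] : cat)
        std::fprintf(f, "%016llx%016llx %d %s\n",
                     (unsigned long long)grid.hi, (unsigned long long)grid.lo,
                     rec.first, rec.second.givens);
    std::fclose(f);
}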
Then you can
- combine several files into a larger one in a single pass, see
https://en.wikipedia.org/wiki/Sort-merge_join. The GNU Coreutils (available for Windows) command for merging is "sort -mu", where -m means merge and -u means unique.
- find the differences between files using the command "comm". This is applicable for determining the newly found records, i.e. those that have no corresponding records in the older (already processed) file.
- find the grid indexes and update the global puzzle catalogue in a single pass by implementing the merge-join in your own code, as sketched below.
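For orientation, the coreutils invocations would look like "sort -mu run1.txt run2.txt > merged.txt" and "comm -13 old.txt new.txt" (the latter prints only the lines unique to the second file); the file names are illustrative. And here is a minimal merge-join sketch for the last step, assuming one canonical grid string per line in both sorted, duplicate-free files, and assuming the grid index is simply the grid's rank in the catalogue:

#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::ifstream cat("catalogue.txt");   // global catalogue, sorted, unique
    std::ifstream fresh("new_grids.txt"); // freshly found grids, sorted, unique
    std::string c, n;
    long index = 0;                       // grid index = rank in the catalogue
    bool have_c = static_cast<bool>(std::getline(cat, c));
    bool have_n = static_cast<bool>(std::getline(fresh, n));
    while (have_c && have_n) {
        if (c < n) {                      // catalogue-only grid: just skip it
            have_c = static_cast<bool>(std::getline(cat, c));
            ++index;
        } else if (n < c) {               // not in the catalogue yet
            std::cout << "new: " << n << '\n';
            have_n = static_cast<bool>(std::getline(fresh, n));
        } else {                          // match: the grid index is known now
            std::cout << index << ' ' << n << '\n';
            have_c = static_cast<bool>(std::getline(cat, c));
            ++index;
            have_n = static_cast<bool>(std::getline(fresh, n));
        }
    }
    while (have_n) {                      // grids sorting past the catalogue end
        std::cout << "new: " << n << '\n';
        have_n = static_cast<bool>(std::getline(fresh, n));
    }
}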
I know that, after my suggestions for UA scanning, you will try to do exactly the opposite of what I wrote here. In your place I would do the same.
Some old code for working with collections of compressed grids can be found here.
One possibility to encode a puzzle (of known size and known solution grid) is to write an 80-bit map of the givens, determining the 81st cell from the puzzle size. The code in the above link uses 80 bits per grid, and one of those bits is free there too.
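A small sketch of this encoding, assuming '.' marks an empty cell in an 81-character puzzle string (the bit layout is illustrative):

#include <bitset>
#include <string>

// Bit i is set iff cell i (0..79) is a given; cell 80 is not stored.
std::bitset<80> encode(const std::string& puzzle) {
    std::bitset<80> map;
    for (int i = 0; i < 80; ++i)
        if (puzzle[i] != '.') map.set(i);
    return map;
}

// Decode against the known solution grid and the known number of givens:
// if fewer than `size` givens appear in the first 80 cells, cell 80 must
// be the missing given.
std::string decode(const std::bitset<80>& map, const std::string& solution,
                   int size) {
    std::string puzzle(81, '.');
    int givens = 0;
    for (int i = 0; i < 80; ++i)
        if (map[i]) { puzzle[i] = solution[i]; ++givens; }
    if (givens < size) puzzle[80] = solution[80];
    return puzzle;
}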
In general, that covers everything I wanted to contribute.