The New Sudoku Players' Forum

by **champagne** » Thu Oct 26, 2017 6:46 am

The "final" code for the scan 566;655 of the solution grids is now available. (before next bug :roll:

)

The documentation has not changed since the previous post. (URL in the first post)

The release can be downloaded at the following place here

The zip file contains

. the source code in the directory src
. the project definition used on the platform microsoft visual studio
. a short readme file
. sk17p2a.exe executable code under windows
. _zbad50.txt a list of 20 puzzles giving very bad run times in pass 2.

The _zbad50.txt file is a list of entries with some weeks ago a run time over 50 seconds and now an average run time of 3.4 seconds per entry on my PC.

The code has been checked against the known 17 and all of them showed up in a process using only one band 3.

by the end of this week, the scan will be started (delayed due to a hardware big problem).

The expected run time with the last code is around 0.2 second per solution grid, with an average 0.01 second per solution grid for the bands 1 having the smallest count in valid bands with 6 clues.

A first summary will be done after the processing of the 10 bands 1 with the lowest number of valid 6 clues (about 10% of the total count of solution grids, exactly 435 188 953 solution grids)

by **coloin** » Thu Oct 26, 2017 12:26 pm

Very good.
I understand now how the 665 will also have to be done on your complete sample ...
surely there will be no bugs .... if you implement a 656,566 and the ? costly 665 ... this has to work !

maybe in the 665 search a way to limit the number of Band1&2 options would be to divide the search into a max clues in box of 2,3,4 or 5 - but with already implemented stack reductions this might not achieve anything overall !

The reduced sample size with reduced Band1&2 with band 3 [=5e9 exactly] - is this the area where you think a bug might be - when some grids might be missed ?

Choosing the easiest band1&2 [of the 6 pairs in any solution grid] with lowest completions all contribute to the efficiency ...
and because of this - we don't have to search for the same puzzle 5 other times using the 5 other band1&2 in the ED solution

With the increased efficiency in choosing and clearing band1&2 and associated multi solution band 3s, does blues first pass [ with band 3 more than 6] become even quicker ?

by **champagne** » Thu Oct 26, 2017 3:13 pm

Hi Coloin,

I understand now how the 665 will also have to be done on your complete sample ...
surely there will be no bugs .... if you implement a 656,566 and the ? costly 665 ... this has to work !

Yes, the code as it is can process the 665 case, but as you say, it is a costly process. I applied the 665 search in the test with the known 17 (665) file to be sure to have the code valid.

The big sample (more than 20 000 ED bands 1+2) was run excluding the 665 case. I am waiting for a fix of my hardware problem to rework it with the "best processor". I did so just because I am now 100% convinced that a separate pass with an appropriate entry file (again just 5e9 bands 3) will work better. Knowing that we will have more ED bands 1+2 to process and that the number of XY will be slightly higher, we could think of a run time equal to 0.7-0.8 of the (566+656) pass.

maybe in the 665 search a way to limit the number of Band1&2 options would be to divide the search into a max clues in box of 2,3,4 or 5 - but with already implemented stack reductions this might not achieve anything overall !

I don't catch your point. After the above remark, a 665 step in the same run than the 566 656 would be applied only for your 18 clues search 222222222. (BTW is it just an idea or do you have a true interest in getting these 18). There is very little to modify in the program to do that.

The reduced sample size with reduced Band1&2 with band 3 [=5e9 exactly] - is this the area where you think a bug might be - when some grids might be missed ?

History learns us that a program with no bug is usually a program with a limited number of remaining bugs. Hopefully, some of them are in cases never seen, and others only affect the performance. For me, the program is bug free

.

With the increased efficiency in choosing and clearing band1&2 and associated multi solution band 3s, does blues first pass [ with band 3 more than 6] become even quicker ?

Without the code, it is impossible
a) to say that all features activated by blue are there
b) that the coding is better than in blue's program (blue worked likely several years part time on this topic).
From messages sent by blue, I am sure that
a) he worked with less UAs GUAs (but he used intensively UAs GUAs generated by the brute force)
b) he did not implement the 2X2Y loop. This loop IMO generates huge savings in "bad cases"

The best we can hope is to have him back to comment on this.

Creating the program for pass 1 (band3 with >=7 clues) will be my next task. I'll finish with the appropriate entry file to be run to bypass the 665 case

by **coloin** » Sat Oct 28, 2017 8:26 pm

champagne wrote:(BTW is it just an idea or do you have a true interest in getting these 18 ?

Well its an idea to speed up your search -but it might not .
Given that a likely new 17 is going to have 3 clues in at least one of the boxes ... probably ....
the 222222221 have been searched for [ via repetitive production of 222222222]
and a new 4******** which has 665,665 seems unlikely
so
a 332222111 like
312
221
132
a search specifically for band 1 with 321, 312, 231, 213, 123, 132 clues per box etc and associated band 2s [221, 212, 122 ]

the point being that i suspect splitting it up like this isn't necessarily going to speed up the complete search !

although an empty box essentially means that a band will be 320 or 330 [ if we say that there wont be a new 17 with 4 clues in any one box - albeit a big assumption]

by **champagne** » Thu Nov 02, 2017 6:39 am

While the scan for the 566;656 cases is on going, I started the drafting of the code for the "pass 1" in blue's terminology, testing each band/stack of the catalogue for >=7 clues.

Blue describes the entry "optimal file" as 983 959 110 ED bands 1+2 with 32 835 497 303 bands 3
The "entry code" used in the released pass 2a with a band 3 filter limited to the clearing of "auto morphs" produces exactly the same count.

Here below, some data from blue out of a pm

32 835 497 303 - ED <bands 1&2, band 3> pairs
32 836 383 228 = 6 * 5 472 730 538 - estimate using "G classes" (e.g. min lexical grid catalogue)
32 840 614 326 = eligible band 3 fills, summed over ED "bands 1&2" inputs (current code)

32 836 383 228/32 835 497 303= 1 + 0.00027% - "redundancy" using (G-class)
32 840 614 326/32 835 497 303= 1 + 0.01558% - "redundancy" for current code, "(N>=7)+x+y" version

and the code used to produce the entry giving the expected optimal count (released
void GEN_BANDES_12::M10Find_band3B
where the code irrelevant for this case has been released)

Hidden Text: Show

The redundancy effect comes from bands with a possible morphing to initial after bands and stacks have been exchanged. It seems that the current algorithm is "optimal" toward that possibility.

The other difference with blue's entry file is the sorting sequence. Here bands have in band 1 the band with the "lowest count of valid bands with 6 clues". This should be a better sorting sequence, although the bonus is much smaller than in the pass 566;656

The average number of bands 3 per entry is 33.37 and the average per band 1 is in the range 32-37 except for the last band (21.32). So the distribution of bands 3 is quite homogeneous.

Next step is to revise the bands expansion to cover all possibilities from 3 to 7 clues. From here, my estimate is that a draft tested for blue pass 1 could be available by the end of November.

by **champagne** » Fri Nov 17, 2017 6:20 pm

The first version of the code for the search of 17 clues puzzles having one band/stack with >=7 clues is now released here sk17p1.

This code has passed the first test looking for known 17 but the first runs in final conditions showed a bug (I already suspected the same in the pass 2a).

The bug is now fixed, but all the work done in the pass 2a has to be re done, what will come later.

I run the pass 1 on the sample entry file with the same 20362 ED bands 1+2 spread over the 415 first bands. The test has been done with 7 cores fully occupied in parallel. The average number of bands 3 has been slightly over 32 and the run time 281 milliseconds per ED bands 1+2. My average estimate for a full process is around 300 milliseconds per entry corresponding to <60 milliseconds per solution grid.

We are in the range of the figures given by blue who claimed among others an average 7.8 ms per solution grid for a 16 clues search, something that I'll test after adjustments of the program.

the scan has been restarted for the pass 1. A more precise test on the first 6 bands showed that all known 17 should now be produced.

Next post will give the results of the pass 1 for the first 20 bands 1, and likely the code for the 16 clues search..

A new page has been added in the comments for the pass 1 (see first post for the link).

by **champagne** » Sun Nov 19, 2017 3:52 pm

eleven wrote:blue,
i'm looking forward to raise my glass to your health, when you will publish your code.
Incredible improvement compared to McGuire's work.

Hi eleven,

I am not quite sure about the improvement against McGuire's work, but I released a code in the mood of blue's findings that seems in the good range.

I have still a possible improvement in mind (for all the 17 process), this is mainly for bands with a minimum number of clues <4, so I have some weeks to implement and test it.

I put in the first post the links to the up-to-date versions of the different programs released

17 clues pass 1
17 clues pass 2a
17 clues pass 2b (not yet ready)
16 clues search.

Normally, these are without debugging code, but anybody willing to get a version including some debugging code can send me a pm or a mail.
The .exe is always for windows and requires a 64 bit processor plus the sse3 full compatibility (not an old AMD)

by **eleven** » Sun Nov 19, 2017 5:48 pm

champagne wrote:I am not quite sure about the improvement against McGuire's work, but I released a code in the mood of blue's findings that seems in the good range.

Hi champagne,

when you write in your readme, that your code is about 350 times faster than the one by McGuire&al, this is an incredible improvement. Using it, McGuire's scan could have been done in a day instead of a year with that powerful computer environment.

As already mentioned in a pm, i neither have time nor hardware resources to contribute in the project. But i am optimistic, that enough people will do, when it comes to the big scan.

by **champagne** » Sun Nov 19, 2017 6:15 pm

eleven wrote:when you write in your readme, that your code is about 350 times faster than the one by McGuire&al, this is an incredible improvement. Using it, McGuire's scan could have been done in a day instead of a year with that powerful computer environment.

The 1:350 ratio was given by blue, I think that this was the ratio for a sample processed by blue, I don't have an exec of Gary McGuire's code to do the benchmark. Blue compiled the sources released by the Gary McGuire team. He had the possibility to benchmark both programs.

Generally speaking, as far as I remember, in the article, it is said that the average run time per solution grid was over 3 seconds per solution grid. This is equivalent here to an average 48 ms per entry (for an average 3 seconds).

The only thing that I can say is that this is where we are with the released code. But I can't say if the milliseconds are similar or not. (And blue can also have coded better things)

I'll see whether the planned change speed up the process or not

by **eleven** » Sun Nov 19, 2017 9:41 pm

champagne wrote:Generally speaking, as far as I remember, in the article, it is said that the average run time per solution grid was over 3 seconds per solution grid.

Yes, Gary McGuire (active on this forum as Moschopulus) said there:
"The entire computation took about 7.1 million core hours on the Stokes machine at ICHEC. Stokes is an SGI Altix ICE 8200EX cluster with 320 compute nodes. Each node has two Intel (Westmere) Xeon X5650 hex-core processors and 24 GB of RAM.
...
The average running time of our final version of the newchecker is about 3.6 seconds per grid (on a single core of the above model of CPU, with hyper-threading enabled)."
(the code was optimized during the computation, deviding the core hours by the number of grids gives an average of 4,67 seconds)

by **champagne** » Mon Nov 20, 2017 8:16 am

eleven wrote:
champagne wrote:Generally speaking, as far as I remember, in the article, it is said that the average run time per solution grid was over 3 seconds per solution grid.

Yes, Gary McGuire (active on this forum as Moschopulus) said there:
"The entire computation took about 7.1 million core hours on the Stokes machine at ICHEC. Stokes is an SGI Altix ICE 8200EX cluster with 320 compute nodes. Each node has two Intel (Westmere) Xeon X5650 hex-core processors and 24 GB of RAM.
...
The average running time of our final version of the newchecker is about 3.6 seconds per grid (on a single core of the above model of CPU, with hyper-threading enabled)."
(the code was optimized during the computation, deviding the core hours by the number of grids gives an average of 4,67 seconds)

Thanks eleven for these data.

7.1 million core hours is more or less 810 cores working in parallel. I can activate 20 cores !!!! (of inequal performance)

The next interesting point is the pair of "seconds per grid"

3.6 seconds per grid on a single core
4.67 seconds per grid in average

It would be good to learn more from Gary McGuire, but running 7 cores in parallel, I see a deterioration in the timings that I roughly estimate around 20% (but it can be worse).

The sk17p1 released can still have bugs but I am not expecting to find bugs having a significant effet on the global performance, so I am confident that the average 300 ms per entry with 7 cores in parallel is reliable (on my best processor).

The ratio 1:3 between sk16 and sk17p1 is reliable as well, so this would give an average 100 ms per entry (18ms per solution grid) that we could compare to the 4.67 seconds giving here a ratio of 260.

With no true benchmarking potential, this is the best that I can do. I am confident that blue produced a reliable ratio, but I can not guarantee that the sample was representative.

by **dobrichev** » Mon Nov 20, 2017 1:22 pm

champagne wrote:running 7 cores in parallel, I see a deterioration in the timings that I roughly estimate around 20% (but it can be worse).

You should be happy if the performance degradation is only 20%.

For a code where the memory access is the bottleneck, you can halve the processing speed by only running in parallel a second instance of the program, which is sufficient to pollute the cache.

Following the simplified hardware diagram

RAM <-> cache <-> registers <-> ALU

the ideal scalability is achievable only when the inner loops work only on registers.

The "hyperthreading" is nothing more than a secondary family of registers to which the same ALU can switch while the code waits for an external (to ALU) operation to finish.
The cache is shared in a complicated way by all CPU and is easy to become a bottleneck.
The RAM is shared too and concurrent reads/writes even on different addresses cause the processors to wait each other.

The art is either to
- find a good balance between memory operations and ALU instructions (for example few additional tables with intermediate results could work fine on single thread but cause cache pollution on 2 or more parallel threads and total performance is better when not using them), which is a hardware specific optimization that could not work the same way on different machines, or
- instead of running several instances of the process, run a single instance and run in parallel its middle loops. This increases the chances different threads to use large amount of the same data from the same memory addresses, leaving more free cache for use by the thread-specific data.

Congratulations for the progress done.

I am still busy with "primorial soup" puzzle and found no time to look whether your source code is read/write or write-only.

Cheers,
MD

by **champagne** » Mon Nov 20, 2017 3:19 pm

Hi mladen,

champagne wrote:running 7 cores in parallel, I see a deterioration in the timings that I roughly estimate around 20% (but it can be worse).

You should be happy if the performance degradation is only 20%.

You are likely right, but as I wanted in priority to test the pass 1 (BTW I found another small bug in the vicinity of the last one), I just tested the sample with the full load of the processor.
I’ll try to have a better estimate of the deterioration later, on the “final” version of the code.

I started the 10 first “band 1” on the same processor (the best one) and I am waiting for the closure of the 3 bands with the highest number of ED bands 1+2 to have an idea of the average run time per ED band 1+2 in this area (2.7% of the bands 3 to test in the pass 1, would be much more in pass 2).

For a code where the memory access is the bottleneck, you can halve the processing speed by only running in parallel a second instance of the program, which is sufficient to pollute the cache.
Following the simplified hardware diagram
RAM <-> cache <-> registers <-> ALU
the ideal scalability is achievable only when the inner loops work only on registers.

Clear, but not so easy to apply. Here, the inner loop uses a 256x256 matrix in a fix place and the code is limited to some tens of instructions, as for example

Code: Select all: for (ix = 0; ix < nxc; ix++, px = &px[2]){ store = ix << 8; //8 low bits for iy register uint64_t *py = vyc; for (iy = 0; iy < nyc; iy++, py=&py[2]){ if (px[0] & py[0]|| px[1] & py[1]) continue; *pstore++ = store | iy;// (ix<<8) | iy } }

Where nxc=nyc=256, and this is likely 80% of the core consumption.

The art is either to
- find a good balance between memory operations and ALU instructions (for example few additional tables with intermediate results could work fine on single thread but cause cache pollution on 2 or more parallel threads and total performance is better when not using them),

If you are thinking of lookup tables, I tried to kill most of them for the reasons exposed here. Some more cleaning is needed to reach the optimum.

- instead of running several instances of the process, run a single instance and run in parallel its middle loops. This increases the chances different threads to use large amount of the same data from the same memory addresses, leaving more free cache for use by the thread-specific data.

Not so easy to implement, and this is why I stick to what I can manage, the same program run in parallel, but one could think of running the same band 1 in parallel, sharing the corresponding tables.

found no time to look whether your source code is read/write or write-only.

do you mean watching or contributing??

My last attempt to improve the process is to apply a 2X3Y or a 3X3Y for the bands with a high number of valid bands.

This goes in the same direction, the drawback is double
time spent in the “vectors building”,
more chunks far from the 256x256 size

so this has to be excluded for bands with a small number of valid bands

Cheers
champagne

by **dobrichev** » Mon Nov 20, 2017 3:53 pm

champagne wrote:If you are thinking of lookup tables...

No. Which lookup tables to use is a different problem although related. In my terminology a lookup table consists of constants which values are known prior to execution.
I meant intermediate results which could be calculated once and eventually reused rather than re-calculated again and again.

champagne wrote:
found no time to look whether your source code is read/write or write-only.

do you mean watching or contributing??

Don't expect from me to contribute using directly your executable. I would firstly try to understand your code, then port it to my platform, then test it. If the code isn't readable (according to my own standards), then these tasks are time-consuming.
In order to estimate the readability of your code I must download it but I still didn't.
That said, at the moment I am not ready to comment or ask you for details.
Contributing is (a hypothetical) later stage.

by **champagne** » Mon Dec 18, 2017 9:09 am

17 clues search milestone

Note : obsolete posts have been deleted

The ongoing search is the pass 1 in blues terminology : the search of 17 clue sudokus having at least one band with >=7 clues

The up to date code is sk17p1V3.1.zip (see post 1)

So far the 56 first bands (out of 416) have been closed and verified . All known 17 of that kind are there, but no new 17. (this with a tiny window for errors in the first bands run with a bugged version of the program)

The pass2a code (no band/stack >=7 clues 566 656 clues distribution) has been adjusted to the last process and the "debugging code free" version uploaded as well.

The pass2b code (equivalent to a 665 distribution in the pass2a is not yet ready. This is the next point on my to do list.

The update of the comments will be done ASAP.

Nothing new about the efficiency of the last versions, I want in priority to run more bands to do the test on a bug-free (if this exists) code. Leaving Brittany for a long period in some weeks, I'll do it anyway before departure.

The 16 clues search has been updated out of the sk17p1V3.exe and released, but no test has been done on it except that I run the big test again on my laptop and I got an average 136 ms per “band 1”, likely equivalent to the 7 ms per solution grid given by blue.

The New Sudoku Players' Forum

scan solution grids for 17 clues as of blue

Re: scan solution grids for 17 clues as of blue

Re: scan solution grids for 17 clues as of blue

Re: scan solution grids for 17 clues as of blue

Re: scan solution grids for 17 clues as of blue

Re: scan solution grids for 17 clues as of blue

Re: scan solution grids for 17 clues as of blue

Re: scan solution grids for 17 clues as of blue

Re: scan solution grids for 17 clues as of blue

Re: scan solution grids for 17 clues as of blue

Re: scan solution grids for 17 clues as of blue

Re: scan solution grids for 17 clues as of blue

Re: scan solution grids for 17 clues as of blue

Re: scan solution grids for 17 clues as of blue

Re: scan solution grids for 17 clues as of blue

Re: scan solution grids for 17 clues as of blue