The New Sudoku Players' Forum

by **blue** » Thu Dec 22, 2016 5:46 pm

champagne wrote:Here below is a sample of "bad cases" in a part of the file where we had a higher density of such cases. Each case is split in 2 lines

(...)

Note to blue

Reference times for all or part the list of worse cases in the Gary McGuire process would be helpful

Hi champagne,

The output for 3 runs of the code, are below.
In the first, the 17-clue "hitting sets" are produced, but the solver isn't called to check them for being valid 17's.
In the 2nd, the code's original solver (BB solver) is called, for each hitting set. The total time goes up by a factor of ~2.8.
In the 3rd, a different solver is used (an old one of mine), and the idea that I describe near the end of this post, is used.
The times drop back to close to the times from the first list.

Hidden Text: Show

by **champagne** » Fri Dec 23, 2016 3:14 am

blue wrote:
champagne wrote:Here below is a sample of "bad cases" in a part of the file where we had a higher density of such cases. Each case is split in 2 lines

(...)

Note to blue

Reference times for all or part the list of worse cases in the Gary McGuire process would be helpful

Hi champagne,

The output for 3 runs of the code, are below.....

Hi blue,

Thanks a lot for these references. I have now a 10 days break with nearly no time to spend on that topic, but as a first reaction, it seems that most of these grid are not that much worse than average in the McGary's process. The results with your new process will be a good reference for my next version of the code.

I have now in hands a preliminary draft improving the last steps, but tests will start only in 2017.

by **eleven** » Fri Dec 23, 2016 10:09 pm

A beside question.
Are there any efforts in improving the processing times using multiprocessing with graphic cards (cuda) ?

by **champagne** » Sat Dec 24, 2016 8:41 am

eleven wrote:A beside question.
Are there any efforts in improving the processing times using multiprocessing with graphic cards (cuda) ?

Good question.

2 or 3 years ago, my son pushed me to have a look at the potential of graphic cards. At that time, I was working on the solver and I found nothing of value in that.

Here, it would be different; we have for sure a heavy parallel process.

Also, in the McGarry process, (something I copied), UAs tables are packed using a 65536 lookup table, this would done efficiently in a parallel process without table, so yes, you have here some potential. In my process, the tables after say 6 or seven steps are small enough IMO to be sent to a graphic card for the next steps.

But doing so, we add some compatibility issues.

The first step anyway is to built an efficient process.

by **dobrichev** » Thu Jan 05, 2017 9:26 pm

eleven wrote:A beside question.
Are there any efforts in improving the processing times using multiprocessing with graphic cards (cuda) ?

I personally never did serious efforts in this direction.
From the documentation I have read on this subject, the disappointing facts include
- The higher level frameworks are natively focused on 32-bit floating point arithmetics. Integers exist, but there is pain with some of the operations.
- The hardware array of units share the instruction pointer, and executing a conditional branch is ineffective. For example an If..then..else.. operator actually forces "false" units to execute "nop"s during the "then" block, and "true" units to execute "nop"s during the "else" block.

by **dobrichev** » Thu Jan 05, 2017 10:49 pm

blue wrote:The output for 3 runs of the code, are below.
...
In the 3rd, a different solver is used (an old one of mine), and the idea that I describe near the end of this post, is used.

Hi Blue,
Your post was gone in an entropy-dominated area of my mind. Well done!
Aren't 12-cells UA sets in your queue actually "rediscovered" ones? Why you don't make a pass over the all known UA (maybe in reverse order if you are working with sizes up to 12)?

After playing for some time with my implementation of McGuire's process, I realized that he underestimated the fact that the first cell of the top UA is dead for 3/4 of the time (or 5/6 if the top UA is of size 6).
Choosing the less-significant cell for the top UA to be one of the most populated cells, and doing something similar for the next 4 UA, dropped the execution time by ~25%.
This obviously is matter of separate topic, but I can't resist to share that the recent (still immature) version processes 100 (random?) grids for 23 seconds/grid for 17s, and for 3.07" for 16s. Recent code is in Github.
I should code an algorithm that reorders UA by some criteria, and simultaneously reorders cells to a virtual coordinate system so that the profit of the dead clues is maximized. Before calling the solver the cells should be mapped back, but in my implementation this happens very rare (<<5% of the time is consumed by the solver).
Other improvements include dynamic selection of the thresholds used in higher level UA generation, search for shortest unhit UA for up to ~8 unhit ones in advance (instead of scanning the whole unhit collection), ignoring 2-clue UA composites that are supersets of a UA below the threshold (like it is done for higher level composites), doubling the buffer sizes, using 256-bit registers, using 6-clue composites for the easier grids, using larger 4-digit and 4-boxes UA sets, etc.

I still believe that a kind of "divide and conquer" approach like in recent Champagne's work can overperform the classical UA hitting. Go Gerard, I am following your progress with interest.

P.S. In the original McGuire's process, even without reordering and remapping, since the used alphabetical order aligns most significant bits of the subsequent UA, iterating cells within the chosen UA in reverse order should improve performance with several percents. Blue, is it easy to check this?

Cheers,
MD

by **champagne** » Fri Jan 06, 2017 5:37 am

Hi Mladen,

After a long break over the year end, I restated the work on that 17's search trying to improve the situation for the "worst cases" without any damage for the average cases. So far, I changed more the last steps, not in the field of your post, but I worked in that direction earlier, so I have questions and comments.

dobrichev wrote:
After playing for some time with my implementation of McGuire's process, I realized that he underestimated the fact that the first cell of the top UA is dead for 3/4 of the time (or 5/6 if the top UA is of size 6).
Choosing the less-significant cell for the top UA to be one of the most populated cells, and doing something similar for the next 4 UA, dropped the execution time by ~25%.

Optimizing the effect of the dead cells in the first steps sounds good. I made many tests in that direction as you, but defining the optimum is not a simple thing.With algorithms I used, I got good results on specific solutions grids, but a reverse effect on other solution grids. IMO, it can not be a major factor, to define the best process.

In the current code, I am focusing on other issues, and I select the UA's cells directly by a "bitscan" in the UA bits field. Choosing the cells to optimize the "dead effect" is not compatible with that process, so I keep that for later.

BTW, I am not 100% sure to read properly that

Choosing the less-significant cell for the top UA to be one of the most populated cells, and doing something similar for the next 4 UA,

My understanding is that you reorder the cells of the smallest UAs to process first the cell(s) hitting more UAs (in part of the table or in all the table??). Most of the tests I made were in that direction.

dobrichev wrote:but in my implementation this happens very rare (<<5% of the time is consumed by the solver).

The solver weight in the process is low if it is used exclusively as final check of solutions when all other filtering possibilities have been used, so I would agree on that.
I tried to check in my current process the validity of a bands 1+2 partial solutions. Then, the filter appeared too costly. (subject to...)

dobrichev wrote:
I still believe that a kind of "divide and conquer" approach like in recent Champagne's work can overperform the classical UA hitting. Go Gerard, I am following your progress with interest.

I really believe that the current search will give good results. Worst cases specific to the split in {band 1+2; band3} are there, but the process is very efficient for other solution grids.

I made recent changes in the lay-out of the final steps and likely, results will not come before my departure for several weeks on a sunny beach side (where I'll continue to work). I want to go till the test on a sample of solution grids without 17 after these changes and then only, implement other pending ideas.

So far, the next ideas to test that I have in mind are

- An extension of the solution grids filter in band 3 to "non minimal" solutions.
This can be a key issue if the frequency of solutions in bands 1+2 with less clues than necessary to have the minimum number of clues in band 3 is as high as in my tests on solution grids with known 17s

- optimization of the bands 1+2 selection of UAs
Including new attempts to play with the "dead branches" factor

- and as usual, some code optimization.
Having now a better knowledge of the "intrinsic" functions in C++, I am reconsidering some bit operations. Did you try to use the {bit test; bit test and reset...}. These instructions can surely replace in some cases things done in the128 bit fields. (using for example your bitSet table to set or clear a bit)

by **dobrichev** » Fri Jan 06, 2017 9:40 am

champagne wrote:Did you try to use the {bit test; bit test and reset...}. These instructions can surely replace in some cases things done in the128 bit fields. (using for example your bitSet table to set or clear a bit)

No. I am limited to Ivy Bridge (actually Sandy Bridge) architecture.
Note that
- switching from mm128 register to r64 registers and vice versa usually requires storing and loading from memory
- some of the intrinsic functions don't compile to a single fast CPU instruction but to a sequence that is out of your control and is possibly suboptimal for your case

champagne wrote:BTW, I am not 100% sure to read properly that
Choosing the less-significant cell for the top UA to be one of the most populated cells, and doing something similar for the next 4 UA

Consider the following 3 UA4 at the top, where for simplicity the third one is disjoint from the first two.

Code: Select all: 1111000000000...0000 0001111000000...0000 0000000011110...0000

Now count the number of outputs after iterating trough the first 2 UA, taking into account the dead clues.
Do this by iterating the cells from left to right. Then do the same by iterating from right to left. Compare 13 to 10. But you can still iterate from left to right and achieve better output if you exchange the leftmost 3 columns with the rightmost ones.

Code: Select all: 0001000000000...0111 0001111000000...0000 0000000011110...0000

Or even better if you do more complicated column transformation.

Now assume that the third UA for some reason is placed on top. Worse.

Positions of the UA within the list are one degree of freedom. Exploit it by reordering them in a manner that generates less outputs.
Similarly, the cell ordering is another degree of freedom. Exploit it by reordering the cells (the columns in the above example) so that, together with UA reordering, the scanning generates less outputs.
This is a complex task that maybe can be resolved up to some approximation. From other side, a simplified test run over the top several UA takes milliseconds, so it is possible to choose the best one from several alternatives using brute force.

by **champagne** » Fri Jan 06, 2017 12:09 pm

2 comments to your last post

Instructions

bit test
bit test and reset
bit test and set
bit test and complement

are part of the basic instruction set (80286 80386...) The intrinsic function is just supposed to call the corresponding instruction, If I read correctly the instruction description, it can directly set a bit in memory, and this is the final need in many tasks.

Regarding the cells priority, what you show is clear and I made tests in that direction, but as I wrote, I could not find an algorithm performing well.

by **dobrichev** » Fri Jan 06, 2017 4:50 pm

Hi,
You are right.
In the last 10 years the hardware made progress. In older processors these instructions were considered slow (see the the latency of 8-9 on the last row here).
According to discussion on this bug report, in 2011 they are considered fast, and the compilers should optimize a typical operations sequence to this instruction.
I didn't check whether my compiler does so, but at least I should represent the code in suitable for optimization form.
Thanks for the advice.

by **dobrichev** » Fri Jan 06, 2017 8:36 pm

Tested btr instruction for horizontal enumeration of the bits set in mm128 register.
Gnu compiler hasn't appropriate intrinsic function, maybe for a reason. Also it doesn't translate bitfield &= ~(1 << numbits) to btr instruction.
MS compilers has intrinsic function _bittestandreset which takes memory address as operand. See this bug.
Seems that the power of this instruction is its atomic (locked during the instruction execution) form with memory operand, which is slow but guarantees that no other thread touches this memory cell between the reading and the writing.

Currently my code for clearing the current bit is

Code: Select all: bits &= (bits - 1)

Changing it to

Code: Select all: bits &= ~(((uint64_t)1) << offset)

results in a slower code w/o btr instruction involved.
Changing to

Code: Select all: asm("btr %1,%0" : "+r"(bits) : "r"(offset))

enforces usage of btr instruction, but the time is compatible with the shifting from the previous variant.

by **blue** » Fri Jan 06, 2017 11:07 pm

Hi Mladen,

dobrichev wrote:
blue wrote:The output for 3 runs of the code, are below.
...
In the 3rd, a different solver is used (an old one of mine), and the idea that I describe near the end of this post, is used.

Hi Blue,
Your post was gone in an entropy-dominated area of my mind. Well done!

Thank you.

Aren't 12-cells UA sets in your queue actually "rediscovered" ones? Why you don't make a pass over the all known UA (maybe in reverse order if you are working with sizes up to 12)?

It's extremely rare that a 12 cell (or smaller) UA, enters the queue.
Some explanation:

The "front end" in Gary's code, assuming it tracked all UA with size <= 12, would never pass a puzzle that fails to hit all such UA.
Even if it did pass such a puzzle, unless it happened near the beginning of the process, there's a good chance that it would miss one of the larger UA sets that's already in the queue.
(#2) does come into play, though, since the assumption in (1), isn't quite true.
For the outer loops, the code works with a list of up to 768 (minimal) UA's, and it's possible (?) that grids with more than 768 with size <= 12, exist.
After the first 7 clues have been added [ 8 clues for 17's ... maybe not the best choice ... long story ], the code consolidates its UA/multi-UA lists -- removing sets that have already been hit, and trimming the lists to various sizes.
The "single UA" list, gets trimmed to "size <= 128", and with that, some of the larger sets can be dropped.

After playing for some time with my implementation of McGuire's process, I realized that he underestimated the fact that the first cell of the top UA is dead for 3/4 of the time (or 5/6 if the top UA is of size 6).
Choosing the less-significant cell for the top UA to be one of the most populated cells, and doing something similar for the next 4 UA, dropped the execution time by ~25%.

Nice idea !
By "most populated", you mean "ocurrs in the most UA's" ?
If so, you might try weighting the "occurence counts" by a factor (N / <UA size>), for some large N, to reflect the idea that in what comes later, the smaller sets are more likely to be selected for "next clue" sets.

P.S. In the original McGuire's process, even without reordering and remapping, since the used alphabetical order aligns most significant bits of the subsequent UA, iterating cells within the chosen UA in reverse order should improve performance with several percents. Blue, is it easy to check this?

Yes, it's easy. I don't have the time right now now, but I'll try it over the weekend.

I still believe that a kind of "divide and conquer" approach like in recent Champagne's work can overperform the classical UA hitting. Go Gerard, I am following your progress with interest.

I agree 100%. I've been following the discussion here, as well ...

I've let champagne know (in PM) that I've been working on another approach, but I haven't mentioned any of it's concepts.
It uses the divide & conquer approach, ideas from the OP of this thread, champagne's "UA2's" idea, the "found" UA queues on a massive scale, and a (highly) modified version of your FSSS2 code. I keep making improvements, but I'm close to final code, which I'll share.
[ It's so fast now, that nobody would believe me without actually seeing it run. ]

Cheers,
Blue.

by **eleven** » Sat Jan 07, 2017 12:11 am

blue wrote:[ It's so fast now, that nobody would believe me without actually seeing it run. ]

exciting

by **coloin** » Sun Jan 08, 2017 3:41 pm

Good news ...
No doubt the optimization ++ is going to be needed.

Splitting a search up might also yield - provided the improvements were greater than the increased search place.

For example

say we postulated that all 17s have been found whitch begin with max lex {x..x......} or {xx......} I think we have ...
which means that all non-found 17s have max lex {x..x..x..] or more

If we were to search a grid for a 9plus14 [ 9 clues in first line - try to remove 6 of these later] more than 18 times quicker than searching for a 17 de novo then that might be an improvement..

I do hope that searching for a 14 clue solution in 72 clues [with less UA] is more than 18x quicker than searching for a 17 in 81 clues ... is it ?

by **blue** » Sun Jan 08, 2017 9:37 pm

blue wrote:
dobrichev wrote:P.S. In the original McGuire's process, even without reordering and remapping, since the used alphabetical order aligns most significant bits of the subsequent UA, iterating cells within the chosen UA in reverse order should improve performance with several percents. Blue, is it easy to check this?

Yes, it's easy. I don't have the time right now now, but I'll try it over the weekend.

It didn't seem to make a (significant) difference.
Some grids went faster, and some slower, with times changing by as much as 25%.
For (the same) 100 random grids, the average dropped from 21.64s to 21.06s, or about ~2.7%.

The New Sudoku Players' Forum

Bands and low-clue puzzles

Re: Good and bad news in the search of 17s

Re: Good and bad news in the search of 17s

Re: Bands and low-clue puzzles

Re: Bands and low-clue puzzles

Re: Bands and low-clue puzzles

Re: Good and bad news in the search of 17s

Re: Good and bad news in the search of 17s

Re: Good and bad news in the search of 17s

Re: Bands and low-clue puzzles

Re: Bands and low-clue puzzles

Re: Bands and low-clue puzzles

Re: Good and bad news in the search of 17s

Re: Good and bad news in the search of 17s

Re: Bands and low-clue puzzles

Re: Bands and low-clue puzzles