The New Sudoku Players' Forum

by **JasonLion-Admin** » Sat Feb 03, 2018 1:52 am

Hey, this is getting more and more off the topic of Sudoku, which is the purpose of this forum. Please stick to Sudoku related topic

Thanks
Jason
System Admin

by **emerentius_** » Sat Feb 17, 2018 2:23 am

champagne wrote:In most cases, the user does not care about the results, only by the count of solutions (cancelling the task if the count is >1).
In some cases (UA collector on my side for example) the solution is needed, but in a specific context and a tailor made extraction can be done.

Need is a strong word. I'd be surprised if there are any users of my library

I'm just offering a few apis and one of them returns the solutions. Another just counts. I don't have any specific use cases in mind just yet but I'm still expanding the feature set. These are the libraries docs btw. External dependencies are very easy to use in Rust.

The recent clang versions give a nice boost to the solver. I'm seeing a ~2-3% performance jump going from LLVM 4 -> 6 (in Rust) and similarly ~3% for clang 4 -> clang 5 with JCZsolve.

by **emerentius_** » Mon May 28, 2018 7:10 pm

I think I've found a sizeable performance improvement. I'm not sure if it's valid under all circumstances.

In the places that look like the following code snippet in the Update routine, just before calling UPWCL, deleting the if-branch has improved the speed for most sudoku sets by 5-12% (in my Rust code).

Code: Select all: if (((AR>>3) & 7) != S) { AR &= 0777777707 | (S<<3); UPWCL(1, 1, 4, 7, 13, 16, 19, 22, 25); }

It should have the same speedup in C given that both use LLVM. I also have a small dataset of 10 sudokus for which there's a pessimization of ~15%, but top50k, sudoku17 and 20 of the hardest sudokus all show speedups.

Was the check added for performance only or is it needed for correctness in some edge case?
I didn't find any deviations in recognizing multi-solution sudokus or finding the correct solution for proper sudokus. I've checked sudoku17, top50k and FT_1_124, which is 130k sudokus total, among them 15k non-proper ones. I've also generated 1.000.000 proper sudokus. No differences.

Edit: What I find even more exciting is this: Rust has bounds checks on array indexing. With this change and a few others, the performance penalty for that vanishes. My implementation can be 100% memory safe Rust at full speed. I'll refrain from guessing what exactly it is that lets the optimizer do its magic.

by **champagne** » Tue May 29, 2018 10:33 pm

emerentius_ wrote:I think I've found a sizeable performance improvement. I'm not sure if it's valid under all circumstances.

In the places that look like the following code snippet in the Update routine, just before calling UPWCL, deleting the if-branch has improved the speed for most sudoku sets by 5-12% (in my Rust code).
Code: Select all
if (((AR>>3) & 7) != S) { AR &= 0777777707 | (S<<3); UPWCL(1, 1, 4, 7, 13, 16, 19, 22, 25); }

....

What you are doing is not clear, but in the original code, this says

If you have new assignments in the band for the current digit,
update the new "unassigned status"
update the current digit status in other bands
update the other digits status.

Each of these operations has to be done and must be evaluated in the general scope.
I don't foresee a possible improvement skipping the test but ...

by **emerentius_** » Wed May 30, 2018 1:19 pm

champagne wrote:What you are doing is not clear, but in the original code, this says

Sorry, I meant: Skip the check and execute the contained code unconditionally.

champagne wrote:If you have new assignments in the band for the current digit,
update the new "unassigned status"
update the current digit status in other bands
update the other digits status.

Where exactly are you quoting this from?

by **champagne** » Wed May 30, 2018 1:57 pm

emerentius_ wrote:
champagne wrote:If you have new assignments in the band for the current digit,
update the new "unassigned status"
update the current digit status in other bands
update the other digits status.

Where exactly are you quoting this from?

This looks like a code written by me, so I hope that it is doing what I expected :lol:

emerentius_ wrote:Sorry, I meant: Skip the check and execute the contained code unconditionally.

This was my understanding, but if I identified the code, I confirm that I see no reason to speed up the process

by **emerentius_** » Wed May 30, 2018 7:12 pm

champagne wrote:This looks like a code written by me, so I hope that it is doing what I expected

Did you post the code in this thread already? If so, where? I'd be happy for anything that'd help me understand the intricacies of the solver.

champagne wrote: This was my understanding, but if I identified the code, I confirm that I see no reason to speed up the process

Do you mean, you don't see why that would speed it up? There is not much code inside and branch misprediction can be costly.

by **champagne** » Wed May 30, 2018 9:09 pm

emerentius_ wrote:
champagne wrote:This looks like a code written by me, so I hope that it is doing what I expected

Did you post the code in this thread already? If so, where? I'd be happy for anything that'd help me understand the intricacies of the solver.

See the first post page 13 in this thread. JasonLion also posted a C version of the same code

emerentius_ wrote:
champagne wrote: This was my understanding, but if I identified the code, I confirm that I see no reason to speed up the process

Do you mean, you don't see why that would speed it up? There is not much code inside and branch misprediction can be costly.

I don't understand "branch misprediction". This code has been studied by many guys (and is directly derived from the original code) The inside code is activated if and only if an assignment took place in the update sequence just above. Even using the pre processor, it contains about 10 instructions.

by **zhouyundong_2012** » Thu May 31, 2018 8:46 am

I think there are few people actually understand the code, this block the research of this code.
So I have an idea to make a viewable and operatable program which has next step button, prev step button.
what can reader view？ in each one grid, it displays candidate digits

1 2 3
4 5 6
7 8 9
next step
------------------->
1 2 3
4 5 6
7 8 9
next step
------------------->
1 2 3
4 5 6
7 8 9

this is like the “Watch Window" in visual studio, action occurred when F10, F11 is clicked

Maybe one week is more than enough, but I am so lazy now!
If you want to do this helpful work, you can ask me for original code which is written by me. A viewable soduku with GT,LT,Killer, GT/Killer,
and graphics library, animation library, photoshop libraray, text libraray(simplified)

by **emerentius_** » Thu May 31, 2018 4:53 pm

champagne wrote:I don't understand "branch misprediction". This code has been studied by many guys (and is directly derived from the original code) The inside code is activated if and only if an assignment took place in the update sequence just above. Even using the pre processor, it contains about 10 instructions.

At a branch the cpu may try to predict the correct path and will speculatively execute it. If it guessed wrong it has to flush the pipeline and revert all changes. A correct guess makes the branch almost free, an incorrect guess is very costly. There are three such if-blocks inside another if-block. If the outer if is entered, then at least one of the inner ones must also be entered but not necessarily all of them. This seems hard to predict.
The branch condition only contains things that should already be readily available. I'm not an expert on this and it seems to me like it shouldn't need speculative execution but my measurements show a definite improvement in branch misses.

Here are some high level stats collected with perf (linux profiling tool). The numbers for instructions, branches and misses are cumulative. Execute an instruction twice, it counts twice.
My solver in the 2nd column has some slight differences but is mostly equivalent to the original. In my solver in the 3rd column the innermost if-checks in the update function were removed.
Note how the removal of the branches drops the branch misses by 1/3 and how the instructions/cycle jump up. Of course, more instructions are executed but the total speed increases.

Code: Select all: jczsolve mine mine with change sudoku17 time [s] 0.202 0.204 0.182 instructions 683,716,125 744,529,659 825,465,246 instr/cycle 1.16 1,25 1.5 branches 110,258,289 94,151,698 95,898,610 misses 10,447,095 10,346,431 7,355,023 top50k time [s] 0.34 0.335 0.294 instructions 1,060,183,483 1,137,223,944 1,221,260,048 instr/cycle 1.08 1,13 1.41 branches 170,937,968 143,608,199 145,150,639 misses 19,495,500 19,436,784 13,925,122 hard20x500 time [s] 1.646 1.665 1.536 instructions 6,226,479,344 6,650,678,851 6,905,523,959 instr/cycle 1,28 1,35 1,53 branches 1,015,336,637 923,752,040 913,476,946 misses 82,979,035 83,671,900 61,916,728

It's not a complete apples to apples comparison because my solver loads the file completely in advance but my parser does more work. Parsing is not a significant part of the total runtime though.
The jczsolve version I'm using is from Jason Lion's post on p15 (direct link).
I've compiled it with clang5 which makes it a couple % faster than clang 4, as I've noted before. The rust compiler version is 1.27 (nightly, 2018-05-16). Rust uses clang6 already but the big improvement came with clang5.

by **champagne** » Thu May 31, 2018 7:09 pm

I never fight again facts, so this is something to test on my computer later and to try to understand

by **emerentius_** » Wed Jun 06, 2018 4:02 am

I've changed the heuristic in GuessFirstCell to always check the first cell for each of the 3 bands and to choose the cell with the minimum number of possibilities (rather than stop at the first existing unsolved cell). This has no effect on easy sudokus where GuessFirstCell is never called but speeds up very hard sudokus on average by over 10%. I've checked it with your collection of potential hardest from 2017-12. The guess function is also sometimes called in sudoku17 but has no meaningful effect on speed there.

I'm hosting my code here, in case someone is interested. This is the solver's source file and this is the command line program I'm using for benchmarks.

Here are some updated timings (all times in seconds):

Code: Select all: 10x_hard375.txt 500x_hard20.txt 10x_top1465.txt sudoku17.txt top50k.txt jczsolve 0.834 1.625 0.286 0.201 0.341 rusty jczsolve 0.684 1.350 0.252 0.177 0.286 ratio 1.22 1.2 1.13 1.14 1.19

by **champagne** » Wed Jun 06, 2018 6:42 am

emerentius_ wrote:I've changed the heuristic in GuessFirstCell to always check the first cell for each of the 3 bands and to choose the cell with the minimum number of possibilities (rather than stop at the first existing unsolved cell). This has no effect on easy sudokus where GuessFirstCell is never called but speeds up very hard sudokus on average by over 10%. I've checked it with your collection of potential hardest from 2017-12. The guess function is also sometimes called in sudoku17 but has no meaningful effect on speed there.

I'm hosting my code here, in case someone is interested. This is the solver's source file and this is the command line program I'm using for benchmarks.

Not a very common case, but one could also extract the cell with the smallest number of candidates using the parallel process as in the search of singles in cells, bi-values in cell ... with a good chance to speed up the process. (4 candidates in cell should be always possible, so it is 3 or 4 candidates)

I tested also a morphing of the puzzle, but with nothing exciting

by **emerentius_** » Sun Jun 24, 2018 4:13 am

I went over the entire solver to clean up the code and understand it better. I've improved the variable names, simplified the code and added a lot of comments. I could get rid of the manual loop unrolling in ApplySinglesOrEmpty and UPWCL. I've deleted TblRowUniq and I'm using TblShrinkMask in its stead. Several more LUTs turned out to be not really helping earlier.
In total, the solver's file is now at 660 lines of code (a good deal of that just boiler plate) as opposed to jczsolve's 950 (or 800 without the code commented out via #if 0). I never figured out what UPDN or UPWCL was supposed to stand for. :lol:

I think it's much easier to understand now.

As I wrote earlier, I've removed the innermost branches in Update. I've also made some changes to how I unset the bits in unsolvedRows after UPWCL where I've used xor instead of and-masking. At first I didn't get why it worked. Now I understand, what I've been doing was complete bogus but also harmless because, worst case, the solver just falls back to guessing. That indicated however, that the unsolvedRows checks were not that important at all anymore with the innermost if-checks already removed. I've removed any use of unsolvedRows which actually sped up the solver a bit more. All the changes during the cleanup together improved performance by another 4-5%.
I still have to test how that affected solving speed on impossible sudokus. I don't have a data set handy.

In the process of removing unsolvedRows, I've run into a very weird performance issue. This if-check is always true because I never delete any bits from unsolvedRows but the optimizer can't recognize that and therefore doesn't remove the branch. If I remove the if-branch myself or replace it with something predictable, I lose 5%(!) performance. I have no idea why that is. Edit: Some profiling leads me to believe that the if check that the optimizer can't see through leads it to be more conservative with optimizations that would increase the amount of code. Without the branch, the amount of instructions executed rises by 15%. Instructions / cycle goes up as well but not enough. Branch misses stay roughly the same.

Code still available at the Github repo

by **dobrichev** » Sun Jun 24, 2018 7:53 pm

emerentius_ wrote:I never figured out what UPDN or UPWCL was supposed to stand for.
...
Code still available at the Github repo

And that did not stop you from licensing the code without even mentioning the original author and other contributors?

The New Sudoku Players' Forum

3.77us Solver(2.8G CPU, TestCase:17Sodoku)

Re: 3.77us Solver(2.8G CPU, TestCase:17Sodoku)

Re: 3.77us Solver(2.8G CPU, TestCase:17Sodoku)

Re: 3.77us Solver(2.8G CPU, TestCase:17Sodoku)

Re: 3.77us Solver(2.8G CPU, TestCase:17Sodoku)

Re: 3.77us Solver(2.8G CPU, TestCase:17Sodoku)

Re: 3.77us Solver(2.8G CPU, TestCase:17Sodoku)

Re: 3.77us Solver(2.8G CPU, TestCase:17Sodoku)

Re: 3.77us Solver(2.8G CPU, TestCase:17Sodoku)

Re: 3.77us Solver(2.8G CPU, TestCase:17Sodoku)

Re: 3.77us Solver(2.8G CPU, TestCase:17Sodoku)

Re: 3.77us Solver(2.8G CPU, TestCase:17Sodoku)

Re: 3.77us Solver(2.8G CPU, TestCase:17Sodoku)

Re: 3.77us Solver(2.8G CPU, TestCase:17Sodoku)

Re: 3.77us Solver(2.8G CPU, TestCase:17Sodoku)

Re: 3.77us Solver(2.8G CPU, TestCase:17Sodoku)