The New Sudoku Players' Forum

by **PIsaacson** » Wed Dec 08, 2010 8:03 pm

I've included formatting options for each variable (%r %d %p% ...) such that you can specify printf width.precision options to force rounding and avoid scores like N.999999. The default format is "%81.81g;%.1r/%.1d/%.1p" which trims the original puzzle display to just 81 characters in case there's extraneous info after, then specifies a 1 digit precision after the decimal for the rating/diamond/pearl scores by converting them to a %.1f formatter.

From my help display:

Code: Select all: Paul@pgisaacs2 ~/se_cpp $ sudoku --help Version - Alpha (x86_64-w64-mingw32-g++) Sudoku Explainer C++ engine ver 0.3 usage - sudoku -bvdmp -f: -c: -a: -h: -r: -l: -e: -n: -q: s: -i: -o: [< puzzles_in] [> solution_log] [2> statistics_log] --help print the help info -b --validate_moves perform validation of eliminations/assignments - default(no) -v --verbose print all the individual solution steps - default (no) -d --debug enables printing internal debug info tables - default (no) -m --minlex convert puzzle to minlex before processing - default (no) -p --pm_grid input is a single PM grid puzzle - default (no) -c N --se_conforming=N set the degree of SE conformity 0 = none - default (1) -a N:N set the als min:max candidate sizes used in chaining group extensions - default (3:5) --als_min=N set the als min candidate size separately --als_max=N set the als max candidate size separately -h N:N set the ahs min:max candidate sizes used in chaining group extensions - default (3:5) --ahs_min=N set the ahs min candidate size separately --ahs_max=N set the ahs max candidate size separately -l N:N set the aals min:max candidate sizes used in chaining group extensions - default (3:5) --aals_min=N set the aals min candidate size separately --aals_max=N set the aals max candidate size separately -r N:N set the arc min:max candidate sizes used in chaining group extensions - default (3:5) --arc_min=N set the arc min candidate size separately --arc_max=N set the arc max candidate size separately -e N --exiting=N set the exiting criteria - default (0) -n N:N set the chain length min:max sizes used during BFS chain development - default (4:32) --chain_min=N set the chain length min size --nrc_min=N alternate for setting the chain length min size --chain_max=N set the chain lenght max size --nrc_max=N alternate for setting the chain length max size -q N --z_target=N used for debugging - input a single nRrCc z-target to test - default (-1) -s N --seed=N a non-zero seed value will randomize the graph edges for minimal tree spanning - default (0) -i F --input=F override stdin with a specific file - default (NULL) -o F --output=F override stdout with a specific file - default (NULL) -f S --format=S specify the output formatting string - default (%81.81g;%.1r/%.1d/%.1p) The input/output params require a valid filename - if spaces are embedded, then quote the filename The --format string uses the following specifiers: %g - original puzzle input line %s - 81 char puzzle solution %n - puzzle sequential ordinal position wiithin a collection %r - puzzle overall rating %d - puzzle diamond score %p - puzzle pearl score %e - puzzle solution elapsed time in msecs %% - a single literal '%' percent sign chars - any string of ascii characters may be embedded within the format at any position The specifiers may optionally include standard printf modifiers [+|-|0][size][.precision]

I have some test code that allows a per technique score calculation using the variables $depth for the current depth parameter passed, $base for the current base score from the config file and $size for the number of cells defining the pattern. So there could be an entry for each line/technique in the sudoku.ini that contains something like "$base + (($depth + 1) * 0.1) + ($size * 0.1)" which is approximately the ur_loop_type_3 internal calculation. But in testing this approach with a standard interpreter library, it greatly decreased the speed of the fastest and most often called techniques such as hidden_singles so I'm somewhat loathe to offer this as an option. That means the scoring is locked into using SE algorithms, which is more of an issue with me than whether or not to use floating point for the scoring variables. I'm still looking into various options for offering a more flexible scoring system by compiling the equations while reading/processing the sudoku.ini configuration file and placing the code in a dll which I then dynamically load, but it's on my back-burner for now.

Cheers,
Paul

by **gsf** » Thu Dec 09, 2010 12:10 am

PIsaacson wrote:I've included formatting options for each variable (%r %d %p% ...) such that you can specify printf width.precision options to force rounding and avoid scores like N.999999. The default format is "%81.81g;%.1r/%.1d/%.1p" which trims the original puzzle display to just 81 characters in case there's extraneous info after, then specifies a 1 digit precision after the decimal for the rating/diamond/pearl scores by converting them to a %.1f formatter.

I've been watching in the background for a while
should have some time over the holidays

it would be good to eliminate floating point for score computation
internally multiply all ratings by 10 and keep them as integers
on output do a pseudo float format "%d.%d", rating/10, rating%10
this will eliminate system dependent rounding differences

by **PIsaacson** » Thu Dec 09, 2010 10:06 am

Glenn,

The library I'm using to test ad-hoc scoring equations is: http://warp.povusers.org/FunctionParser/

I've played with it for a few weeks, and I think the default double FunctionParser template class is decently fast and offers a better range of builtin functions than the long int FunctionParser template class. What's interesting about this particular library is that it converts the equations into pseudo-code, it contains an optimizer, and it can call external C/C++ functions. Considering that double offers about 16 digits of significance, I don't see where there would be extreme problems with accuracy of scoring.

My objection to using fixed point math is that you have to manually track the decimal point whenever you use multiplication or division and rescale back to your assumed decimal point. Exponents, log, pow... none of these are even operational with the long int templates, and I'm not sure how you could even handle the scaling correctly for these types of functions.

While emulating 2 decimal places, whenever you multiply A * B, you have to divide by 100. Conversely, before you divide A / B, you need to multiply A by 100. Not necessarily a show stopper, but isn't floating point arithmetic much easier to use??? Especially since upon displaying the results with a specified precision ala %.2f, you'll ensure than the RTL rounds correctly. Unless there is some major difference in how some obscure OS's RTL handles this???

What am I missing here???
Paul

by **JasonLion** » Thu Dec 09, 2010 2:15 pm

I have a couple of comments on how you could make things a little more general and a little faster.

PIsaacson wrote:
Code: Select all
10 char given[9][9]; // values base 1 with zero being an un-assigned cell 11 12 int rc_cands[9][9]; // RC space with bits 0-8 used for candidates 13 int rn_cands[9][9]; // RN space with bits 0-8 used for col indexes 14 int cn_cands[9][9]; // CN space with bits 0-8 used for row indexes 15 int bn_cands[9][9]; // BN space with bits 0-8 used for box offset indexes // 0-2 for upper row 3-5 for middle row and 6-8 for bottom row

I store given as a single dimensional array indexed by cell number instead of by row/col. Single dimensional arrays are slightly faster. It is less important, but I also store givens as a C string in ASCII in the same format puzzles are usually exchanged, ie ".7.3.5.242......3.....2618..5...7.18..2...76.79.6...4.163974852.8....47.4275.8.9." instead of in binary. This provides a tiny time savings and also seems to ease debugging.

I combine rn_cands, cn_cands, and bn_cands into a single array that gives information about all houses, essentially hn_cnads[27][9]. Some algorithms are easier to write iterating over houses instead of rows/columns/blocks. It is simple enough to provide defines that emulate your naming, so code which prefers separate rows/columns/blocks can still run even after this change.

I use a typedef for the type of all of the bitmaps you are storing in ints. On most machines int is the best thing to use, but occasionally unsigned int or unsigned short is more efficient. A typedef allows the code to be adapted to those platforms very easily.

I have an additional array containing a mixture of givens and solved cells in bitmap form. You might call it "int known[81];". Each cell contains either zero if it is not yet solved, or a bitmap with one bit set for the single digit it contains (either a clue or a cell already solved). I also have another array with one int/bitmap per house containing a bitmap of which digits have yet to be placed in that house.

I also provide some more abstract interfaces to all of this data for use by some of the most complex techniques. For example, it is handy to be able to make bit vectors with one bit per cell and do various logical operations on those.

Finally, some people track tabling information incrementally as well, though I do not. Finding mutant fish, which I don't think SE supports, benefits from incrementally maintain tabling information.

PIsaacson wrote:
Code: Select all
4) Fitting a technique into the framework after writing it requires: a) Modifying sudoku.cpp to #include the source and adding lines to initialize the proc_tbl and to create the statistics functions b) Modifying grid.cpp to include the necessary method definitions c) Modifying grobal.h to include the necessary macros to define the statistics counters

I find this approach to be a little error prone, especially for a team project. It is simpler if you make the solving techniques be pure functions that take a grid as an argument. That way the class definition for grid doesn't need to be changed when a new technique is added. I would go beyond that and make a class that tracks all of the available techniques. A statically initialized object of that class for each technique can be made to register the technique into the master list of techniques. This allows 100% of the code related to a specific solving technique to be contained in it's own source file. Which techniques are linked in then controls the master list of available techniques. Another layer on top of that would edit/order/rank this list for each run depending on a configuration file.

by **gsf** » Thu Dec 09, 2010 3:21 pm

PIsaacson wrote:While emulating 2 decimal places, whenever you multiply A * B, you have to divide by 100. Conversely, before you divide A / B, you need to multiply A by 100. Not necessarily a show stopper, but isn't floating point arithmetic much easier to use??? Especially since upon displaying the results with a specified precision ala %.2f, you'll ensure than the RTL rounds correctly. Unless there is some major difference in how some obscure OS's RTL handles this???

What am I missing here???

not missing anything about full blown floating point
but aren't the rating operations:
* convert from string to internal
* convert from internal to string
* add/subtract two ratings
* sort/compare ratings
* multiply a rating by an integer (covered by addition)

from an abstract view, where would it make sense to multiply, divide, sin() etc. ratings?

from an implementation view, you will never run into comparison anomalies with pseudo-floating point
with real floating point there will be some architecture somewhere that will lose enough
precision such that x.0 != x.0 or a value lists as x.99999...
this is what we saw with SE vs some java runtimes

by **JasonLion** » Thu Dec 09, 2010 3:57 pm

Here is a more specific example of where a floating point problem could come up in a rating situation. Consider two techniques, one of which has a base rating of 5.0 and in this example has three modifiers that each add 0.1 to the rating, while the other technique has a rating of 5.3 with no modifiers. These two numbers, 5.0+0.1+0.1+0.1 and 5.3, may or may not be equal to each other. On two different platforms it is possible for those two numbers to be equal, ranked in one order, or ranked in the other order. Should this come up, the rating program might end up picking different techniques at that rating step, which could cause it to take different solving paths and thus give significantly different ratings to the puzzle on two different platforms.

by **champagne** » Thu Dec 09, 2010 4:08 pm

JasonLion wrote:Here is a more specific example of where a floating point problem could come up in a rating situation. Consider two techniques, one of which has a base rating of 5.0 and in this example has three modifiers that each add 0.1 to the rating, while the other technique has a rating of 5.3 with no modifiers. These two numbers, 5.0+0.1+0.1+0.1 and 5.3, may or may not be equal to each other. On two different platforms it is possible for those two numbers to be equal, ranked in one order, or ranked in the other order. Should this come up, the rating program might end up picking different techniques at that rating step, which could cause it to take different solving paths and thus give significantly different ratings to the puzzle on two different platforms.

I am highly surprised to see such a long discussion on that topic.

I switched without any problem to an integer "rating = 10* SE rating". The problem of the division comes once per puzzle at print time, which is nothing.

I would have done it even if no problem had been reported.

champagne

by **daj95376** » Thu Dec 09, 2010 10:13 pm

If SE's print function doesn't round on an output format of xx.x, then "5.0 + 0.1 + 0.1 + 0.1" will be 5.2. Converting to a 10* integer as gsf suggests will result in " 50 + 1 + 1 + 1" = 53 with an output of 5.3. Anyone see a problem here?

If SE's print function does round on an output format of xx.x, then it's going to produce 5.3 as well. It will take an awful lot of "+0.1" additions to get to a point where the rounded output would be incorrect in an output format of xx.x.

That said, I favor using an integer over floating point whenever possible.

by **dobrichev** » Thu Dec 09, 2010 10:19 pm

JasonLion wrote:I store given as a single dimensional array indexed by cell number instead of by row/col. Single dimensional arrays are slightly faster.

((char(*)[81])given)[i] can be used where appropriate.

by **PIsaacson** » Thu Dec 09, 2010 10:41 pm

Jason,

Thanks for the comments and suggestions. I've flipped back and forth over the years on using int x[81] vs int x[9][9], but extensive benchmarking has never shown any significant difference in timing techniques to convince me that one is absolutely better than the other. The one advantage of the [9][9] layout is that you can always use int *x = &array[0][0] and process the flattened view of the array very easily. My belief is that focusing on such low level implementation details rarely improves performance as significantly as focusing on the primary algorithms themselves.

From Knuth's Structured Programming with go to Statements

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3 %. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified. It is often a mistake to make a priori judgments about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail.

Since Knuth's like a god of software, I've listened, believed, and let gprof be my guide on where to optimize and what low hanging fruit to pluck.

As for integer vs floating point: Perhaps I didn't make my point sufficiently clear about why I'm using doubles now. I looked and looked and selected what I thought was the best C++ function parser library available, but its long int implementation doesn't work well enough to consider using it as a general purpose "role your own and write your own equations". If there is something better available for free/public use, please reply with the url and I'll investigate that library.

My intent is not just to emulate SE, but to provide a framework that's flexible and provides alternative scoring algorithms without re-compiling. So what happens if I want to try increments of 1/32 or 0.03125 to preclude scoring overlap with an int implementation??? The easy (?) answer is to include a parameter specifying the precision, but that's already there with my enhancement of the --format options with floating point.

I don't understand this dread of floating point. Surely using various epsilon comparison algorithms obviates the concerns about equality and inequality??? Again, if anyone can provide a parsing library the works with fixed point as well as the fparser referenced above, I'll be willing to re-consider...

Cheers,
Paul

by **daj95376** » Thu Dec 09, 2010 11:01 pm

PIsaacson wrote:So what happens if I want to try increments of 1/32 or 0.03125 to preclude scoring overlap with an int implementation???

I believe gsf's statement can be changed to: 32* integer with pseudo float format "%d.%d", rating/32, rating%32

by **ronk** » Thu Dec 09, 2010 11:19 pm

PIsaacson wrote:My intent is not just to emulate SE, but to provide a framework that's flexible and provides alternative scoring algorithms without re-compiling. So what happens if I want to try increments of 1/32 or 0.03125 to preclude scoring overlap with an int implementation???

Granularity only a nerd could love, me thinks.

by **PIsaacson** » Fri Dec 10, 2010 2:57 am

Here's a simple test of the equation you need to emulate SE's chain length addition to a base score:

Code: Select all: #include <stdio.h> #include <math.h> double get_score (double x_in) { return (0.2 * log2 (x_in) - 0.4); } int main () { double steps[] = {4,6,8,12,16,24,32,48,64,96,128,192,256,384,512,768,1024,1536,2048,3072,4096,6144,8192}; for (int n = 0; n <= 22; n++) { double score = get_score (steps[n]); printf ("%4d %6.0f %6.1f\n", n, steps[n], score); } }

The (0.2 * log2 (x_in) - 0.4) is impossible (or at least extremely difficult - probably need a polynomial???) to duplicate using integer arithmetic as an equation. It requires a special function that loops and tests whether or not the chain length is between any of the steps and adds 0.1 for every "step" that it takes along the way. Change the format to include additional fractional digits if you want, but the problem isn't with the equation or floating point, it's with the use of the split point in the log curve by doubling the 6 series as if it always exactly split the 4 series (which follows the powers of 2 very nicely 4..8..16..32..64..128..256..512..1024..2048..4096..8192).

The 32nd example is lame, but if you wanted 5 fractional decimal places or 3 or something other than 1, floating point is probably a better choice.

Cheers,
Paul

by **ronk** » Fri Dec 10, 2010 4:03 am

PIsaacson wrote:Here's a simple test of the equation you need to emulate SE's chain length addition to a base score:
Code: Select all
#include <stdio.h> #include <math.h> double get_score (double x_in) { return (0.2 * log2 (x_in) - 0.4); }

I thought Explainer was using something virtually identical to:

Code: Select all: double getLengthDifficulty(int length) { double added = 0.0; int ceil = 4; int isOdd = false; while (length > ceil) { added += 0.1; if (!isOdd) ceil = (ceil * 3) / 2; else ceil = (ceil * 4) / 3; isOdd = !isOdd; } return added; }

Until a scoring match with Explainer is obtained, wouldn't it be safer to use the exact same thing?

by **daj95376** » Fri Dec 10, 2010 4:22 am

[Withdrawn]

The New Sudoku Players' Forum

Team project: C or C++ Explainer-like rating program

Re: Team project: C or C++ Explainer-like rating program

Re: Team project: C or C++ Explainer-like rating program

Re: Team project: C or C++ Explainer-like rating program

Re: Team project: C or C++ Explainer-like rating program

Re: Team project: C or C++ Explainer-like rating program

Re: Team project: C or C++ Explainer-like rating program

Re: Team project: C or C++ Explainer-like rating program

Re: Team project: C or C++ Explainer-like rating program

Re: Team project: C or C++ Explainer-like rating program

Re: Team project: C or C++ Explainer-like rating program

Re: Team project: C or C++ Explainer-like rating program

Re: Team project: C or C++ Explainer-like rating program

Re: Team project: C or C++ Explainer-like rating program

Re: Team project: C or C++ Explainer-like rating program

Re: Team project: C or C++ Explainer-like rating program