How to represent a sudoku with deleted candidates as string?

Everything about Sudoku that doesn't fit in one of the other sections

Postby daj95376 » Mon Feb 12, 2007 7:32 am

Okay, let's see where this goes. Start with a puzzle presented by Keith.

Code: Select all
076090080002003009030600000100500000069020430000006008000001050600200800020050170

Now, move along to where he presents the following PM.

Code: Select all
+----------------+----------------+----------------+
| 5    7    6    | 4    9    2    | 3    8    1    |
| 48   148  2    | 78   178  3    | 5    6    9    |
| 9    3    18   | 6    18   5    | 7    4    2    |
+----------------+----------------+----------------+
| 1    48   348  | 5    348  9    | 6    2    7    |
| 78   6    9    | 1    2    78   | 4    3    5    |
| 2    45   3457 | 37   347  6    | 9    1    8    |
+----------------+----------------+----------------+
| 478  9    478  | 38   6    1    | 2    5    34   |
| 6    15   15   | 2    37   47   | 8    9    34   |
| 3    2    48   | 9    5    48   | 1    7    6    |
+----------------+----------------+----------------+

At this point, we can represent the PM as an 81-char sequence -- sans elimination information.

Code: Select all
576492381002003569930605742100509627069120435200006918090061250600200890320950176

Now, create a PM of the above sequence and include the elimination information.

Code: Select all
*-----------------------------------------------------------*
| 5     7     6     | 4     9     2     | 3     8     1     |
| 48    148   2     | 78    178   3     | 5     6     9     |
| 9     3     18    | 6     18    5     | 7     4     2     |
|-------------------+-------------------+-------------------|
| 1     48    348   | 5     348   9     | 6     2     7     |
| 78    6     9     | 1     2     78    | 4     3     5     |
| 2     45    3457  | 37    347   6     | 9     1     8     |
|-------------------+-------------------+-------------------|
| 478   9     478   | 38-7  6     1     | 2     5     34    |
| 6     15-4  15-47 | 2     37-4  47    | 8     9     34    |
| 3     2     48    | 9     5     48    | 1     7     6     |
*-----------------------------------------------------------*

Q: How do we add elimination information to the second sequence?
A: By replacing <0> with uppercase and lowercase letters {a..Ia..i} in cells with eliminations.

Code: Select all
                                                         *      *** *
                                                         7      447 4
576492381002003569930605742100509627069120435200006918090G612506DDg2D0890320950176

Uppercase letters always start eliminations for a cell. If more than one elimination exists for a cell, then the additional eliminations are represented by lowercase letters.

Only cells with multiple eliminations will ever increase the length of the sequence. I don't claim this approach is suitable for all PMs. However, it may be suitable for the vast majority.

Note: I have no idea why anyone would prefer a compressed PM format over the original, multi-line format. There's no way to *mark* the PM with other information.
daj95376
2014 Supporter
 
Posts: 2624
Joined: 15 May 2006

Postby Stephan Heuscher » Mon Feb 12, 2007 7:15 pm

udosuk wrote:Personally I read and understand the Chinese characters well, so it looks less awkward to me...:)

I was worried that people that can read Chinese will get more confused than those who can't. Those who can't will hopefully just see the Chinese chars as placeholder for something (which it is).

udosuk wrote:But memory-wise, because each Unicode Chinese character is worth 2 ASCII-chars i.e. 16 bits, you're practically wasting 7 bits per cell, which isn't too efficient IMHO...

If those 7 bits are all it takes, I'm willing to sacrifce them gladly ;-)

udosuk wrote:If you're only worrying about presenting this to the users in a human-readable way, it works well for me, but I'm not sure about others...:?:

Nice to hear that you like it.

My two main concerns for the encoding are the communication of sudokus between sudoku applications fed by user-interaction (i.e., copy-paste) and backwards comatibility for the "81 character to sudoku" parsers (this includes humans ;-).

udosuk wrote:Notice the English chars and Chinese chars are displayed in a different width on screen, so you probably should also use the Chinese full-width chars for the digits and the dots/zeros...

This would kill all backwards compatibility. I would propose not to format the string representing the sudoku directly...
Last edited by Stephan Heuscher on Mon Feb 12, 2007 5:16 pm, edited 1 time in total.
Stephan Heuscher
 
Posts: 11
Joined: 19 February 2006

Postby Stephan Heuscher » Mon Feb 12, 2007 7:19 pm

daj95376 wrote:...
Q: How do we add elimination information to the second sequence?
A: By replacing <0> with uppercase and lowercase letters {a..Ia..i} in cells with eliminations.

Code: Select all
                                                         *      *** *
                                                         7      447 4
576492381002003569930605742100509627069120435200006918090G612506DDg2D0890320950176

Uppercase letters always start eliminations for a cell. If more than one elimination exists for a cell, then the additional eliminations are represented by lowercase letters.

Only cells with multiple eliminations will ever increase the length of the sequence. I don't claim this approach is suitable for all PMs. However, it may be suitable for the vast majority.

Note: I have no idea why anyone would prefer a compressed PM format over the original, multi-line format. There's no way to *mark* the PM with other information.

That's a great idea!

The only reason I won't go implementing it is that it's not backwards compatible; "old" 81 character parsers will stumble.

In terms of readability your solution is better than mine.
Stephan Heuscher
 
Posts: 11
Joined: 19 February 2006

Postby udosuk » Tue Feb 13, 2007 4:26 am

I need some time to digest your concept of "backwards compatibility"...

Just want to comment that I can easily copy and paste the Unicode Chinese characters because I set up my system to be able to read Chinese... I'm not sure everybody use the same configuration and they could only see "question marks" when they try to copy & paste the Chinese chars from, say a browser to a text editor...:idea:

Danny's scheme looks good to me... I'm sure it's not too difficult to write a parser to backward decode the sequence of text back to the sudoku state... It's not easy to work out an upper bound for the length of these sequences though (100? 200?)...
udosuk
 
Posts: 2698
Joined: 17 July 2005

Postby Smythe Dakota » Tue Feb 13, 2007 5:20 am

If you want backwards compatibility, just use the first 81 characters the "old" way, and put any extra information after that. Odds are "old" software will just ignore anything past 81. If you want to increase your odds even a little more, put an end-of-file marker (ASCII code 26) at position 82. (Or at 84, with a CR/LF at 82 and 83.)

The candidates should be easily taken care of with an additional 162 bytes (81 byte pairs). The scheme could be something like this:

First byte of each pair:
0 -- None of the following (or the cell is a given).
1 -- 1 is a candidate.
2 -- 2 is a candidate.
3 -- 3 is a candidate.
4 -- 4 is a candidate.
5 -- 5 is a candidate.
A -- 1 and 2 are candidates.
B -- 1 and 3 are candidates.
C -- 1 and 4 are candidates.
D -- 1 and 5 are candidates.
E -- 2 and 3 are candidates.
F -- 2 and 4 are candidates.
G -- 2 and 5 are candidates.
H -- 3 and 4 are candidates.
I -- 3 and 5 are candidates.
J -- 4 and 5 are candidates.
K -- 1, 2, and 3 are candidates.
L -- 1, 2, and 4 are candidates.
M -- 1, 2, and 5 are candidates.
N -- 1, 3, and 4 are candidates.
O -- 1, 3, and 5 are candidates.
P -- 1, 4, and 5 are candidates.
Q -- 2, 3, and 4 are candidates.
R -- 2, 3, and 5 are candidates.
S -- 2, 4, and 5 are candidates.
T -- 3, 4, and 5 are candidates.
U -- 1, 2, 3, and 4 are candidates.
V -- 1, 2, 3, and 5 are candidates.
W -- 1, 2, 4, and 5 are candidates.
X -- 1, 3, 4, and 5 are candidates.
Y -- 2, 3, 4, and 5 are candidates.
Z -- 1, 2, 3, 4, and 5 are candidates.

Second byte of each pair:
0 -- None of the following (or the cell is a given).
6 -- 6 is a candidate.
7 -- 7 is a candidate.
8 -- 8 is a candidate.
9 -- 9 is a candidate.
a -- 6 and 7 are candidates.
b -- 6 and 8 are candidates.
c -- 6 and 9 are candidates.
d -- 7 and 8 are candidates.
e -- 7 and 9 are candidates.
f -- 8 and 9 are candidates.
g -- 6, 7, and 8 are candidates.
h -- 6, 7, and 9 are candidates.
i -- 6, 8, and 9 are candidates.
j -- 7, 8, and 9 are candidates.
k -- 6, 7, 8, and 9 are candidates.

That'll do it, compatibly, in 243 bytes (246 if you throw in a CR/LF and EOF).

Bill Smythe
Smythe Dakota
 
Posts: 534
Joined: 11 February 2006

Postby Stephan Heuscher » Tue Feb 13, 2007 10:34 pm

Smythe Dakota wrote:If you want backwards compatibility, just use the first 81 characters the "old" way, and put any extra information after that. Odds are "old" software will just ignore anything past 81. If you want to increase your odds even a little more, put an end-of-file marker (ASCII code 26) at position 82. (Or at 84, with a CR/LF at 82 and 83.)

The candidates should be easily taken care of with an additional 162 bytes (81 byte pairs). The scheme could be something like this:

First byte of each pair:
0 -- None of the following (or the cell is a given).
1 -- 1 is a candidate.
2 -- 2 is a candidate.
3 -- 3 is a candidate.
4 -- 4 is a candidate.
5 -- 5 is a candidate.
A -- 1 and 2 are candidates.
B -- 1 and 3 are candidates.
...


It's a great idea to put the extra information at the end with some kind of separating character. I'm just not quite sure how to encode the deleted numbers... I'll think about it, implement it and then report back here:)

Thanks a lot to all of you for helping me on this!

Stephan
Stephan Heuscher
 
Posts: 11
Joined: 19 February 2006

New human readable and backwards compatible encoding

Postby Stephan Heuscher » Thu Feb 15, 2007 9:29 pm

Taking the idea from Smythe Dakota, i added a few sprinkles of readability and got:
Code: Select all
00001203400005607800-000000000-00000410000085290000067000000000120039000980047000 -34-189


Every "-" in the first string indicates that numbers were deleted from the sudoku manually and it references the numbers by index in the second string (having a "-" as separator). Therefore, number-cell of the first "-", the numbers 3 and 4 were deleted; in that of the second "-" the numbers 1, 8, and 9.
Code: Select all
                    -34       -189
00001203400005607800-000000000-00000410000085290000067000000000120039000980047000

Besides being human readable, this form of encoding also allows for other than 9x9 sudokus to be encoded.

What do you think of this encoding?

Stephan
Stephan Heuscher
 
Posts: 11
Joined: 19 February 2006

Postby Smythe Dakota » Sat Feb 17, 2007 7:25 pm

Well, I guess how you encode it depends on what you want.

Originally, I had thought the original poster wanted a way to do it in 81 bytes or fewer, so I devised an incomprehensible bit-bucket nerdy way to do that.

Then, backward compatibility became the objective, so I came up with a more readable way to do it in 243 bytes.

However, there still might be a compatibility problem, as well as a readability problem, so here's yet a third possibility. It'll take a lot more than 243 bytes, though.

First 81 bytes: the regular puzzle, in the usual way. Then, a CR/LF.

Next, a new line for each row. The first row might be:

1247+489+3+258+579+19+467+56789+149 plus CR/LF

-- with a plus sign (or some other symbol) separating each cell from the next.

Filled-in cells could be denoted with single digits (like the 3 above) or by a zero. Zero could also be used to indicate no candidates (illegal puzzle).

If this proliferation of digits is at odds with backward compatibility (because some software may expect exactly 81 digits), replace the digits (except the initial 81) with letters: A=1, B=2, .... I=9, Z=0.

If the plus signs cause a problem, eliminate them, and instead follow somebody's earlier suggestion to use uppercase and lowercase: AbdgDhiCBeh etc.

I guess it's all a question of backward compatibility vs readability vs file size. A scheme like the above would typically take 500 bytes or so.

Bill Smythe
Smythe Dakota
 
Posts: 534
Joined: 11 February 2006

Postby ronk » Sat Feb 17, 2007 8:30 pm

Has anybody proposed appending udosuk's suggestion of three octal digits per cell to the "standard" one line format?

Using daj95376's puzzle as an example ...
daj95376 wrote:
Code: Select all
+----------------+----------------+----------------+
| 5    7    6    | 4    9    2    | 3    8    1    |
| 48   148  2    | 78   178  3    | 5    6    9    |
| 9    3    18   | 6    18   5    | 7    4    2    |
+----------------+----------------+----------------+
| 1    48   348  | 5    348  9    | 6    2    7    |
| 78   6    9    | 1    2    78   | 4    3    5    |
| 2    45   3457 | 37   347  6    | 9    1    8    |
+----------------+----------------+----------------+
| 478  9    478  | 38   6    1    | 2    5    34   |
| 6    15   15   | 2    37   47   | 8    9    34   |
| 3    2    48   | 9    5    48   | 1    7    6    |
+----------------+----------------+----------------+

... the block would look like ...
Code: Select all
576492381..2..356993.6.57421..5.9627.6912.4352....6918.9..6125.6..2..89.32.95.176
020100040010400002004200001210211002300301004020040400400004201040201020100010002
001210214020214400040002100300040400001002300010004020002030134101114040400001200
310400310204040001002020014040021021002101110200400014004002210400020210001100040

... and always have 324 characters (not counting newlines (CR-LF)). The 1st 81 digits would be the starting grid, and the remaining 243 characters would be the pencilmarks. [edit: In the 1st 81 characters, the periods could optionally be replaced by zeroes, I suppose, but the periods would help locate the givens in a multi-puzzle file.]

Don't be surprised if you find errors in the above, as I did the conversion manually.
Last edited by ronk on Sat Feb 17, 2007 4:44 pm, edited 1 time in total.
ronk
2012 Supporter
 
Posts: 4764
Joined: 02 November 2005
Location: Southeastern USA

Postby daj95376 » Sat Feb 17, 2007 8:34 pm

Stephan, If you want backward compatibility with many solvers:

Code: Select all
00001203400005607800X000000000X00000410000085290000067000000000120039000980047000
#~ 34 189

576492381002003569930605742100509627069120435200006918090X612506XX2X0890320950176
#~ 7 4 47 4
daj95376
2014 Supporter
 
Posts: 2624
Joined: 15 May 2006

Postby r.e.s. » Sun Feb 18, 2007 12:49 am

Smythe Dakota wrote:The candidates should be easily taken care of with an additional 162 bytes (81 byte pairs). The scheme could be something like this:

First byte of each pair:
0 -- None of the following (or the cell is a given).
1 -- 1 is a candidate.
2 -- 2 is a candidate.
3 -- 3 is a candidate.
4 -- 4 is a candidate.
5 -- 5 is a candidate.
A -- 1 and 2 are candidates.
B -- 1 and 3 are candidates.
[...]

I'd been thinking along similar lines ... Something more like the following scheme might be somewhat neater (though equally inefficient in the final character-encoding): Use 10 bits for each cell, coding it as two 5-bit blocks, each block being represented by a symbol in a 32-symbol alphabet such as 0-9a-v:
Code: Select all
block-value:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
block-symbol: 
0 1 2 3 4 5 6 7 8 9 a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v

For example, a candidate grid that begins 1,8,59,246,3457,23567,2345679,..., where the 1, but not the 8, was given as a clue, would be encoded as a 162-symbol string that begins g1408g1a2s3mbu...

                  bits
cell content   X9876 54321  code
------------   ----- -----  ---- 
'1'(clue)      10000 00001   g1
'8'(nonclue)   00100 00000   40
'59'           01000 10000   8g
'246'          00001 01010   1a
'3457'         00010 11100   2s
'23567'        00011 10110   3m
'2345679'      01011 11110   bu

where the X-bit flags clue/nonclue.

With a one-byte character representing each 5-bit symbol, every candidate grid is neatly (though not efficiently) encoded as a unique string of 162 bytes (81 pairs).
r.e.s.
 
Posts: 337
Joined: 31 August 2005

Postby r.e.s. » Sun Feb 18, 2007 11:47 pm

Stephan,

For ease-of-interpretation by humans, your currently implemented scheme is nice, but there is a problem with it (assuming the aim is to represent the complete state of the puzzle at all times):

In your scheme, a '0' in the first 81 characters is what identifies a cell that had no given clue; then, as candidates are deleted in such a cell, the '0' becomes a '-' and in the post-81 portion of the string a corresponding '-' is followed by the deleted digits. The problem is that when such a cell is solved, the '-' is being replaced by the solution digit and the corresponding deletion info is removed from the post-81 portion.

Thus an essential distinction between a solved digit and a given clue digit is lost, and the complete state of the puzzle is no longer represented. A better idea would be to never remove any '-' in the first 81 characters, and when a cell is solved just change the corresponding '-' in the post-81 portion to a '+' (say) followed by the solution digit.

EDITS: Removed unnecessary references to uniqueness-strategies.
Last edited by r.e.s. on Mon Feb 19, 2007 3:29 am, edited 3 times in total.
r.e.s.
 
Posts: 337
Joined: 31 August 2005

Postby udosuk » Mon Feb 19, 2007 4:28 am

Ron, thanks for bringing my octal scheme into the scene...:)

I notice we use a different system: you take 1 as the rightmost bit and 9 as the leftmost, where I use the opposite...
So my encoding for Danny's grid would be:
Code: Select all
020 004 010 040 001 200 100 002 400
042 442 200 006 406 100 020 010 001
001 100 402 010 402 020 004 040 200
400 042 142 020 142 001 010 200 004
006 010 001 400 200 006 040 100 020
200 060 164 104 144 010 001 400 002
046 001 046 102 010 400 200 020 140
010 420 420 200 104 044 002 001 140
100 200 042 001 020 042 400 004 010

020004010040001200100002400042442200006406100020010001001100402010402020004040200
400042142020142001010200004006010001400200006040100020200060164104144010001400002
046001046102010400200020140010420420200104044002001140100200042001020042400004010

With the basis of this octal system, we can use this "18 bits for each pair of cells, 64 symbols" scheme:

We can use 0..9 and then A..Z, a..z, %, & as the 64 symbols...

So A..Z represent 10..35, a..z represent 36..61, % represents 62, & represents 63...

Then 3 of these symbols can represent 2 cells, and Danny's grid can be encoded as:
Code: Select all
20410W0A0802W0YaI00q680G101090WG8WGG0WWG404HY21Y088G040m80C0G
0641022061q8Xa101W024m14n2140G0GC08Y4GG144W209W8204G120YW0410

A total of 122 characters... Some improvement on the original 243-char scheme...:)

BTW I like the "10 bits for each cell, 32 symbol" scheme from r.e.s. better, because of the niceness of having 1 extra bit for each cell denoting if it's an initial given or not... However if I have a choice I'll use ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_` as the 32 symbols because of their continuity on the ASCII table...:idea:
udosuk
 
Posts: 2698
Joined: 17 July 2005

Postby r.e.s. » Mon Feb 19, 2007 5:52 am

udosuk,

Note that we could (rather awkwardly) encode the complete state in 135 bytes, because a 10-bits-per-cell encoding amounts to 81*10=135*6 bits: Just run together all 81 10-bit blocks, and read the resulting 810 bits as a sequence of 135 6-bit blocks, each block to be represented by a symbol in an alphabet of 64 printable characters (such as the alphabet you're using). Some fancy compression methods could do better, but they wouldn't be as simple.

Concerning the choice of printable alphabet (whether 32- or 64-character)... I guess it matters little, but starting it with 0-9a-f (or maybe better 0-9A-F) does have the advantage of making it the same as hexadecimal in that part of its range.

I think neither of our schemes is as human-friendly as Stephan's current method (modified as I proposed to code the complete state). Unlike a 10-bits-per-cell fixed-length code-string, that method doesn't explicitly encode all the candidates, but instead (if done properly) codes just enough info per cell sufficient to deduce the complete grid from the game rules.
r.e.s.
 
Posts: 337
Joined: 31 August 2005

Postby Smythe Dakota » Tue Feb 20, 2007 8:43 am

Let's forget about file size, and concentrate on readability and backward compatibility.

Even on a diskette, a file uses up a minimum of 512 bytes of disk space, i.e. the remaining diskette space is reduced by 512 bytes even if the file is shorter than that.

On a hard drive or CD-ROM, this "unit size" is even greater, probably 2048 at least.

So we can afford to splurge. If the original puzzle is

Code: Select all
1  -  -     4  -  -     7  -  -
-  2  -     -  5  -     -  8  -
-  -  3     -  -  6     -  -  9

4  -  -     7  -  -     1  -  -
-  5  -     -  8  -     -  2  -
-  -  6     -  -  9     -  -  3

7  -  -     1  -  -     4  -  -
-  8  -     -  2  -     -  5  -
-  -  9     -  -  3     -  -  6


-- (this is probably not a legitimate puzzle) -- then it can be represented readably as follows (before any candidates are eliminated):


100400700020050080003006009400700100050080020006009003700100400080020050009003006


| 100000000 | 123456789 | 123456789 | 000400000 | 123456789 | 123456789 | 000000700 | 123456789 | 123456789 |
| 123456789 | 020000000 | 123456789 | 123456789 | 000050000 | 123456789 | 123456789 | 000000080 | 123456789 |
| 123456789 | 123456789 | 003000000 | 123456789 | 123456789 | 000006000 | 123456789 | 123456789 | 000000009 |

| 000400000 | 123456789 | 123456789 | 000000700 | 123456789 | 123456789 | 100000000 | 123456789 | 123456789 |
| 123456789 | 000050000 | 123456789 | 123456789 | 000000080 | 123456789 | 123456789 | 020000000 | 123456789 |
| 123456789 | 123456789 | 000006000 | 123456789 | 123456789 | 000000009 | 123456789 | 123456789 | 003000000 |

| 000000700 | 123456789 | 123456789 | 100000000 | 123456789 | 123456789 | 000400000 | 123456789 | 123456789 |
| 123456789 | 000000080 | 123456789 | 123456789 | 020000000 | 123456789 | 123456789 | 000050000 | 123456789 |
| 123456789 | 123456789 | 000000009 | 123456789 | 123456789 | 003000000 | 123456789 | 123456789 | 000006000 |


This takes 1090 bytes. The first 81 are the usual, followed by three CR/LFs (triple space) for readability. The remaining lines (one line per row) are the candidates, separated by spaces and the vertical bar | character. Each line ends with CR/LF. Bands are separated by an additional CR/LF.

This scheme even allows Notepad (or a similar text editor) to be used as the solving tool. As an additional benefit, candidates in the same column will line up vertically.

For even better readability, replace all the 0's with spaces (except in the initial 81), assuming a monospace (Courier) font is being used. Also, triple ||| bars insteads of single bars could separate the stacks.

Solved cells would be distinguishable from givens because the former would still be 0's in the initial 81-digit string.

Bill Smythe
Smythe Dakota
 
Posts: 534
Joined: 11 February 2006

Previous

Return to General