udosuk wrote:gsf wrote:back of the envelope: 5472730538 into 4.7Gi is < 1 byte per puzzle -- none of the discussions ever got that close
Thanks gsf... Very thoughtful calculation...
well, it turns out we can use a smaller envelope for our calculations ...
as this deals directly with minlex grids I'll hijack the duscussion here
if we're talking about compressing a large grid catalog (and I am: all essentially different grids)
then the discussion to this point has been much too local
let each grid be a minlex record of 81 data bytes, with all records sorted in ascending order
the difference between adjacent grids, especially in the lower bands, is very small on average
coding a compressor to take advantage of the differences can be a big win
burrows-wheeler (bzip2) does this, and does better than gzip, but doesn't take advantage
of the sudoku grid domain
i.e., there are both lexical and functional similatrities between adjacent grids
and off the shelf compressors for the most part only handle lexical similarities well
some of the sudoku specific similarities are
(1) first 28 minlex digits can be represented as 416 bands
(2) the remaining digits can be specified w.r.t. the basic sudoku constraints (~singles)
(1) is a sudoku specific lexical insight
(2) is a functional insight
I set up a coder to compress into windows of 1Mi
data compressed is <band index, #automorphisms, grid> for each grid
early bands compress well just using the sudoku specific lexical similarities and functional info
but it turns out that burrow-wheeler on the sudoku specific coding can get another factor of 2 for the lower (and more populated) bands
here are the number of bytes per grid for a few bands
sudz: sudoku specific coding, sudz+bw: sudz followed by (modified) burrows wheeler
- Code: Select all
bands grids sudz sudz+bw
006 96229042 1.66 0.74
100 27849953 1.92 1.21
150 11577852 2.17 1.44
300-416 2097068 3.51 2.78
more info later, including the sudoku specific coder, when the bands complete
sometime next week