blue wrote:I would enjoy reading some time, some details about how you're dealing with such a large volume of puzzles.
The magic is the "merge" algorithm and the GNU tools
sort and
comm that use it.
I am using sorted files of canonical class representative of puzzles.
For {-n+m} operation the input is a puzzle list which during the reading is canonicalized and duplicates are removed, and the output is a file with mixed-size irreducible canonicalized puzzles of size (-n+1) up to (-n+m) tagged with the number of givens. I use the number of givens for faster later filtering by "grep" command.
I am setting +m so that eventual 40 or 41-clue puzzles to be found. At the final "+" stages the amount of the irreducible multiple-solution puzzles is small so that large +m don't affect the performance.
For the general 38-clue and 39-clue extension I use {+1} at a time and the following data model.
File family "35.doneup" ... "41.doneup" - ordered list of multiple-solution irreducible puzzles that have been checked "up" by adding one more clue in all possible ways. One puzzle per line followed by <tab> and one unused tag with the number of givens.
File family "36" ... "40" - ordered list of unique minimal puzzles of this size. Same format. I see that the file "40" is empty.
File family "35.doingup", etc. - a temporary file with newly discovered multilple-solution irreducible puzzles that should be checked "up". Prior to processing the large file is split into parts "35.doingup.part00", etc. in order to fit the maximal batch size. For 35 to 36 transfer, for 64 GB of RAM and 32 threads, I use batch size of 150 millions of puzzles which takes about 50 GB RAM, 9.5 hours processing and 15 minutes disk i/o.
The splitting command is "split -d -l150000000 35.doingup 35.doingup.part".
The tool produce ordered list of single-solution irreducible to the stdout and multiple-solution list to the stderr. the outpus is about twice as the input (12.75 GB input, 23 GB + 0.3 GB output).
Then the results are merged and here the ordered list takes place. The "sort" command has "merge" option and in this mode no sotrting is performed but several already sorted files are merged into one, optionally with duplicates removed. The command is "sort -m -u file1 file2 > outfile". Some Windows implementations of this command work well only when 2 input files are given and use temporary copies when more than 2 files are given. Linux implementation performs well doing all the job in one pass.
Then the results are split into "36.new" and "36.doingup" by removal of already processed/known puzzles from previous passes. Again the ordered lists take place. The command "comm" takes 2 ordered files and depending on parameters outputs puzzles from either first, second, or both files. It does it in one pass. In our case, "comm -13 knowns justfound > new" subtracts knowns from justfound producing new.
This the seed for the next step "up" is prepared.
The newly found unique puzzles are involved to "twin" check and {-n} transformations for the next pass.
For each size, the already used seed is subtracted from the newly obtained seed. At early stages (35s, 36s, and even 37s) a very small portion of the seed is duplicated so keeping strict track what has been already processed is probably not necessary. Repeating some amount of the processing could take less time than comparing the huge files, even on single pass.
This is the script for the 35 to 36 transfer that is currently running.
- Code: Select all
/dobrichev/tools/plus1/plus1 < 35.doingup.part00 > 35.doneup.part00.ss 2> 35.doneup.part00.ms
/dobrichev/tools/plus1/plus1 < 35.doingup.part01 > 35.doneup.part01.ss 2> 35.doneup.part01.ms
/dobrichev/tools/plus1/plus1 < 35.doingup.part02 > 35.doneup.part02.ss 2> 35.doneup.part02.ms
/dobrichev/tools/plus1/plus1 < 35.doingup.part03 > 35.doneup.part03.ss 2> 35.doneup.part03.ms
/dobrichev/tools/plus1/plus1 < 35.doingup.part04 > 35.doneup.part04.ss 2> 35.doneup.part04.ms
/dobrichev/tools/plus1/plus1 < 35.doingup.part05 > 35.doneup.part05.ss 2> 35.doneup.part05.ms
/dobrichev/tools/plus1/plus1 < 35.doingup.part06 > 35.doneup.part06.ss 2> 35.doneup.part06.ms
/dobrichev/tools/plus1/plus1 < 35.doingup.part07 > 35.doneup.part07.ss 2> 35.doneup.part07.ms
/dobrichev/tools/plus1/plus1 < 35.doingup.part08 > 35.doneup.part08.ss 2> 35.doneup.part08.ms
/dobrichev/tools/plus1/plus1 < 35.doingup.part09 > 35.doneup.part09.ss 2> 35.doneup.part09.ms
/dobrichev/tools/plus1/plus1 < 35.doingup.part10 > 35.doneup.part10.ss 2> 35.doneup.part10.ms
sort -m -u 35.doneup.part00.ss 35.doneup.part01.ss 35.doneup.part02.ss 35.doneup.part03.ss 35.doneup.part04.ss 35.doneup.part05.ss 35.doneup.part06.ss 35.doneup.part07.ss 35.doneup.part08.ss 35.doneup.part09.ss 35.doneup.part10.ss | comm -13 36 - > 36.new
sort -m -u 35.doneup.part00.ms 35.doneup.part01.ms 35.doneup.part02.ms 35.doneup.part03.ms 35.doneup.part04.ms 35.doneup.part05.ms 35.doneup.part06.ms 35.doneup.part07.ms 35.doneup.part08.ms 35.doneup.part09.ms 35.doneup.part10.ms | comm -13 36.doneup - > 36.doingup
Note on the last 2 lines that I am saving one huge file write/read using pipe. The "-" parameter is not well documented for the "comm" command but it works just like in "sort" command.
The next commands I will execute (after seeing everything goes OK) are to merge files 35.doneup with 35.doingup into 35.doneup.
- Code: Select all
sort -m -u 35.doneup 35.doingup > 35.tmp
rm 35.doneup
rm 35.doingup
mv 35.tmp 35.doneup
For stripping the tags (when I have to merge files with tags with files w/o tags) the command is "cut -f1 infile > outfile".
For adding tag with puzzle size and for sorting a possibly unsorted file, I relaxed "minus n" tool to accept 0 as parameter. In this mode it reads, canonicalizes, removes duplicates, and writes ordered puzzles with size tag.
It is paradoxical but the solver I used for {+} operations works only with pencilmarks, zeroing them when the cell is solved, and never knows the solution. The "twin" testing is done on another machine.