OK, the details from your latest post are quickly converging on the essentials.
Partitioning the catalogue by band has advantages and disadvantages which I prefer to skip here. Let's assume we are working with the whole catalogue.
The most important thing in this data model (beyond the puzzle generation itself) is obtaining the grid index, right?
My point is that binary search isn't effective on huge files, but is very effective on chunks that fit in RAM.
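As a minimal illustration of the in-RAM case (the compact Grid key below is my assumption, not your actual storage format):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Grid {                        // hypothetical 128-bit key holding
    std::uint64_t hi, lo;            // the 80 significant bits of a grid
    bool operator<(const Grid& o) const {
        return hi != o.hi ? hi < o.hi : lo < o.lo;
    }
};

// Binary search within a sorted chunk: O(log n) comparisons, no disk seeks.
std::ptrdiff_t find_in_chunk(const std::vector<Grid>& chunk, const Grid& g) {
    auto it = std::lower_bound(chunk.begin(), chunk.end(), g);
    return (it != chunk.end() && !(g < *it)) ? (it - chunk.begin()) : -1;
}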
You can organize the process so that the data (grid + puzzle size + puzzle exemplar) are kept locally in a smaller ordered in-memory container, something like std::map<grid, std::pair<int, puzzle>>.
When the data size exceeds the available RAM, the container can be dumped to a file. At this point we still don't know the grid index.
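A rough sketch of this accumulate-and-dump phase (the Grid key repeats the previous snippet; the Puzzle record and the text line format are equally hypothetical):

#include <cstdint>
#include <cstdio>
#include <map>
#include <string>
#include <utility>

struct Grid {                          // same hypothetical key as above
    std::uint64_t hi, lo;
    bool operator<(const Grid& o) const {
        return hi != o.hi ? hi < o.hi : lo < o.lo;
    }
};
struct Puzzle { char givens[82]; };    // 81 chars + NUL, hypothetical

using Catalogue = std::map<Grid, std::pair<int, Puzzle>>;  // size + exemplar

// std::map iterates in key order, and the zero-padded hex keeps the text
// lines in the same order, so each dumped run comes out already sorted,
// ready for the file-merging steps described below.
void dump_run(const Catalogue& cat, const std::string& path) {
    std::FILE* f = std::fopen(path.c_str(), "w");
    if (!f) return;
    for (const auto& [grid, rec] : cat)
        std::fprintf(f, "%016llx%016llx %d %s\n",
                     (unsigned long long)grid.hi, (unsigned long long)grid.lo,
                     rec.first, rec.second.givens);
    std::fclose(f);
}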
Then you can
- combine several files into a larger one in a single pass, see
https://en.wikipedia.org/wiki/Sort-merge_join. The GNU Coreutils (available for Windows) command for merging is "sort -mu", where -m means merge and -u means unique.
- find the differences between files using the command "comm". This is applicable for determining the newly found records, i.e. those that have no corresponding records in the older (already processed) file.
- find the grid indexes and update the global puzzle catalogue in a single pass by implementing the merge-join in your own code, as sketched below.
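For orientation, the coreutils invocations would look like "sort -mu run1.txt run2.txt > merged.txt" and "comm -13 old.txt new.txt" (the latter prints only the lines unique to the second file); the file names are illustrative. And here is a minimal merge-join sketch for the last step, assuming one canonical grid string per line in both sorted, duplicate-free files, and assuming the grid index is simply the grid's rank in the catalogue:

#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::ifstream cat("catalogue.txt");   // global catalogue, sorted, unique
    std::ifstream fresh("new_grids.txt"); // freshly found grids, sorted, unique
    std::string c, n;
    long index = 0;                       // grid index = rank in the catalogue
    bool have_c = static_cast<bool>(std::getline(cat, c));
    bool have_n = static_cast<bool>(std::getline(fresh, n));
    while (have_c && have_n) {
        if (c < n) {                      // catalogue-only grid: just skip it
            have_c = static_cast<bool>(std::getline(cat, c));
            ++index;
        } else if (n < c) {               // not in the catalogue yet
            std::cout << "new: " << n << '\n';
            have_n = static_cast<bool>(std::getline(fresh, n));
        } else {                          // match: the grid index is known now
            std::cout << index << ' ' << n << '\n';
            have_c = static_cast<bool>(std::getline(cat, c));
            ++index;
            have_n = static_cast<bool>(std::getline(fresh, n));
        }
    }
    while (have_n) {                      // grids sorting past the catalogue end
        std::cout << "new: " << n << '\n';
        have_n = static_cast<bool>(std::getline(fresh, n));
    }
}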
I know that, after my suggestions for UA scanning, you will try to do exactly the opposite of what I wrote here. In your place I would do the same.
Some old code for working with collections of compressed grids can be found here.
One possibility to encode a puzzle (of known size and known solution grid) is to write an 80-bit map of the givens, determining the 81st cell from the puzzle size. The code in the above link uses 80 bits per grid, and one of those bits is free there too.
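A small sketch of this encoding, assuming '.' marks an empty cell in an 81-character puzzle string (the bit layout is illustrative):

#include <bitset>
#include <string>

// Bit i is set iff cell i (0..79) is a given; cell 80 is not stored.
std::bitset<80> encode(const std::string& puzzle) {
    std::bitset<80> map;
    for (int i = 0; i < 80; ++i)
        if (puzzle[i] != '.') map.set(i);
    return map;
}

// Decode against the known solution grid and the known number of givens:
// if fewer than `size` givens appear in the first 80 cells, cell 80 must
// be the missing given.
std::string decode(const std::bitset<80>& map, const std::string& solution,
                   int size) {
    std::string puzzle(81, '.');
    int givens = 0;
    for (int i = 0; i < 80; ++i)
        if (map[i]) { puzzle[i] = solution[i]; ++givens; }
    if (givens < size) puzzle[80] = solution[80];
    return puzzle;
}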
In general, that covers everything I wanted to contribute.