I've made a start on
Project LCT. The first stage is "LCT-20", which will look at the existence of 20-clue puzzles. LCT-20 started with a copy of the ED grid catalog. When we find a 20C puzzles on an ED grid, we replace the grid with this representative puzzle. A status code identifies the number of clues in the puzzle, and whether this number is known to be a minimum.
The very first update to the LCT catalog replaced all the known 17C grid entries (46,300) with a 17C puzzle and marked with status = "17X", indicating that a 17C puzzle exists, and we know that no 16 puzzle exists. When we find a 20C puzzle for a grid we don't already have a result for, it is assigned status code "20C", which means this puzzle exists (obviously) but is not necessarily the minimum clue count possible for this grid.
We also have, thanks to
blue (and confirmed by
dobrichev), 4 grids that have known 21C, but have been proven to have no 20C puzzles. These 4 are registered with status "21X".
The objective of LCT-20 is to identify for ALL unresolved grids whether or not a 20C puzzle exists, to register the puzzle exemplars for all those found to have one, and finally to identify the lowest number of clues for the (hopefully few) grids that remain unresolved. If 21C puzzles can be found for all of these then the hypothesised result of "LCT = 21" is proven, and we will have the evidence to back that up.
We will also have what I hope will be a useful resource in the form of the LCT catalog itself.
Finding 20C PuzzlesThis can be done in 3 ways that I know of:
- Morphing: this is the fastest method of finding 20C puzzles, by morphing existing ones. A "new grid" is identified when a morphed puzzle solution yields a CF form that is new to the catalog. Very effective, but diminishing returns apply, as the yield ( the % of morph results that find new grids) degrades over time.
- Reduction: this tests individual grids, by repeated random reduction of the solution grid to a minimal puzzle state, until a puzzle of the desired size is found. For 20C/21C puzzles this typically takes, on my system, a few seconds per grid.
- HSE: Hitting Set Enumeration, this is the slowest method of testing a specific grid, but has the advantage of rigorousness - it can definitively prove whether or not a 20C puzzle exists. Method of last resort for individual grid status resolution.
Current StatusI have 4 x worker processes running on each of my 2 PC's. Each worker uses morphing to find as many 20C grids/puzzles as possible. These are grouped together in batches, each batch containing a set of 20C ED grids+puzzles, sorted and with duplicate grids removed. The worker doesn't actually check the catalog, so these are not necessarily new grids. The new grid yield only becomes apparent when batches are processed by the catalog update process.
The update process is BAND driven - the catalog has a separate file for each band, and the largest band can be wholly loaded into available RAM, so the method I have adopted is to process a set of batch files (currently 64) in parallel, loading/updating each band in sequence. (The update process is probably sub-optimal, and I am still trying to improve it, but optimising IO-bound processes is a tricky proposition, particularly so on Windows).
Currently one update pass takes about 3-4 hours. For a database that is effectively 420Gb, that's probably reasonable enough. Unfortunately the process takes much longer when worker processes are running. Currently the workers are doing lots of IO themselves, and this seems to seriously affect the time taken for the update process to write the new band contents. I'm working on a new worker process that will use much less IO, but meanwhile I have to stop the workers whenever I want to do an update.
Nevertheless, despite these problems I am getting yields of 100 million new grids identified each day (initially it was around 150 million), and at the last update had found 20C puzzles for
2.27 billion grids (41.5%).
A batch set (64 batches) typically has up to 300 million grids, of which on average 230 million are distinct within the set. The workers, happily, deliver mostly distinct grids wrt each other.
The net yield at update time for a batch set is currently 45%. But this is coming down with each set, of course. 5 days ago (7 sets) it was 67%.