Creating a balanced test set for Sudoku software

Programs which generate, solve, and analyze Sudoku puzzles

Creating a balanced test set for Sudoku software

Postby Clouded » Sun Apr 17, 2022 9:06 am

There are a number of machine learning projects/papers on Sudoku, but they all suffer from the same problem: they're evaluated on exceptionally easy datasets. (Sometimes datasets conssiting mostly of SE 0.0 puzzles.)

I've been training a neural net to solve Sudoku problems, but it's hitting 100% accuracy on all easy datasets. It solves about 50% of patterns game problems, but I want to have a more granular measure that tells me what it can/can't solve. Would it make sense to plot % solved for each different SE rating? That is, show that it solves n% of SE 1.0 problems, m% of SE 1.1 problems, etc..

One difficulty I'm hitting with that idea is that the distribution of difficulties is uneven. So e.g. if you look at the chart at the end, there are very few problems with the 5.1 rating. (I assume this is something to do with the distribution of the complexity rating mentioned at http://forum.enjoysudoku.com/post318934.html#p318934 , but I don't understand that well -- embarrassingly, my net is much better at Sudoku than me.)

I'd be grateful for any thoughts.

Image
Last edited by Clouded on Sun Apr 17, 2022 6:38 pm, edited 1 time in total.
Clouded
 
Posts: 18
Joined: 10 December 2020

Re: Creating a balanced test set for Sudoku software

Postby Hajime » Sun Apr 17, 2022 5:55 pm

Clouded wrote:One difficulty I'm hitting with that idea is that the distribution of difficulties is uneven

Why is that a problem?
I would be very interested howmany of the 2400 4.0 puzzles you can solve with neural net solving and howmany of the 20 5.1 puzzles. Better in percentage and is there a slope going down?
Just accept the fact that there are many more 4.0 puzzles than 5.1 puzzles. It is a SE fact.
User avatar
Hajime
 
Posts: 1377
Joined: 20 April 2018
Location: Fryslân

Re: Creating a balanced test set for Sudoku software

Postby Clouded » Sun Apr 17, 2022 6:06 pm

Hajime wrote:Why is that a problem?


Because the measurements for e.g. 5.1 will be extremely noisy. The net could just be 'lucky/unlucky' on the few problems that are in the mix.

I've just been looking at the possibility of augmenting the PG dataset with other problem sets to shore this up, but I'm nervous about mixing problems from different sources for other reasons.

I think what I may do is use the PG dataset but throw out any rating where there are <100 puzzles. [In practice that's 4.9, 5.1, 5.8, 5.9, 6.0, 6.1, 6.2, 6.7, 7.7, and >= 10.9 .]

BTW, this is how Hodoku difficulty ratings line up with SE (specifically, SukakuExplainer): https://i.imgur.com/dKnne8v.png

(How do I use the IMG tag and have that image inline at a sensible size?)
Clouded
 
Posts: 18
Joined: 10 December 2020

Re: Creating a balanced test set for Sudoku software

Postby Hajime » Sun Apr 17, 2022 6:15 pm

Clouded wrote:throw out any rating where there are <100 puzzles.

Keep them in, but color them red (too noisy).
Very interesting...

Hodoku rating is another issue. One problem at a time. (Is that English? :? )
User avatar
Hajime
 
Posts: 1377
Joined: 20 April 2018
Location: Fryslân

Re: Creating a balanced test set for Sudoku software

Postby Clouded » Sun Apr 17, 2022 6:35 pm

Hodoku rating is another issue.


It was less that I was interested in the Hodoku rating than that I was thinking of using Hodoku to generate puzzles with given SE ratings. (I had datasets of 1,000,000 problems at each of the Hodoku difficulties lying around, so I figured I might as well graph them.)

In general, most resources on the internet seem to have easy problems. E.g. here is Gordon Royle's 17-clue dataset: https://i.imgur.com/xoLGgh0.png
Clouded
 
Posts: 18
Joined: 10 December 2020

Re: Creating a balanced test set for Sudoku software

Postby Hajime » Sun Apr 17, 2022 6:54 pm

Champagne has a database of 3M+ puzzles with a SE>10.
Link is here:
http://forum.enjoysudoku.com/the-hardest-sudokus-new-thread-t6539-720.html?hilit=Database#p296155
User avatar
Hajime
 
Posts: 1377
Joined: 20 April 2018
Location: Fryslân

Re: Creating a balanced test set for Sudoku software

Postby m_b_metcalf » Sun Apr 17, 2022 6:58 pm

Clouded wrote:I think what I may do is use the PG dataset but throw out any rating where there are <100 puzzles. [In practice that's 4.9, 5.1, 5.8, 5.9, 6.0, 6.1, 6.2, 6.7, 7.7, and >= 10.9 .]

I'm confused about the origin of the data set you used in the histogram. In the Patterns Game, 4.0 has been played ~514 times and 6.7 ~561 times.

Regards,

Mike
User avatar
m_b_metcalf
2017 Supporter
 
Posts: 13631
Joined: 15 May 2006
Location: Berlin

Postby 1to9only » Sun Apr 17, 2022 7:07 pm

These might be of interest:

http://forum.enjoysudoku.com/post291343.html#p291343
In Patterns Game 384, I generated 10848 unique ER/EP/ED puzzles for this pattern.

http://forum.enjoysudoku.com/post291249.html#p291249
4353 unique ER/EP/ED puzzles submitted in the patterns games 1 to 385 inclusive.

The Patterns Game is now at 460, so the numbers will be higher but I've not produced a new list.

There are also a number of mismatched ratings in the PG puzzles list, I usually ignore these!
https://github.com/1to9only/patterns-game/raw/main/001-452-mismatches.txt
User avatar
1to9only
 
Posts: 4177
Joined: 04 April 2018

Re: Creating a balanced test set for Sudoku software

Postby Clouded » Sun Apr 17, 2022 7:08 pm

@Hajime, @1to9only, thanks!

@m_b_metcalf: First, my data is likely out of date; I downloaded the patterns games about 9 months ago and converted them to a suitable format, and haven't tinkered with them since. I have 41364 games, of which the last few are

100020003030405010004000600020030050500702008080050070002000700010206030300080001
800070003030206080002000400090030060200607009080020040001000500020704030300090004
600020008070103060009000300090030080200801003080070050004000500060702030700040006
100020003020304010005000600040010070700402006050030080002000900010209060800040001

Second, I've been using SukakuExplainer to rate the puzzles; unlike any of the other SE derivatives I could find, it will output all intermediate steps. It's possible that its ratings differ a little from the rater you use for the patterns game?
Clouded
 
Posts: 18
Joined: 10 December 2020

Re: Creating a balanced test set for Sudoku software

Postby m_b_metcalf » Sun Apr 17, 2022 7:17 pm

Clouded wrote: It's possible that its ratings differ a little from the rater you use for the patterns game?

But the difference between your counts for 4.0 and 6.7 and those of the PG (as examples) is an order of magnitude out. Something's wrong.

Mike
User avatar
m_b_metcalf
2017 Supporter
 
Posts: 13631
Joined: 15 May 2006
Location: Berlin

Postby 1to9only » Sun Apr 17, 2022 7:57 pm

m_b_metcalf wrote:But the difference between your counts for 4.0 and 6.7 and those of the PG (as examples) is an order of magnitude out. Something's wrong.

If you are using SudokuMonster's SukakuExplainer, (not sure about this) I think the default is the new techniques are enabled.
Skyscrapers are rated 4.0, so a lot of the PG puzzles will end up having a lower rating!!
User avatar
1to9only
 
Posts: 4177
Joined: 04 April 2018

Re: Creating a balanced test set for Sudoku software

Postby Clouded » Sun Apr 17, 2022 8:14 pm

I'm using the latest version (1.18.1) from here, if that helps: https://github.com/SudokuMonster/Sukaku ... r/releases . Version notes for v1.17 do indeed say 'Update default solving techniques'.

Edit: it says...

Code: Select all
The techniques are (in this order)
01: 1 Hidden Single
02: 1 Direct Pointing
03: 1 Direct Hidden Pair
04: 1 Naked Single
05: 1 Non-Consecutive Forcing Cell
06: 1 Locked Non-Consecutive
07: 1 Ferz Non-Consecutive Forcing Cell
08: 1 Ferz Locked Non-Consecutive
09: 1 Direct Hidden Triplet
10: 1 Pointing & Claiming
11: 0 Generalized Intersections
12: 1 Naked Pair
13: 0 Generalized Naked Pair
14: 1 X-Wing
15: 1 Hidden Pair
16: 1 Naked Triplet
17: 0 Generalized Naked Triplet
18: 1 Swordfish
19: 1 Hidden Triplet
20: 1 Scraper, Kite, Turbot
21: 1 XY-Wing
22: 1 XYZ-Wing
23: 1 WXYZ-Wing
24: 1 Unique Rectangle / Loop
25: 1 Naked Quad
26: 0 Generalized Naked Quad
27: 1 Jellyfish
28: 1 Hidden Quad
29: 1 3 Strong-linked Fishes
30: 0 Generalized Naked Quintuplet
31: 0 Generalized Naked Sextuplet
32: 1 VWXYZ-Wing
33: 1 Bivalue Universal Grave
34: 1 4 Strong-Linked Fishes
35: 1 Aligned Pair Exclusion
36: 0 5 Strong-Linked Fishes
37: 0 6 Strong-Linked Fishes
38: 1 UVWXYZ-Wing
39: 1 Forcing Chains & Cycles
40: 1 TUVWXYZ-Wing
41: 1 Aligned Triplet Exclusion
42: 1 Nishio Forcing Chains
43: 1 Multiple Forcing Chains
44: 1 Dynamic Forcing Chains
45: 1 Dynamic Forcing Chains (+)
46: 1 Nested Forcing Chains

Default --techs=1111111111010111011111111011100111100111111111
Clouded
 
Posts: 18
Joined: 10 December 2020

Re: Creating a balanced test set for Sudoku software

Postby denis_berthier » Mon Apr 18, 2022 2:11 am

.
Hi Clouded,
I'm not sure what you are looking for, but here is a large collection of puzzles (almost 6 millions), with their W ratings and the SER ratings of the first 3 millions: https://github.com/denis-berthier/Controlled-bias_Sudoku_generator_and_collection
It is not unbiased - no unbiased collection is known as of now (an unbiased collection could easily be extracted from it but it would be small).
However, the bias is known and it is therefore easy to use it for unbiased computations. Moreover, each sub-collection with a fixed number of clues n is unbiased among the minimal puzzles with n clues.
All this, plus associated statistical results, is explained in detail in my book [PBCS]: http://forum.enjoysudoku.com/pattern-based-constraint-satisfaction-2nd-3rd-eds-t32567.html
Last edited by denis_berthier on Tue Apr 19, 2022 6:12 am, edited 1 time in total.
denis_berthier
2010 Supporter
 
Posts: 4214
Joined: 19 June 2007
Location: Paris

Re: Creating a balanced test set for Sudoku software

Postby eleven » Mon Apr 18, 2022 7:29 pm

Hi Cloud,

years ago i wondered, if it could be interesting to use neural nets for solving sudokus, but - having very little insight in that topic - it did not make much sense for me.
So my first question is, what do you mean with "it can solve a puzzle". Does it just mean, that it spits out a solution or does it show a way, how it can be solved (i mean e.g. solving steps) ?
If you want to know, how good it is generally, only Denis' collection would be interesting, which is the only one which provides really random puzzles. Of course we know, that the majority of random puzzles is just trivial.
If you train it with e.g. the hardest puzzles collection, i assume, that at some point it will be better for them than for say only hard ones, but this depends on, how it does it.
So, whats your goal and what can be the results ?
eleven
 
Posts: 3164
Joined: 10 February 2008

Re: Creating a balanced test set for Sudoku software

Postby Clouded » Mon Apr 18, 2022 7:59 pm

@denis_berthier: thank you!

@eleven Big question! Hope you don't mind a somewhat long answer.

I am a researcher in cognition and my interest in artificial neural nets comes from that standpoint -- I want to understand how they mimic/differ from human cognition. I'm using Sudoku as a testbed to analyse that. Accordingly, I set up a net that solves Sudoku like a human would, one move at a time. You feed in a board state and it returns a board state with one more entry filled in.* You then feed that state back into the net and repeat N times, where N is the no. empty squares in the initial puzzle.

*This is a bit of a simplification -- it returns a probability distribution on possible moves for each empty square. I take the highest-probability move and throw away the rest.

Now, the net has to learn everything from first principles -- during training it gets given millions of Sudoku problems, and feedback on whether it makes correct moves or not. So it's not at any point being told what an X-wing is, or even given puzzles that specifically have X-wings in; it just gets faced with Sudoku and has to make progress somehow.

In order to create a 'fair' comparison point for the net, I tackled Sudoku myself the same way -- I solved the 320 problems in the book 'The Original Sudoku' from scratch, without looking up anything about standard solving techniques. [So e.g. I have no idea what an X-wing actually is -- please don't 'spoil' me yet as it will mess up my methodology.] Then I compared the net's step-by-step solutions to my own ones and asked questions like 'did we hit the same "bottlenecks"?' I.e. did we find we the same problem states 'hard'? The answer was yes -- you can see that in this image: Image


In terms of my goal, I wanted to point out to researchers that it doesn't mean much to say that e.g. a net gets 95% accuracy on the 17-clues problem set. It's just a trivial bar. It would be nice to have an alternative dataset to point people at, and after talking to you all I think that the Patterns Game dataset is a good one for covering the spread of difficulties. I should say though that I'm not currently trying to get the net to solve the hardest problems possible -- I want it to solve hard enough problems that there is some substance to analyse, but beyond that I'm aiming to have it be as small as possible, so that I can try to understand how it's solving problems. Right now it's better at Sudoku than me, which is good enough!


Edit: graagh, the image is huge again. What am I doing wrong?
Clouded
 
Posts: 18
Joined: 10 December 2020

Next

Return to Software