There is an article by Radek Pelánek with a different rating approach, not applicable to the hardest puzzles.

I read Stuart's approach as "when you have exhausted all solving techniques from your arsenal, what is the probability of choosing the right cell for guessing". This works when applying the known techniques takes negligible effort compared to the effort of guessing. More accurate results would probably require scaling by the number of possibilities in each of the cells.

Pelánek, on the other hand, estimates how easily you will find a technique from your arsenal that can be applied, leading to eliminations.

Today I did some experimenting with a different approach.

Ignoring the effort of applying the known techniques (singles; I also tried with locked candidates), count the average number of guesses you have to make, choosing only from the "best" guesses.

The best guess is the one most likely to hit the correct value. Since we don't know the correct value for the cell, we choose a (cell, value) pair according to these criteria:

a) maximal probability == minimal number of candidates;

b) maximal direct interactions with other pencilmarks.

Criterion a) is calculated under the simplifying assumption that the number of candidates in the cell and the number of positions of the value within the box, row, and column are independent, so the resulting probability is the product of the individual probabilities. In practice it is computed by searching for the maximal reciprocal of the product of these 4 counts.
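A minimal sketch of criterion a), assuming a grid of pencilmarks represented as a dict mapping (row, col) to a set of candidate digits; the function and variable names here are hypothetical, not from any actual solver:

```python
def guess_weight(cands, cell, value):
    """Reciprocal of the product of the four counts for a (cell, value) pair:
    candidates in the cell, and positions of value in its row, column, box."""
    r, c = cell
    in_cell = len(cands[cell])
    in_row = sum(1 for cc in range(9) if value in cands[(r, cc)])
    in_col = sum(1 for rr in range(9) if value in cands[(rr, c)])
    br, bc = 3 * (r // 3), 3 * (c // 3)
    in_box = sum(1 for rr in range(br, br + 3)
                 for cc in range(bc, bc + 3)
                 if value in cands[(rr, cc)])
    return 1.0 / (in_cell * in_row * in_col * in_box)

def best_guesses(cands):
    """All (cell, value) pairs sharing the maximal weight (before
    tie-breaking by criterion b)."""
    pairs = [(cell, v) for cell, s in cands.items() if len(s) > 1 for v in s]
    top = max(guess_weight(cands, cell, v) for cell, v in pairs)
    return [(cell, v) for cell, v in pairs
            if guess_weight(cands, cell, v) == top]
```

On an empty grid every pair scores 1/(9·9·9·9), which is exactly the flood of equally probable guesses that criterion b) is meant to thin out.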

Criterion b) is purely artificial; its purpose is to limit the number of equally probable guesses. Without such a limitation, when walking through a hardest collection at, say, 50 puzzles per second, every 2000th puzzle takes an hour to complete just because of the large number of equally probable candidates that have to be traversed.

All guesses with equal weight are processed and the results are averaged. Finally the average depth of guessing is calculated, and that is the rating value.
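The averaging over equally weighted guesses can be pictured as a tree whose branches are the best guesses at each point; a hedged sketch (the tree encoding is my own illustration, not the actual implementation):

```python
def average_depth(node):
    """Average guessing depth of a guess tree. A node is either the string
    'solved' (puzzle finished by singles alone, depth 0) or a list of child
    nodes, one per equally weighted best guess; each guess costs 1."""
    if node == "solved":
        return 0.0
    return 1.0 + sum(average_depth(child) for child in node) / len(node)

# One guess solves the puzzle directly, the other needs one more guess:
tree = ["solved", ["solved"]]
print(average_depth(tree))  # 1.5
```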

The correlation with the SE rating is poor, which is no surprise: SE focuses on "hardest step knowledge", while I focus on average effort, weighting every guess at 1, and the single eliminations and probability calculations at 0.

The method is not sensitive to VPT in any way. Puzzles solvable by singles have rating 0. The method depends to some degree on the number of givens.

There are significant deviations in the rating proportions when small changes in the set of techniques are made (+/- locked candidates), or when criterion b) is replaced with a similar one. This noise makes the entire approach not so promising.