Oh dear, I've been led into a trap, it seems!
My theory above was based on the (false) assumption that JCZSolver used "128-bit registers", e.g. SSE4.1. A comment in the code seemed to confirm this, so I didn't look any further, although I did wonder why it compiled cleanly without requiring "-march=native" or "-msse4.1" or similar.
But it doesn't use any 128-bit register code at all, it seems ...
Thus the version I just sent to Serg will probably have no effect whatsoever, and his performance mystery remains unresolved.
If anything, the mystery deepens. From 3x faster to nearly 2x slower is quite a turnaround (!), and a really, really weird one.
[EDIT] Unless, of course, my GCC compiler is very much smarter than whatever was used to produce Serg's JCZSolver ... but that seems rather far-fetched! Data sensitivity? fsss2 seems to be better at singles-only cases, but only by 15-20%. Perhaps seeing his actual benchmark puzzle set will shed some light?
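
For what it's worth, a quick way to see which SIMD extensions a given build has actually switched on is to test the compiler's predefined macros. This is only a minimal sketch, assuming GCC or Clang on x86; simd_level() is a name I made up for illustration and is not part of JCZSolver or fsss2.

```cpp
#include <string>

// Hypothetical helper (not from JCZSolver or fsss2): reports the highest
// x86 SIMD extension enabled for this translation unit, judged purely
// from the macros GCC and Clang predefine for the active target flags.
std::string simd_level() {
#if defined(__AVX2__)
    return "AVX2";      // e.g. built with -mavx2, or -march=native on a recent CPU
#elif defined(__SSE4_1__)
    return "SSE4.1";    // e.g. built with -msse4.1
#elif defined(__SSE2__)
    return "SSE2";      // the usual baseline for x86-64 targets
#else
    return "none";      // no SSE2 or later enabled (or not an x86 target)
#endif
}
```

Printing simd_level() from a build with and without "-msse4.1" shows whether the flag makes any difference to what the compiler is allowed to use. It says nothing about whether the source contains any 128-bit code in the first place, so it does not replace reading the code.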