C programmers using large tables and looking for high speed should definitely be aware of the cache properties of the machine they're using. Understanding your memory access pattern, rearranging it to avoid random access, and aligning data to cache line boundaries can all be done in C. The sorting we used for 4x3 counting made by far the biggest difference in speed, and it was all written in C.
Some C compilers provide a means of doing SSE2 operations in C, using intrinsic function calls. I know Microsoft Visual C++ does; I don't know about others, and I've never used them since I prefer just writing assembly. I would consider using SSE2 when your computation can benefit from SIMD parallelism, i.e. operating on 1-, 2-, 4-, or 8-byte integers, or 4- or 8-byte floats, in parallel. This does not work well when the data have to be fetched from random locations, but when they are packed in sequential locations the performance can be great.
I would probably forget about non-temporal stores and prefetch. Software pipelines are only useful in assembly language.
Note that most compilers make writing an inner loop in assembly language, embedded in the C program and with access to C data, rather painless. Writing a short inner loop may be a nice project for the more adventurous C programmer.
I don't know about GCC and -O9, but do let it optimize as much as it can. I use MS Visual C++, and I'm generally impressed with the code it generates.
I almost never use a profiler; I can't remember the last time I used one. First choice for finding bottlenecks is to step back from the computer and think about what's going on. This works well when you have a good mental model of what the CPU is actually doing. You'll note that all of my explanations of speeding up 4x3 are based on an analysis of memory and CPU characteristics, not on empirical timing data. Of course it has taken me three decades to get to this point, and there is still quite a bit I can't figure out about what's going on inside the P4.
Second choice, for me anyway, is to measure execution times. It does help to know where to look, since it's impractical to time everything (I guess that's what a profiler does, but you have to learn how to use it, and you use it so infrequently that you have to relearn each time). clock() is good for timing relatively slow things; on my machine it has a resolution of 1 ms, so timing events less than 10 ms is unreliable. I use rdtsc because it has CPU clock tick resolution, and can be used for timing very short events, like one iteration of an inner loop. I wrap rdtsc in a C++ class for convenience.
Keep in mind that the computer is always doing other things besides running your program, so measured times can vary. One approach is to run the code say 100 times and get an average, but of course this takes 100 times longer. When I want to time code I make sure to close every other program, and I will go so far as to unplug the Ethernet cable so there won't be any network traffic to respond to. I'm not sure if this really helps, but it's easy to do.