Shiru wrote: I'm not saying your point is wrong, I'm just explaining mine. If memory is slow, then to make code faster we generally have to minimize memory accesses. Keeping the most-used variables in registers is just a particular case of that. But we can't fit every variable we need into registers anyway, and other methods exist. We have cache, we can temporarily copy global variables into register variables for the duration of a loop, etc. And once we minimize memory access in C code, we hit the same bottleneck as in the equivalent assembly code: for loops, it's usually reading the input data stream from slow memory.
Yes, I wonder if properly placed prefetch instructions, combined with properly designed data structures, would bring the speed of C code close to assembly by making use of the fact that cached data is ALMOST as fast as registers.
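To make the two ideas concrete, here is a rough C sketch of what that might look like: copy a global into a local for the duration of the loop, and issue a prefetch hint a few elements ahead of the one being read. Everything named here (`gain`, `sum_scaled`, `PREFETCH_AHEAD`) is made up for illustration, and `__builtin_prefetch` is a GCC/Clang builtin, not standard C, so treat this as a sketch under those assumptions rather than a recipe.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical global, standing in for a frequently used variable. */
int32_t gain = 3;

/* Hypothetical prefetch distance in elements; a guess that would need
 * tuning (or removing) on any real target. */
#define PREFETCH_AHEAD 16

int64_t sum_scaled(const int32_t *data, size_t n)
{
    /* Copy the global into a local so the compiler is free to keep it
     * in a register for the whole loop. */
    const int32_t g = gain;
    int64_t total = 0;

    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        /* Hint that data[i + PREFETCH_AHEAD] will be read soon.
         * Arguments: address, 0 = read, 0 = low temporal locality. */
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0, 0);
#endif
        total += (int64_t)data[i] * g;
    }
    return total;
}
```

Whether this actually helps depends entirely on the CPU and compiler. An optimizing compiler will often hoist the global on its own, and on processors with hardware prefetchers a plain linear scan like this may gain nothing from the hint, so it is something to measure, not assume.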
Anywho, half the issue with figuring out what to keep in registers in assembly is also figuring out how to juggle that data around when you need the registers for something else. That's where the ART part of the Art of Assembly comes into play.
Assembly code won't magically make something ten times as fast - you have to be good at writing assembly as well. I rewrote PCx three times from scratch with three completely different CPU emulation methods before it was acceptable.