First of all, sadly, most of the ideas presented here just will not work if you want to hit 60fps. In 256 colour mode, doing byte writes, it will take almost an entire frame just to do a straight copy from SDRAM to the frame buffer. So one plane and a few sprites is about the best you can expect.
The RAM speed doesn't really come into it. There is a FIFO in the VDP between you and the framebuffer, and writing constantly is going to get you one write every five cycles, at best. There's no way around that.
I could never work out whether you can write faster during VBLANK. The answer seemed to be no, but since the fastest you could hope for is three cycles per write, due to the RAM speed, the difference isn't huge anyway.
The upshot of this is approx 76000 writes per frame max, *IF* you can hit that 5 cycles per pixel, which isn't easy to do. Given that 320x224 is 71680 pixels, you can see it isn't possible to get much more than one plane this way.
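To put rough numbers on that budget, here's a back-of-envelope sketch. The 23 MHz SH2 clock and a flat 60 frames per second are assumptions made here (they aren't stated above, but they land close to the 76000 figure), and all the names are made up for illustration.

```python
# Rough write budget per frame. All constants are assumptions for illustration.
SH2_CLOCK_HZ = 23_000_000      # assumed: roughly 23 MHz SH2 clock
FRAMES_PER_SECOND = 60         # assumed: flat 60 fps target
CYCLES_PER_WRITE = 5           # best case through the VDP FIFO, per the text
SCREEN_PIXELS = 320 * 224      # one full 256 colour plane, one byte per pixel


def writes_per_frame(clock_hz=SH2_CLOCK_HZ,
                     fps=FRAMES_PER_SECOND,
                     cycles_per_write=CYCLES_PER_WRITE):
    """Upper bound on framebuffer writes in one frame, ignoring all other work."""
    cycles_per_frame = clock_hz // fps
    return cycles_per_frame // cycles_per_write


budget = writes_per_frame()
print("writes per frame:", budget)
print("screen pixels:   ", SCREEN_PIXELS)
print("planes of byte writes that fit: %.2f" % (budget / SCREEN_PIXELS))
```

This is the ceiling with the CPU doing nothing but writing, so the real figure will be lower.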
You could up this to two planes if you do word writes. But that complicates scrolling, and flipping. Unfortunately the SH2 instruction set isn't exactly helpful here, and anything more complicated than that is going to make it very hard to hit that 5 cycles. Alternatively you could do one 32K colour plane, which would be much easier, but... kinda sucks.
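A small sketch of why word writes make scrolling and flipping awkward. It only models the byte packing (the SH2 is big-endian, so the first pixel lands in the high byte); `pack_row` and `flip_row_words` are hypothetical helpers, not anything from a real library.

```python
def pack_row(pixels, scroll_x=0):
    """Pack 8-bit pixels into 16-bit big-endian words, starting at scroll_x."""
    visible = list(pixels[scroll_x:])
    if len(visible) % 2:
        visible.append(0)          # pad the last word with a blank pixel
    return [(visible[i] << 8) | visible[i + 1]
            for i in range(0, len(visible), 2)]


def flip_row_words(words):
    """Horizontal flip: reverse the word order AND swap the bytes in each word."""
    return [((w & 0xFF) << 8) | (w >> 8) for w in reversed(words)]


row = [1, 2, 3, 4, 5, 6, 7, 8]
stored = pack_row(row)             # how the source data would sit in memory

# Even scroll offset: the stored words can be copied straight across.
print(pack_row(row, 2) == stored[1:])

# Odd scroll offset: no stored word can be reused, every output word
# has to be stitched together from two neighbouring source words.
print([hex(w) for w in pack_row(row, 1)])

# Flipping needs a byte swap per word on top of walking the row backwards.
print([hex(w) for w in flip_row_words(stored)])
```

On an odd offset every output word needs a shift and merge, which is exactly the sort of extra work that eats into the 5 cycles.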
You really don't have much time for decompressing/recolouring/caching or converting tiles on the fly. You might save a bit of bandwidth, but writing fast enough is the holy grail here.
DMA is a complete waste of time. It's no faster than the CPU at best, due to that 5 cycle thing again. It's only as fast as the CPU if you use the modes that require both source and destination to be on (IIRC) a 128 byte boundary, which is useless here - and unless you're going to DMA big chunks at a time, it's going to be way slower anyway.
It should be fairly obvious that writing to an external buffer, and then copying to the framebuffer, isn't going to be faster. Also, that means you're going to have to mask the layers, because you can't use the overwrite function for the second layer, or any sprites.
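To make the masking cost concrete, here's a sketch of what compositing a layer into an external buffer looks like. It assumes pixel value 0 means transparent, which is how the overwrite function treats zero bytes as far as I know; `blend_layer` is a made-up name.

```python
TRANSPARENT = 0   # assumed: colour index 0 is the "see-through" value


def blend_layer(dst, src, transparent=TRANSPARENT):
    """Composite src over dst in software: one test per pixel, done by the CPU."""
    for i, pixel in enumerate(src):
        if pixel != transparent:   # this branch is the masking cost
            dst[i] = pixel
    return dst


background = [1, 1, 1, 1]
second_layer = [0, 5, 0, 7]
print(blend_layer(background, second_layer))
```

When writing through the overwrite area, that test is done for you by the hardware, so the inner loop is just fetch and store. In an external buffer it's a compare and branch on every pixel, and then you still have to copy the result to the framebuffer afterwards.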
The best way to hit that 5 cycle per pixel limit is using both CPUs. The first one to write will lock for 5 cycles, the other will continue to run and fetch the next pixel, and eventually lock for 5 cycles when it comes to write - giving the first CPU time to fetch its next pixel, etc. If you time it right you can get max performance this way.
You could use longs - it won't write any faster, but it will mean you lock for 10 cycles instead of 5, giving you more time to prepare the next write.
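The interleaving idea can be played with in a toy model. This is nowhere near cycle-accurate: it assumes a single shared write port, a CPU that stalls until its own write finishes, and a fixed `prep_cycles` of fetch work per write. It ignores FIFO depth, cache misses and bus contention, and every name in it is hypothetical.

```python
def cycles_for_writes(n_writes, n_cpus, prep_cycles, write_cycles=5):
    """Total cycles for n_writes, with CPUs taking turns on one write port."""
    port_free = 0                  # cycle at which the write port is next idle
    cpu_ready = [0] * n_cpus       # cycle at which each CPU can start fetching
    finish = 0
    for i in range(n_writes):
        cpu = i % n_cpus                       # CPUs alternate writes
        want = cpu_ready[cpu] + prep_cycles    # pixel fetched, ready to write
        start = max(want, port_free)           # wait if the port is still busy
        finish = start + write_cycles          # CPU is locked until here
        port_free = finish
        cpu_ready[cpu] = finish
    return finish


n = 1000
for cpus in (1, 2):
    total = cycles_for_writes(n, cpus, prep_cycles=4)
    print("%d CPU(s), 4 cycles prep: %.2f cycles per write" % (cpus, total / n))

# Longer lock per write (as with longs) hides more fetch work per write.
total = cycles_for_writes(n, 2, prep_cycles=8, write_cycles=10)
print("2 CPUs, 8 cycles prep, 10 cycle lock: %.2f cycles per write" % (total / n))
```

In this model one CPU pays fetch plus lock on every write, while two CPUs hide the fetch behind the other's lock as long as the fetch is no longer than the lock. Once the fetch takes longer than the lock, the port sits idle part of the time and you drop below the maximum rate.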
The CPUs do not get in the way of each other anywhere near as much as everybody, everywhere, seems to keep saying. Most instructions in your loop will be coming from each CPU's own instruction cache, most data reads will come from each CPU's own data cache. They only contend when they both need to use the same bus, which, if you do it right, isn't often (this should happen while the other CPU is locked waiting for the VDP, anyway).
Obviously you don't need to clear the screen, you're going to overwrite the entire thing anyway. There may be mileage in checking, while writing the second layer, for blank tiles - and just skipping them - under the assumption that big chunks of the second layer are often empty. On the other hand, you don't want this to become a requirement...
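A quick sketch of the blank-tile skip, just counting writes. The 8x8 tile size and tile index 0 meaning "blank" are assumptions for the example, and `layer_writes` is a made-up helper.

```python
TILE_PIXELS = 8 * 8   # assumed: 8x8 pixel tiles, one byte per pixel
BLANK_TILE = 0        # assumed: tile index 0 is the empty tile


def layer_writes(tile_map, skip_blank):
    """Count the byte writes needed to draw a layer from its tile map."""
    writes = 0
    for row in tile_map:
        for tile in row:
            if skip_blank and tile == BLANK_TILE:
                continue           # nothing to draw, first layer shows through
            writes += TILE_PIXELS
    return writes


second_layer = [
    [0, 0, 3, 0],
    [0, 7, 7, 0],
]
print("no skipping:  ", layer_writes(second_layer, skip_blank=False))
print("skip blanks:  ", layer_writes(second_layer, skip_blank=True))
```

Note the worst case (no blank tiles at all) costs exactly the same as not checking, plus the check itself, which is why you can't budget on the saving.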
This absolutely requires ASM, and mad optimisation skillz.

Good luck, you've made me want to play with this again. Just need that damn dev cart...