Untitled 32X Super Scalar Project

Ask anything you want about the 32X Mushroom programming.

Moderator: BigEvilCorporation

Vic
Interested
Posts: 27
Joined: Wed Nov 03, 2021 6:01 pm

Re: Untitled 32X Super Scalar Project

Post by Vic » Thu Jan 06, 2022 6:48 pm

pw_32x wrote:
Thu Jan 06, 2022 6:34 pm
Is that left-right halves or top-bottom halves?
Left & right for the screen, top and bottom for sprites. I figured I'd get a higher cache hit rate from the top-bottom approach for sprites.
pw_32x wrote:
Thu Jan 06, 2022 6:34 pm
I don't think I can do software sprite rotation fast enough and I have doubts I'll be able to fit all the sprites and their tilted versions in ram.
If you can do scaling at acceptable performance level, then rotation shouldn't be much of a problem, it's almost the same thing really.

Chilly Willy
Very interested
Posts: 2984
Joined: Fri Aug 17, 2007 9:33 pm

Re: Untitled 32X Super Scalar Project

Post by Chilly Willy » Sat Jan 08, 2022 12:36 am

One thing you need for best speed on the SH2 side - keep the MD off the bus. Make sure the Z80 is held reset, the 68000 is running in work ram, and the VDP is not doing any DMA. Running the 68000 main() in rom is enough to maybe halve the speed of the 32X side.

Running sh2 functions from sdram makes them cache quicker. The sdram does burst reads - 8 words in 12 cycles. Nothing else burst reads, meaning the cache is much slower to load. Oh, and make sure you have the cache turned on and set to 4 way set associative. If you're not using the cache (have the cache in scratchpad mode) or you're running with the PC set to uncached pointers to functions, you aren't going to cache the code, which means it won't be fast on real hardware.

This also applies to data in the rom - use cache enabled pointers to data in the rom to cache the data. I'm not sure if this really needs to be said, but using a pointer with the caching bits of the pointer set to uncached space means the data will not cache - it will read from the rom every time it is accessed.

There are times when you WANT to read something as non-cached. For example, a variable set by the other cpu. Then you either need to flush the address of the variable before reading it, or to read it as uncached.

saxman
Interested
Posts: 19
Joined: Mon Sep 15, 2008 6:35 am

Re: Untitled 32X Super Scalar Project

Post by saxman » Sat Jan 08, 2022 4:51 am

Chilly Willy wrote:
Sat Jan 08, 2022 12:36 am
Make sure the Z80 is held reset
I've wondered about this for a while now. So if I'm interpreting you correctly, it sounds like unless you want to hinder the system's overall performance, using the Z80 for anything at all does more harm than good in the case of 32X development.

Chilly Willy
Very interested
Posts: 2984
Joined: Fri Aug 17, 2007 9:33 pm

Re: Untitled 32X Super Scalar Project

Post by Chilly Willy » Sat Jan 08, 2022 11:25 pm

saxman wrote:
Sat Jan 08, 2022 4:51 am
Chilly Willy wrote:
Sat Jan 08, 2022 12:36 am
Make sure the Z80 is held reset
I've wondered about this for a while now. So if I'm interpreting you correctly, it sounds like unless you want to hinder the system's overall performance, using the Z80 for anything at all does more harm than good in the case of 32X development.
You'd have to be careful about how you used the Z80. It would need to run completely within its ram, and make very few accesses to anything on the main MD bus. What can be done with the Z80 in a 32X game can easily be done by the 68000 instead. The SCD 68000 is rather more isolated from the MD, so you should be able to make more use of it than the MD 68000.

pw_32x
Interested
Posts: 19
Joined: Thu Dec 16, 2021 12:26 am

Re: Untitled 32X Super Scalar Project

Post by pw_32x » Sun Jan 09, 2022 1:07 am

Chilly Willy wrote:
Sat Jan 08, 2022 12:36 am
One thing you need for best speed on the SH2 side - keep the MD off the bus. Make sure the Z80 is held reset, the 68000 is running in work ram, and the VDP is not doing any DMA. Running the 68000 main() in rom is enough to maybe halve the speed of the 32X side.
Looking around marsdev toolkit, there's a bit in m68k_crt1.s that copies the 68000 main loop to work ram.
There's also a bit about halting the Z80 and initializing the FM chip. And I'm not doing anything with the VDP on purpose, so I should be OK on that front.
Running sh2 functions from sdram makes them cache quicker. The sdram does burst reads - 8 words in 12 cycles. Nothing else burst reads, meaning the cache is much slower to load.
So far I've moved drawing and dirty rect cleaning functions to SDRAM for minor speedups. I was hoping for something more dramatic :)
Oh, and make sure you have the cache turned on and set to 4 way set associative.
Looking at boot.s, it's setting the master/slave cache registers to 0x11 aka 00010001, which the docs say enables (bit 0) and purges (bit 4) the cache. Bit 3 is 0, which selects 4-way mode. So I think it's good on that end.
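The bit reading above can be checked with a few lines of C. This is only a sketch: the bit positions are the ones given in the post (bit 0 enable, bit 3 two-way select, bit 4 purge), while the constant and function names are made up for illustration and do not come from boot.s or any devkit header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as described in the post; the names are illustrative only. */
enum {
    CCR_ENABLE  = 1u << 0,  /* cache enabled */
    CCR_TWO_WAY = 1u << 3,  /* 1 = 2-way mode, 0 = 4-way set associative */
    CCR_PURGE   = 1u << 4   /* purge the whole cache */
};

static inline bool ccr_cache_enabled(uint8_t ccr)   { return (ccr & CCR_ENABLE) != 0; }
static inline bool ccr_four_way(uint8_t ccr)        { return (ccr & CCR_TWO_WAY) == 0; }
static inline bool ccr_purge_requested(uint8_t ccr) { return (ccr & CCR_PURGE) != 0; }
```

Feeding 0x11 through these helpers reports the cache as enabled, in 4-way mode, with a purge requested, which matches the reading of boot.s above.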
This also applies to data in the rom - use cache enabled pointers to data in the rom to cache the data. I'm not sure if this really needs to be said, but using a pointer with the caching bits of the pointer set to uncached space means the data will not cache - it will read from the rom every time it is accessed.
If the cache is activated, do I have to worry about this? If I have to worry about it, how do I tell if a pointer is cached or not?
There are times when you WANT to read something as non-cached. For example, a variable set by the other cpu. Then you either need to flush the address of the variable before reading it, or to read it as uncached.
How do I do that? Possible in C?

On another topic, how painful is it to read from video memory? Let's say I store sprites in the empty non-visible space. Is it more painful to read and draw into video memory (or even DMA copy?) than from SDRAM? Is it too slow to perform, say, basic transparency effects? Read a pixel, halve the value, write it back, etc.

Chilly Willy
Very interested
Posts: 2984
Joined: Fri Aug 17, 2007 9:33 pm

Re: Untitled 32X Super Scalar Project

Post by Chilly Willy » Sun Jan 09, 2022 5:40 pm

pw_32x wrote:
Sun Jan 09, 2022 1:07 am
So far I've moved drawing and dirty rect cleaning functions to SDRAM for minor speedups. I was hoping for something more dramatic :)
If the code is small enough to cache completely and is then called in a loop, without other functions big enough to knock it from the cache also being called, you won't see much of a difference beyond the first call, when it is first loaded into the cache. This is kind of an ideal condition for a main loop - small enough to cache completely, and not getting bumped from the cache by other code. Such code COULD be left in rom if you run low on sdram.
This also applies to data in the rom - use cache enabled pointers to data in the rom to cache the data. I'm not sure if this really needs to be said, but using a pointer with the caching bits of the pointer set to uncached space means the data will not cache - it will read from the rom every time it is accessed.
If the cache is activated, do I have to worry about this? If I have to worry about it, how do I tell if a pointer is cached or not?
The upper nibble of the pointer being 2 sets the pointer as uncached. That's why all the 32X hardware (from the SH2 perspective) has a top nibble of 2 - so that hardware is not dealing with the cache. For example, the color palette ram is 0x20004200. The actual SH2 address is 0x4200, and that is ORed with 0x20000000 to make an uncached pointer to it.
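To answer the "how do I tell" part in code, here is a minimal sketch that works on plain 32-bit address values. The helper names are hypothetical (they are not from 32x.h); the only facts used are the ones above: a top nibble of 2 means uncached, and ORing with 0x20000000 produces the uncached alias.

```c
#include <stdbool.h>
#include <stdint.h>

#define SH2_UNCACHED_BIT 0x20000000u  /* uncached alias bit, as described above */

/* True when the top nibble of the address is 2, i.e. an uncached pointer. */
static inline bool sh2_is_uncached(uint32_t addr)
{
    return (addr >> 28) == 0x2u;
}

/* Uncached alias of an address. */
static inline uint32_t sh2_to_uncached(uint32_t addr)
{
    return addr | SH2_UNCACHED_BIT;
}

/* Cached alias of an address (clears the uncached bit again). */
static inline uint32_t sh2_to_cached(uint32_t addr)
{
    return addr & ~SH2_UNCACHED_BIT;
}
```

With these, the palette example works out as expected: sh2_to_uncached(0x4200) gives 0x20004200, and sh2_is_uncached() is false for an sdram address such as 0x06000000.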

If you OR the pointer with 0x40000000, writing to the location will flush the address from the cache instead of actually writing anything. You would do this to flush a variable or struct or array from the cache. Note that it flushes the whole cache line containing the address, meaning the 16-byte-aligned block of sixteen bytes that the address falls in. That can be important when you access common data between processors. Make sure such data is 16-byte aligned and doesn't cross a 16-byte boundary, or a single flush may only purge part of the data and leave the rest in the cache.
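One way to get the alignment described here is to let the compiler enforce it. The following is a sketch assuming a C11 compiler; SharedSprite and its fields are invented for illustration and are not from any 32X codebase. The pixel data pointer is stored as a 32-bit address so the struct has the same layout regardless of the host pointer size.

```c
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical struct shared between the two SH2s. Aligning the first member
   to 16 bytes aligns the whole struct and pads its size to a multiple of 16. */
typedef struct {
    alignas(16) int16_t x;
    int16_t  y;
    uint16_t width;
    uint16_t height;
    uint32_t pixels;   /* SH2 address of the pixel data */
} SharedSprite;

/* Fail the build if the struct ever outgrows one cache line. */
_Static_assert(alignof(SharedSprite) == 16, "must start on a cache line");
_Static_assert(sizeof(SharedSprite) == 16, "must fit in one cache line");
```

With this layout every element of a SharedSprite array occupies exactly one cache line, so one flush per element is enough.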

As to whether you need to worry about it, if you look at the LD file, the memory in the 32X is marked like this

Code: Select all

MEMORY
{
    rom (rx) : ORIGIN = 0x02000000, LENGTH = 0x00400000
    ram (wx) : ORIGIN = 0x06000000, LENGTH = 0x0003FC00
}
Both the rom and ram default to cached addressing. If you wish to make a variable/data/array/whatever uncached, you will need to do something like this

Code: Select all

    int x, w;
    int *y;

    y = (int *)((int)&x | 0x20000000);
    *y = 0x12345678;
    w = *y;
There are times when you WANT to read something as non-cached. For example, a variable set by the other cpu. Then you either need to flush the address of the variable before reading it, or to read it as uncached.
How do I do that? Possible in C?
uncached_pointer = (void *)((unsigned int)pointer | 0x20000000);

You'll notice in 32x.h, all the hardware pointers already have that 0x20000000 added in. They're usually also set as volatile. Volatile needs to be set for any reference that may be hardware (and hence changed at random according to what the hardware does), or if the referenced location can be changed by an interrupt handler.
On another topic, how painful is it to read from video memory? Let's say I store sprites in the empty non-visible space. Is it more painful to read and draw into video memory (or even DMA copy?) than from SDRAM? Is it too slow to perform, say, basic transparency effects? Read a pixel, halve the value, write it back, etc.
Unused vram can be used how you like. It's even reasonably fast to read/write (compared to the cart). Just remember that byte writes of 0 are ignored. So I'd recommend not putting byte variables/data into vram if you are going to write it one byte at a time. DMA to/from vram works fine. My current Wolf32X draws to a buffer in sdram, then DMAs it to vram.

Yeti3D runs in 16-bit video mode, and does a transparent effect by reading the word pixel from vram, adding in the texture pixel, halving the result, and storing it. You do want to keep this kind of drawing to a minimum since there's barely enough time to just draw the data, much less mix it, if you want high FPS.
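For reference, a 50/50 mix of two 16-bit pixels can be sketched like this. It assumes the 15-bit direct color layout of three 5-bit channels in bits 0-14. The masking step is an addition of this sketch, not necessarily what Yeti3D does, and blend_half is a made-up name.

```c
#include <stdint.h>

/* Average two 15-bit direct color pixels channel by channel.
   Clearing the low bit of each 5-bit channel first keeps the shift from
   carrying a bit from one channel into its neighbour. Bit 15 is dropped. */
static inline uint16_t blend_half(uint16_t dst, uint16_t src)
{
    const uint32_t mask = 0x7BDEu;  /* 11110 11110 11110 in binary */
    return (uint16_t)(((dst & mask) + (src & mask)) >> 1);
}
```

In a draw loop this would be used as vram[i] = blend_half(vram[i], texel), which costs one extra vram read per pixel on top of the write.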

pw_32x
Interested
Posts: 19
Joined: Thu Dec 16, 2021 12:26 am

Re: Untitled 32X Super Scalar Project

Post by pw_32x » Sun Jan 23, 2022 8:33 pm

Started messing around with the second CPU. The info on the uncached pointer stuff was super helpful, thanks!

https://twitter.com/pw_32x/status/1485347144605257735

First thing I want to try out is splitting the sprite rendering across both CPUs. I hope I'll see a performance bump. Performance has slowly gotten lower lately.

pw_32x
Interested
Posts: 19
Joined: Thu Dec 16, 2021 12:26 am

Re: Untitled 32X Super Scalar Project

Post by pw_32x » Mon Jan 24, 2022 1:49 am

So I did a quick and dirty implementation of drawing the top and bottom half of the "game field" objects (spheres, player, trees) across both CPUs and the results were very encouraging. The game's frame rate jumped by 10 fps which really got my hopes up. I'll be splitting up some more work across both CPUs and be cleaning up the sprite drawing routines some more and hopefully I'll get even more performance improvements.

Related Twitter post:
https://twitter.com/pw_32x/status/14854 ... 11109?s=20

pw_32x
Interested
Posts: 19
Joined: Thu Dec 16, 2021 12:26 am

Re: Untitled 32X Super Scalar Project

Post by pw_32x » Tue Jan 25, 2022 1:47 am

Somewhat ill-defined question:
Say I have a pointer to a struct whose fields point to other structs and I pass the pointer from one CPU to another to work on. On the second CPU, do I have to work uncached for all pointers if I want to be sure I have the updated values? Or just the "root" pointer?

I've started moving work from the main CPU to the secondary CPU and I want to be sure that the data the secondary works with is correct/up-to-date.

cero
Very interested
Posts: 338
Joined: Mon Nov 30, 2015 1:55 pm

Re: Untitled 32X Super Scalar Project

Post by cero » Tue Jan 25, 2022 8:28 am

Uncached on both cpus, or manual syncs between.

Chilly Willy
Very interested
Posts: 2984
Joined: Fri Aug 17, 2007 9:33 pm

Re: Untitled 32X Super Scalar Project

Post by Chilly Willy » Tue Jan 25, 2022 11:02 am

The SH2 does write-through on writing, so writes will always be in ram. So the other SH2 needs to flush its own cache of the range over all parts of the struct before reading it in order to get the current data. That's assuming the cache is used, and you aren't using uncached access to all parts of the struct. That is what I do in my code for dealing with sound mixing. All channels and other data are cached, and when the primary SH2 updates the list, the secondary SH2 flushes the entire entry from its cache before mixing all channels. Flushing is very easy on the SH2: you OR 0x40000000 with the pointer, write anything at all (the write is ignored, it's just a signal to the cache unit to flush the cache line), and the cache line at the pointer is flushed. If the struct is more than a cache line (16 bytes), add 16 to the pointer, flush the next line, and repeat over the size of the struct.
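The flush loop described above could be sketched as follows. This is an illustration only: the function and macro names are made up, the purge constant is the 0x40000000 value given in this thread, and flush_range only makes sense when run on the SH2 itself. The two small helpers are plain arithmetic and can be checked anywhere.

```c
#include <stddef.h>
#include <stdint.h>

#define SH2_CACHE_LINE  16u
#define SH2_CACHE_PURGE 0x40000000u  /* purge alias, per this thread */

/* First byte of the 16-byte cache line that holds addr. */
static inline uintptr_t line_start(uintptr_t addr)
{
    return addr & ~(uintptr_t)(SH2_CACHE_LINE - 1);
}

/* Number of cache lines touched by the byte range [addr, addr + size). */
static inline size_t lines_spanned(uintptr_t addr, size_t size)
{
    if (size == 0)
        return 0;
    return (size_t)((line_start(addr + size - 1) - line_start(addr))
                    / SH2_CACHE_LINE) + 1;
}

/* Purge every line covering [ptr, ptr + size) from this CPU's cache.
   Only meaningful on the SH2: the write goes to the purge alias and the
   value written is ignored. */
static inline void flush_range(const void *ptr, size_t size)
{
    uintptr_t addr = line_start((uintptr_t)ptr);
    size_t n = lines_spanned((uintptr_t)ptr, size);

    while (n--) {
        *(volatile uint32_t *)(addr | SH2_CACHE_PURGE) = 0;
        addr += SH2_CACHE_LINE;
    }
}
```

The secondary SH2 would call something like flush_range(entry, sizeof(*entry)) right before reading an entry the primary has just updated. Note that an 8-byte object starting at offset 12 of a line spans two lines, which is why the line count is computed from both ends of the range.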

Note how I say Primary and Secondary for the SH2s - I used to use Master and Slave since that's how Sega refers to them, and that was how the whole industry dealt with things like that (like master and slave floppy drives, etc). However, I've seen people become completely unhinged over this use of language, and derail threads to rant about how slavery is evil. We know that. It doesn't apply to electronics. But rather than argue with someone who won't be agreeing to anything you say, I've decided to get ahead of the curve and call the two processors "primary" and "secondary". You'll find that in the Doom 32X Resurrection code. I'll be updating my 32X devkit the same way when I do the next update on it. Better to just avoid the crazies than to have long rants that detract from the discussion.

EDIT: Associative Purge constant is 0x40000000, not 0x60000000!
Last edited by Chilly Willy on Sun Jan 30, 2022 5:24 pm, edited 1 time in total.

cero
Very interested
Posts: 338
Joined: Mon Nov 30, 2015 1:55 pm

Re: Untitled 32X Super Scalar Project

Post by cero » Tue Jan 25, 2022 5:28 pm

Oh, didn't know it was write-through, thanks for the correction. And sad that you bowed to the crazies. I would have kept the words and banned the crazies.

Chilly Willy
Very interested
Posts: 2984
Joined: Fri Aug 17, 2007 9:33 pm

Re: Untitled 32X Super Scalar Project

Post by Chilly Willy » Tue Jan 25, 2022 9:49 pm

When I was younger, I'd have baited the crazies. As I've gotten older, I find it much less trouble to just think ahead and avoid the hassle. Choose your battles. I choose not to fight the idiots over such a minor matter of terms. Primary and secondary works just as well, and doesn't trigger the idiots.
:lol:

pw_32x
Interested
Posts: 19
Joined: Thu Dec 16, 2021 12:26 am

Re: Untitled 32X Super Scalar Project

Post by pw_32x » Wed Jan 26, 2022 2:59 am

I'm still iffy about when to uncache things. I'll describe the scenarios I currently have.

Scenario 1:
I have an array of sprites to draw. It's filled every frame, drawn, and then reset.
Timeline:
(main cpu): I go through every game entity in the scene and each entity fills the sprite array with the sprites it needs to draw for the frame.
(main cpu): call the scene's drawing function, which does:
(main cpu): For each sprite in the sprite array, I pass the sprite as a pointer to a function that draws the top part on the main cpu and then sends the same pointer to the secondary cpu for it to draw the bottom half.
(main cpu): draws top part of a sprite
(secondary cpu): draws the bottom part of a sprite

If I want the data the pointer refers to to be current on the secondary cpu, do I need to use the uncached version of the pointer, or do I flush it? Or both?

Also, the sprite points to a struct of pixel data. The pointer to the pixel data is set every frame on the sprite struct, but pixel data itself doesn't change after being loaded. Do I need to use the uncached pointer to the pixel data too?



Scenario 2:

I have a global pointer to a game entity that manages the scaling cloud layer, DrawManager_skyEntity.

Timeline:
(main cpu): update the cloud layer state (timers, etc. No drawing yet)
(main cpu): in the scene's draw function, tell the secondary cpu to draw the cloud layer while we do other work
(secondary cpu): call the DrawManager_skyEntity's draw function, which will draw all the clouds.

Code: Select all

    // on the secondary cpu, we've received the message from the main cpu to draw the clouds
    Entity* skyEntity = (Entity*)((u32)DrawManager_skyEntity | 0x20000000); // necessary?
    skyEntity->draw(skyEntity, NULL);
Do I use an uncached pointer to skyEntity at that point?

The entity's drawing function looks like

Code: Select all

// on secondary cpu
static void draw(CloudsEntity* clouds)
{
    drawClouds(clouds->clouds, clouds->terrainZ); // helper draw function
}
And the helper function. The cloud layer is basically "terrain" like the tree layer, just different sprites.

Code: Select all

// on secondary cpu
void drawClouds(const Terrain* terrain, const SceneData* sceneData, fixedp cloudZ)
{
    // draw the clouds to screen
}
In this scenario, I don't know what I should be doing. Flush the clouds entity? The terrain?


Scenario 3:
I have an XY array of dirty areas that is filled when I draw sprites. Every element covers an 8x8 area of the screen. At the start of every frame, I go through the array and erase the corresponding areas of the screen. I divide the screen in two and each cpu clears its half.

On the secondary cpu, do I have to flush its half of the dirty rectangle array?

As you can see, I don't quite have a handle on it yet. Right now I know I'm doing it wrong. The game runs and there are performance improvements but I've noticed small graphical glitches that smell like the secondary cpu isn't working with up to date data. I want to figure out the proper rules of thumb for this. I appreciate the support!

Chilly Willy
Very interested
Posts: 2984
Joined: Fri Aug 17, 2007 9:33 pm

Re: Untitled 32X Super Scalar Project

Post by Chilly Willy » Wed Jan 26, 2022 11:01 am

1) The sprite pointer and the struct it points to both need to be either uncached OR flushed (you don't need to do both, and the flushing is done by the other CPU, not the main one). The pixel data doesn't need to be flushed, and should only be uncached if the draw routine floods the cache with pixel data. You can try it cached and uncached to see which is faster overall after you get it working.

2) The function itself does not need to be uncached, and indeed, making a function uncached will make it very slow as all instructions are fetched every time (which will make any loops in the function slow). Also, if the function is in rom instead of sdram, this will make the function horribly slow as it has to fetch from rom. In general, don't make code uncachable. As Vic mentioned, you might also play around with the optimization level to generate small code (-Os) that fits better in the cache.

Any global variables/structs used by the function probably need to be flushed (by the secondary cpu) or it will be using stale data. Those vars could also be uncached. Remember that if the secondary cpu sits in a loop inside that function looking at variables set by the main cpu, waiting for some kind of sync notification or command, that variable needs to be declared volatile so the compiler knows to reload it from memory rather than reuse the value previously loaded into a register. Also, each time through the loop, you'd need to flush the vars, or have them set as uncached. This particular paragraph is probably where 90 percent of cache coherency issues arise.
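A wait loop of the kind described here might look like the sketch below. The function name and the "non-zero means a command is pending" convention are assumptions for illustration. On hardware the caller would pass the uncached alias of the shared variable (its address ORed with 0x20000000) so that every read goes to memory.

```c
#include <stdint.h>

/* Secondary SH2 side: spin until the main cpu posts a non-zero command,
   then clear the slot to acknowledge it and return the command.
   cmd should be an uncached pointer to the shared variable. */
static inline uint32_t wait_for_command(volatile uint32_t *cmd)
{
    uint32_t value;

    /* volatile forces a real load from memory on every pass */
    while ((value = *cmd) == 0) {
    }

    *cmd = 0;  /* acknowledge; the main cpu must also read this uncached */
    return value;
}
```

Without volatile the compiler is free to load *cmd once and spin on the register copy forever. Without the uncached pointer (or a flush on each pass) the loop can spin on a stale cache line instead.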

3) Yes. If the main cpu writes the x/y array, the half of the array the secondary reads must be flushed (by the sec) to read the current values.
