I'm working on a top secret project where I am hitting a tough choke point with VRAM and Secondary VRAM speed.
Specifically, I am using double buffering in my render loop with these steps on alternating buffers (between VRAM and secondary VRAM)
1) Clear the colors
2) Render stuff
3) Begin DMA on current buffer to LCD
4) Switch buffers for next loop
So, issues I'm running into are:
1) If I just don't bother to double buffer, and do PutDisp for step 4, my render loop runs at 11.0 FPS. With double buffering, it runs at 11.6 FPS. Does the PutDisp call really only take 5 ms? Or am I screwing up my memory access/DMA somewhere?
2) The Clear the Colors step (in BOTH cases of using DMA for double buffering or not) is really oddly slow. Like it's taking up 30% of my frame time. So it takes about 20 ms for me to run memset(0) on a full screen buffer. This is using the optimized memset I worked out with Lephe. 30 CPU cycles per pixel! Does anyone know of a faster way to clear the screen? Are there caching properties I can fiddle with, etc etc?
I am open to anything, including using different RAM space. For NESizm and Prizoop I had the luxury of rendering scanlines, so I pushed everything through on-chip memory and avoided RAM speed entirely for rendering. Unfortunately, in this case 8KB is too little contiguous RAM space.
Hi,
how do you do memset()? Can you post the code? 30 cycles per pixel is too much. I assume you work in RAM. You can set 2 pixels at once, because 1 pixel uses 2 bytes, but the CPU is 32 bit (= 4 bytes).
Assume the VRAM address is divisible by 4 and the number of pixels is divisible by 2:
void memset4(unsigned *p,unsigned v,unsigned count)
{while(count--) p[count] = v;}
You will call it:
memset4((unsigned *)GetVRAMAddress(),0,(WIDTH * HEIGHT) / 2);
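If the clear colour is not black, the same two-pixels-per-store idea needs the 16-bit colour duplicated into both halves of the 32-bit word. A minimal sketch under the same alignment assumptions; the name fill_pixels32 is made up for illustration and is not part of any SDK:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill `pixel_count` 16-bit pixels with `color`, writing two pixels per
 * 32-bit store. Assumes `vram` is 4-byte aligned and `pixel_count` is even,
 * the same preconditions as memset4 above. Hypothetical helper. */
static void fill_pixels32(void *vram, uint16_t color, size_t pixel_count)
{
    assert(((uintptr_t)vram & 3u) == 0);
    assert((pixel_count & 1u) == 0);

    /* Same colour in both halves, so the result does not depend on byte order. */
    uint32_t packed = ((uint32_t)color << 16) | color;
    uint32_t *p = (uint32_t *)vram;
    size_t words = pixel_count / 2;

    for (size_t i = 0; i < words; i++)
        p[i] = packed;
}
```

It would be called like memset4, e.g. fill_pixels32(GetVRAMAddress(), 0xF800, WIDTH * HEIGHT) for a red screen in RGB565.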
Thanks for responding! So the stuff Lephe and I put together for the memset libc call does this automatically, and finds the 4 byte aligned boundaries and runs through several words at a time for efficiency.
A quick test of not doing the memset but just doing a loop like you suggest shows almost identical performance (actually like 1% slower than the optimized assembly)
This really does appear to be a result of the 'state' of the calculator. I'm betting an inefficient loop would even perform similarly, it doesn't seem to be CPU instruction bound at all.
tswilliamson wrote:
...This really does appear to be a result of the 'state' of the calculator. I'm betting an inefficient loop would even perform similarly, it doesn't seem to be CPU instruction bound at all.
What is your high level plan? Draw into one VRAM buffer and (in parallel) do DMA transfer of the other VRAM buffer? If yes, I think it might explain the slow memory access, the CPU shares the access to the RAM with DMA, so it works slower.
What does PutDisp() do? The whole DMA transfer of the image? Then you do not need 2 buffers...
I have been out of Prizm development for a long time, but I think I remember my grayscale video player, where I did a loop: draw to VRAM, DMA copy to LCD, wait a while to keep the speed, go to the next frame. The video had 15 frames per second and it worked (it was in sync with the original). I do not remember the max speed with the time sync removed; I tested it for sure, but forgot the result.
It was in the early times, when we did not overclock the calculator.
BTW: what is the speed, if you skip memset() call? I assume you draw at least something.
Yeah pretty much all the cycles are gained back if I skip memset. The current plan was to try DMA on one buffer while I constructed the next buffer, yes, with the thinking that there would be some speed gain.
It seems like I can just try using burst mode DMA to avoid that hassle, I'll see if that does better than PutDisp().
In addition, I noticed that the DMA module lets you specify if the src address increments or not, so I can just fill 32 bytes with my memset value and try to use a DMA burst on that as well to do the screen clear and see if that is faster than memset.
tswilliamson wrote:
Yeah pretty much all the cycles are gained back if I skip memset. ...
Can you rearrange the code? I assume your function is like:
1. memset()
2. compute something
3. draw something
And in the same time the DMA process is:
A. DMA transfer
B. nothing, because it is done
So your step #1 interferes with step #A. If you move the memset to the end as much as possible like:
1. compute everything that can be computed in advance
2. memset()
3. draw
This way the CPU would not be blocked (so much) by DMA, as there would be a lot of computing and not so much RAM access. The #2 memset might overlap with #B, so it would go faster.
Sorry, I am just thinking and I might be wrong.
Oh, good idea. Actually, the drawing itself is computationally expensive. Even better, with double buffering I can actually do this:
1) Render Buffer #1
2) Wait DMA (though it's almost certainly done at this point), Clear Buffer #2
3) DMA Buffer #1
4) Render Buffer #2
5) Wait DMA, Clear Buffer #1
6) DMA Buffer #2
This way the DMA is always happening during the more computationally expensive render phase, where I am tying up the CPU with a lot of mult instructions and so on.
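Here is a rough sketch of that ordering as a loop. The functions render, wait_dma, clear and start_dma are hypothetical stand-ins that only record what would happen, so the sequence can be checked on a PC; on the calculator they would be replaced by the real rendering and DMA calls. It assumes Buffer #1 was already cleared before the first frame.

```c
#include <stdio.h>
#include <string.h>

/* Trace of the steps taken, e.g. "R1 W C2 D1 ". Only for illustration. */
static char trace[128];

static void step(const char *name, int buf)
{
    size_t n = strlen(trace);
    snprintf(trace + n, sizeof trace - n, "%s%d ", name, buf);
}

/* Hypothetical stand-ins for the real operations. */
static void render(int buf)    { step("R", buf); }   /* CPU-heavy drawing      */
static void clear(int buf)     { step("C", buf); }   /* clear the colours      */
static void start_dma(int buf) { step("D", buf); }   /* begin DMA to the LCD   */
static void wait_dma(void)
{
    strncat(trace, "W ", sizeof trace - strlen(trace) - 1);
}

static void run_frames(int frames)
{
    int cur = 1;                    /* buffers are numbered 1 and 2 */
    for (int f = 0; f < frames; f++) {
        int other = 3 - cur;
        render(cur);                /* previous DMA runs during this phase  */
        wait_dma();                 /* other buffer must be fully sent      */
        clear(other);               /* now safe to clear it                 */
        start_dma(cur);             /* send the freshly rendered buffer     */
        cur = other;                /* switch buffers for the next frame    */
    }
}
```

Two frames give the trace "R1 W C2 D1 R2 W C1 D2 ", which matches steps 1 to 6 above.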
Will try it tonight and get back.
Is this on the CE? If so, have you considered using the ldir instruction? Cycle count would be 2F+(1R+1W+1)*bc. In the case of clearing the screen, it would (probably) end up taking a lot less than 30cc per pixel.
ldir is awesome for z80 land! But no, this is for the Prizm
And the main guess at the issue appears to be that the RAM is being tied up by both processes, rather than it being a sheer instruction count issue.
tswilliamson wrote:
ldir is awesome for z80 land! But no, this is for the Prizm
And the main guess at the issue appears to be that the RAM is being tied up by both processes, rather than it being a sheer instruction count issue.
This comment isn't related to this thread, sorry.
I am a Prizm user but not a programmer. Don't know any code.
Do you know how to convert a source code folder to a .g3a prizm program? i have been following your work for the past few years, i think you know how to do this? Sorry if you have no time left.
Here is the link: https://mega.nz/folder/bRsmQI5C#efCskbOrITYTfMFqecP7uQ
You'll have to talk to gbl08ma about that one, looks like it is his code. And yes, please don't reply in this thread, create a new one in the forum for something like this!
tswilliamson wrote:
You'll have to talk to gbl08ma about that one, looks like it is his code. And yes, please don't reply in this thread, create a new one in the forum for something like this!
Ok, I will! Also, I think gbl08ma has said that they stopped developing for the Prizm, so it's best not to ask them. It's just kind of rare to find active Prizm programmers, I guess.
So circling back on this with you MPoupe, I've managed to make the process much faster with a DMA burst. I'm not sure of the implications of this, but everything speaks to that RAM just being abysmally slow:
1) Doing the memset at any point of the rendering loop was just as slow
2) Using DMA OR PutDisp made no difference in the memset speed either.
However, I was able to perform a memset operation using the DMA module that was about an order of magnitude faster! This was done by setting DMA0_CHCR_0 with burst mode (bit 5) and a fixed source address, and then providing a 32-byte block to repeatedly copy. It runs ~10x faster than memset. This works well enough for me and my framerate is up to 14 FPS. I'm trying to get to 20, but this was the lowest-hanging fruit.
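For anyone following along, here is a small software model of what that fixed-source fill does. This is not DMA register code and does not touch DMA0_CHCR_0; it only mimics the data movement (source stays on one 32-byte block, destination advances) so the block layout and size constraint can be checked on a PC. The function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FILL_BLOCK 32u   /* block size mentioned in the post */

/* Build the 32-byte source block: the 16-bit colour repeated 16 times. */
static void make_fill_block(uint8_t block[FILL_BLOCK], uint16_t color)
{
    for (size_t i = 0; i < FILL_BLOCK; i += 2)
        memcpy(block + i, &color, sizeof color);
}

/* Model of the fill: copy the same block to each consecutive 32-byte slot
 * of the destination. `bytes` must be a multiple of 32. Returns the number
 * of blocks written. */
static size_t model_dma_fill(void *dst, const uint8_t block[FILL_BLOCK],
                             size_t bytes)
{
    assert(bytes % FILL_BLOCK == 0);

    uint8_t *d = dst;
    size_t blocks = bytes / FILL_BLOCK;
    for (size_t i = 0; i < blocks; i++)
        memcpy(d + i * FILL_BLOCK, block, FILL_BLOCK);
    return blocks;
}
```

Assuming a 384x216 buffer of 16-bit pixels, a full clear is 165,888 bytes, i.e. 5,184 blocks of 32 bytes, so the size constraint is met without padding.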
I love the Prizm but stopped developing for it some time before the OS changed the main menu background from black to white; I cannot remember the year. I wonder if you mean something similar to what I used for displaying Pokemon icons in one of my add-ins. I don't remember why (to speed something up, to draw a background behind the Pokemon, or something else), but I had the following code and am pretty sure it handled drawing row by row or something like that. I definitely used this technique to draw 92 pixels at once, and possibly adapted it in other add-ins to draw the entire width of the display or more (I don't know if it helps you):
int widthX2 = 184;
int LCD_WIDTH_X2 = 768;
for (int i=0; i < height; i++) {
memcpy(VRAM, (void*)(someAdr + i*widthX2), widthX2);
VRAM += LCD_WIDTH_X2;
}
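A self-contained version of the same row-by-row idea, counting in pixels rather than bytes, might look like the sketch below. The function name and parameters are invented for illustration; in the snippet above the equivalents would be w = 92 (184 bytes) and fb_width = 384 (768 bytes).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy a w x h block of 16-bit pixels into a frame buffer whose rows are
 * fb_width pixels wide, using one memcpy per row. The caller must make
 * sure the block fits inside the frame buffer. Hypothetical helper. */
static void blit_rows(uint16_t *fb, size_t fb_width,
                      const uint16_t *src, size_t w, size_t h,
                      size_t x, size_t y)
{
    uint16_t *dst = fb + y * fb_width + x;

    for (size_t row = 0; row < h; row++) {
        memcpy(dst, src + row * w, w * sizeof *src);
        dst += fb_width;            /* advance one full screen row */
    }
}
```

This speeds up copying an image into VRAM, but it does not by itself help with clearing, since the destination RAM is written either way.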
I'm a bit late to the party, but I can add a few elements here.
Quote:
1) If I just don't bother to double buffer, and do PutDisp for step 4, my render loop runs at 11.0 FPS. With double buffering, it runs at 11.6 FPS. Does the PutDisp call really only take 5 ms? Or am I screwing up my memory access/DMA somewhere?
At default overclock levels, sending a full VRAM to the display takes about 11 ms with the DMA, with a peak at about 90 FPS.
However, this is the ideal situation, when the DMA has full control of the bus during this time. You only gain 11 ms on your frame if you have 11 ms of bus-free work to do. The DMA probably runs in cycle-steal mode, so it's doing work only during the RAM cycles where you're not writing data to VRAM. Burst mode might be more efficient cache-wise, but don't take my word for it.
Since you spend most of your time clearing the secondary VRAM after finishing a frame, the DMA is slowed down considerably. As MPoupe suggested, doing the computations there is definitely your best bet.
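For reference, the "5 ms" in the quoted question is just the difference in frame time between 11.0 and 11.6 FPS, and the "about 90 FPS" peak is the inverse of the 11 ms transfer. A tiny sketch of the arithmetic (helper names are made up):

```c
#include <assert.h>

/* Frame time in milliseconds for a given frame rate. */
static double frame_ms(double fps)
{
    assert(fps > 0.0);
    return 1000.0 / fps;
}

/* Milliseconds saved per frame when going from fps_before to fps_after. */
static double saved_ms(double fps_before, double fps_after)
{
    return frame_ms(fps_before) - frame_ms(fps_after);
}
```

saved_ms(11.0, 11.6) comes out at roughly 4.7 ms, so less than half of the 11 ms transfer was actually hidden behind other work in that test.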
Quote:
2) The Clear the Colors step (in BOTH cases of using DMA for double buffering or not) is really oddly slow. Like it's taking up 30% of my frame time. So it takes about 20 ms for me to run memset(0) on a full screen buffer. This is using the optimized memset I worked out with Lephe. 30 CPU cycles per pixel! Does anyone know of a faster way to clear the screen? Are there caching properties I can fiddle with, etc etc?
From my benchmarks it takes 6.1 ms to clear the screen with an optimized CPU memset() and 2.5 ms with a 32-byte DMA memset() from ILRAM, on an fx-CG 50. Here's the function I use.
Ultimately you can't do parallel access to VRAMs that are both in the main RAM. If you're stuck with that (... and have further performance problems, which appears to be solved for now) I'd suggest testing with a sequential workflow first to have clear benchmarks and then check if parallelizing works. ^^
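To put the clear timings above in perspective, they can be turned into approximate write throughput. This sketch assumes a full buffer is 384 x 216 pixels at 2 bytes each (165,888 bytes); that size is an assumption here, not something measured in this thread, and the helper name is made up.

```c
#include <assert.h>

/* Assumed full-screen buffer size: 384 x 216 pixels, 16 bits per pixel. */
#define SCREEN_BYTES (384.0 * 216.0 * 2.0)

/* Throughput in MB/s (1 MB = 10^6 bytes) for `bytes` moved in `ms`. */
static double mb_per_s(double bytes, double ms)
{
    assert(ms > 0.0);
    return bytes / (ms * 1000.0);
}
```

With these assumptions, 6.1 ms is about 27 MB/s, 2.5 ms is about 66 MB/s, and the 20 ms reported in the original question is only about 8 MB/s.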
Hey Lephe, yeah I'm doing a lot of funky stuff that may be interfering with the memset being fast, but DMA still works. I've got a problem with DMA that I'm trying to work out next.
Everything I'm doing now is using burst mode, mostly because the great majority of my computational work involves these buffers. My current sequence is:
1) Clear Buffer A using DMA 'fill' which is done with a 32 byte sequence as the src address and burst mode enabled.
2) Wait for DMA
3) Clear Buffer B using similar DMA fill
4) Wait for DMA
5) Do work on Buffer A and B
6) DMA Buffer A to screen
7) Wait for DMA
I'm aware that some of these waits may not be necessary, but regardless I'm still running into a problem where the second fill in Step 3 simply does NOT work reliably. It appears like it is either reporting as done too quickly or simply skipping batches of 32 bytes at random. I'm still trying to nail down exactly what it is but when I put a giant loop after step 3 it still had the 'corruption', but if I don't do the DMA and do a very slow memset instead it works.
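One way to pin down the corruption described above would be to scan the buffer after the fill and count which 32-byte blocks were missed. A minimal sketch, with a made-up function name; it is plain C and knows nothing about the DMA hardware:

```c
#include <stddef.h>
#include <stdint.h>

#define DMA_BLOCK_BYTES 32u

/* Count the 32-byte blocks in `buf` that contain at least one 32-bit word
 * different from `expected`. If any are found and `first_bad` is not NULL,
 * the index of the first bad block is stored there. */
static size_t count_bad_blocks(const uint32_t *buf, size_t bytes,
                               uint32_t expected, size_t *first_bad)
{
    const size_t words_per_block = DMA_BLOCK_BYTES / sizeof(uint32_t);
    size_t blocks = bytes / DMA_BLOCK_BYTES;
    size_t bad = 0;

    for (size_t b = 0; b < blocks; b++) {
        for (size_t w = 0; w < words_per_block; w++) {
            if (buf[b * words_per_block + w] != expected) {
                if (bad == 0 && first_bad)
                    *first_bad = b;
                bad++;
                break;              /* one mismatch is enough for this block */
            }
        }
    }
    return bad;
}
```

Pre-filling the buffer with a sentinel value using the CPU before starting the fill would make skipped blocks stand out. One caveat, which is only a guess: if the CPU reads the buffer through a cached mapping, stale cache lines could look the same as skipped blocks.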
This sequence sounds really reasonable. But I'm not sure how the burst mode is supposed to interact with your waiting. Because the DMAC locks the bus in burst mode, your CPU access to the DMAC peripheral register that checks whether the transfer has finished cannot (in principle) be completed until the transfer is finished.
To be honest with you it looks like your DMA transfer is missing cycles, which I naturally want to link to the program interfering with memory access, but the operation is supposed to "withstand" this kind of interference.
Have you tried running it in cycle steal mode just to see whether it has an impact? Since you're waiting anyway this should be functionally the same.
During my tests playing with the DMA, I remember having artifacts in the VRAM-filling operation. I believe it varied depending on the source/destination area and the use of interrupts. Have you tried moving the input buffer to another area (such as ILRAM, which should also be faster)?
Nice ideas! I will try all these things tonight and report back. It's probably still some weeks away but I am excited to share what I've been working on the past few months.
Btw, here is one of the crazier things I am doing (see code snippet). During my main render function I am switching out the stack pointer (r15) with on-chip memory. This increased the speed of my code by ~40%. I don't see any real gains by moving the code to ILRAM though.
Code:
#if TARGET_PRIZM
// So this is a crazy little trick. Put the stack pointer into chip Y memory for
// the main render loop. Allows us to leverage all stack allocations at low latency,
// which is heavily used in complex code. However, stack allocations must remain
// less than 8K, or we'll.... break things
static const uint32 ChipMem_StartStack = (0xE5017000 + 8192 - 4);
register uint32 OldStack = 0;
uint32 NewStack = ChipMem_StartStack;
asm volatile inline (
"mov r15,%[OldStack] \n"
"mov %[NewStack],r15 \n"
"mov.l %[OldStack], @-r15 \n"
: [OldStack] "+a" (OldStack)
: [NewStack] "a" (NewStack)
:
);
#endif
cacheAndRender(this, toRender, cache);
#if TARGET_PRIZM
// restore the stack
asm volatile inline (
"mov.l @r15+,%[OldStack] \n"
"mov %[OldStack],r15 \n"
: [OldStack] "+a" (OldStack)
: "0" (OldStack)
);
#endif
So FUN, the DMA is broken when *targeting* specific areas of RAM, though the read seems to be ok.
Specifically, when using an aligned buffer from main stack memory or from the GetSecondaryVRAMAddress area, the DMA write is unreliable. It's SOO close though. Just missing chunks here and there. Maybe a clock rate issue?
Moving the stack is such a nice idea! I'd tried moving code with little impact, as you found out, but I never thought about the stack. I'll definitely give it a try later.
Using inline assembler does seem to be risky because the compiler could choose to initialize and remove the stack frame of the function in a way that doesn't line up with your assembler code. I believe it would be slightly safer to at least mark r15 as clobber. The very best way, however, would be the sp_switch function attribute, which makes the compiler generate suitable code for you. ^^
It seems that the missed cycles issue with the DMA is reproducible, then. But I don't know all the details of the timing issues on the calculator so I'm inclined to believe it is a configuration problem. Not only are there constraints on the relative frequencies of Pϕ and Bϕ, there are also multiple parameters influencing access wait times for each memory area in the BSC.
I believe it would be easy to break DMA transfers by mis-configuring RAM delays, and the OS' configuration might not be suitable for all transfers. Any overclocking is also a primary suspect, for instance Ptune changes wait delays in a way that has been thoroughly tested for simple programs but not as much for DMA access.