I am making a 3D Minecraft clone and I was curious about how much RAM I have to work with. Thank you!
You have a 512 kB fixed area, with data segment at the top and stack at the bottom. You can also steal 350 kB from the same-sized area used by the OS, which I believe has worked for years without any documented bad consequences. The heap is about 128 kB, and as far as large stable areas are concerned that's about it.

There are several on-chip memory regions with good to incredible performance, namely the ILRAM (4 kB), XRAM (8 kB) and YRAM (8 kB). The last two are linked to a DSP which is to the best of my knowledge the most powerful heavy-computation option in the calculator.

There is also 434 kB of extremely wild memory linked to two DSPs which may or may not exist (no one really knows, but the RAM is there for sure). Of these, 160 kB can only be accessed with longwords, and the rest has a hole in every 4 bytes (only 3 of them are backed by memory) while also being accessible only with longwords. Basically that's either a last resort or a crazy optimizer's gamble.

Last but not least you can use the storage memory for large data that is not accessed often, which represents up to 16 MB.

Overall, of well-mannered, easily-accessible, tried-and-tested RAM, you have a bit less than 1 MB.
How would I go about using these other areas of memory?
The main 512-kB area is virtualized at 0x08100000, and does not have execution permissions. The very start of this region cannot be used because this is where global variables are normally placed, so you need to start after the data segment. Your SDK should come with a linker script that arranges the segments; it should provide the end address of the data segment (if not, you can add it). With this information, and after choosing how much space you want to reserve at the end for the stack, you can do whatever you want with the remaining ~500 kB buffer. This area can also be accessed through its physical address (bypassing virtualization), but the physical address is not the same on all models, so it takes more work.
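A minimal sketch of carving buffers out of such a region, assuming you already have the two boundary addresses. The names `region_t`, `region_init` and `region_alloc` are made up for illustration; on the calculator, `start` would come from a linker-script symbol marking the end of the data segment and `end` would be the top of the 512 kB area, neither of which is defined here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A bump allocator over one contiguous region (hypothetical helper). */
typedef struct {
    uint8_t *next;   /* first free byte */
    uint8_t *limit;  /* one past the last usable byte, stack reserve excluded */
} region_t;

/* Set up the region, keeping `stack_reserve` bytes at the end for the stack.
   `start` is assumed to be 4-byte aligned. */
void region_init(region_t *r, void *start, void *end, size_t stack_reserve)
{
    assert((size_t)((uint8_t *)end - (uint8_t *)start) >= stack_reserve);
    r->next = start;
    r->limit = (uint8_t *)end - stack_reserve;
}

/* Hand out `size` bytes, rounded up to a multiple of 4, or NULL if full. */
void *region_alloc(region_t *r, size_t size)
{
    size = (size + 3) & ~(size_t)3;
    if (size > (size_t)(r->limit - r->next))
        return NULL;
    void *p = r->next;
    r->next += size;
    return p;
}
```

There is no free operation on purpose: for a handful of large, long-lived buffers (chunk data, a second VRAM) a bump pointer is usually enough.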

The equivalent OS region starts at 0xa8000000. This is the very start of the RAM, accessed from P2 (not virtualized, no cache). You can also access it at 0x88000000, which is the same memory accessed from P1 (not virtualized, but cached). To date, I have not been able to observe performance differences between the two when accessing from the CPU, although the DMA is easier to use through P2. As with the first region, just get a pointer to there and enjoy. This address is different on the fx-CG 50 (0xac000000 or 0x8c000000), so if you use it, make sure not to hardcode it; it would be sad to lose compatibility with that model.
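The P1 and P2 addresses above are two views of the same physical memory, so converting between them is plain bit arithmetic. The sketch below assumes the usual SH-4 29-bit addressing mode; the function names and the `is_fxcg50` flag are hypothetical, and how you detect the model is left to your SDK.

```c
#include <assert.h>
#include <stdint.h>

#define P1_BASE   0x80000000u  /* not virtualized, cached */
#define P2_BASE   0xa0000000u  /* not virtualized, not cached */
#define PHYS_MASK 0x1fffffffu  /* P1 and P2 each map the 29-bit physical space */

/* Strip the P1/P2 prefix from an address in either area. */
uint32_t to_physical(uint32_t addr)
{
    assert(addr >= P1_BASE && addr < 0xc0000000u);  /* must lie in P1 or P2 */
    return addr & PHYS_MASK;
}

uint32_t to_p1(uint32_t addr) { return P1_BASE | to_physical(addr); }
uint32_t to_p2(uint32_t addr) { return P2_BASE | to_physical(addr); }

/* Start of RAM seen from P2, picked at run time instead of hardcoded.
   `is_fxcg50` stands for whatever model check your SDK provides. */
uint32_t os_ram_base_p2(int is_fxcg50)
{
    return is_fxcg50 ? 0xac000000u : 0xa8000000u;
}
```

For example, `to_p1(os_ram_base_p2(0))` gives 0x88000000, the cached view of the same RAM on the fx-CG 10/20.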

You can use the heap with the malloc() function call in libfxcg (if you are using PrizmSDK) or the malloc syscall. This is the only heap; in gint (a kernel/SDK that I develop) there is a feature to create heaps from arbitrary regions, maybe libfxcg has something similar that could help. Note that the original fx-CG 10/20 heap is considered not too reliable, but the fx-CG 50 is probably different because it supports MicroPython which requires a solid heap.

ILRAM is at 0xe5200000, XRAM is at 0xe5007000, YRAM is at 0xe5017000. More documentation on these kinds of areas is available on WikiPrizm and on the Planète Casio reference. All three are pretty fast and ILRAM can contain code.
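To keep these addresses and sizes in one place, a small table can help. This is only a sketch: the addresses are the ones above, the sizes are the 4 kB / 8 kB / 8 kB mentioned earlier, and `mem_region_t`, `onchip_find` and `onchip_offset` are illustrative names, not part of any SDK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const char *name;
    uint32_t base;   /* first address of the region */
    uint32_t size;   /* size in bytes */
} mem_region_t;

static const mem_region_t onchip[] = {
    { "ILRAM", 0xe5200000u, 4 * 1024 },
    { "XRAM",  0xe5007000u, 8 * 1024 },
    { "YRAM",  0xe5017000u, 8 * 1024 },
};

/* Return the on-chip region containing `addr`, or NULL if there is none. */
const mem_region_t *onchip_find(uint32_t addr)
{
    for (size_t i = 0; i < sizeof onchip / sizeof onchip[0]; i++) {
        /* Unsigned subtraction wraps for addr < base, so one compare suffices. */
        if (addr - onchip[i].base < onchip[i].size)
            return &onchip[i];
    }
    return NULL;
}

/* Offset of `addr` inside region `r`; `addr` must belong to `r`. */
uint32_t onchip_offset(const mem_region_t *r, uint32_t addr)
{
    assert(addr - r->base < r->size);
    return addr - r->base;
}
```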

The rest, I don't recommend using until you at least have a prototype and some performance results; it was reverse-engineered only a couple of months ago, and while it has some optimization potential and a little bit of storage it is very unusual and limiting. If you still want to check it out, there is documentation here.

Of course, the storage memory can be used with Bfile as usual.

Good luck! ^^
https://prizm.cemetech.net/index.php?title=Non-blocking_DMA

What does it mean to be non-blocking? How do I take advantage of that?
Non-blocking means that the CPU can perform work while the DMA is doing the transfer.

The way the display interfaces with the calculator is by exposing a handful of registers (variables) in memory. By writing into these registers, commands can be sent to the display. Sending image data is one of these commands, and transferring an image to the screen mostly consists in writing pixels over and over at the same address.

Now because of the sheer number of pixels on the screen (about 83,000 in libfxcg), doing this with the CPU is a complete waste of time and not fast enough. Enter the DMA, a giant memory-copying machine that can do the writes in place of the CPU. The DMA is equipped to copy memory faster than the CPU, and is basically the only viable option to send frames to the display in reasonable time (namely 11 ms at default overclock on the fx-CG 50, maybe 20 ms on the fx-CG 10/20; the fx-CG 50 has a higher default frequency). As you can see, this transfer delay can dominate your frame budget if you're not careful.

In principle the DMA operates independently from the CPU, therefore it is possible to have the CPU execute some code while the DMA is still sending data to the display. However, Bdisp_PutDisp_DD() by itself does not use that opportunity, and waits until the transfer finishes before returning.

What the "Non-blocking DMA" article shows is basically a re-implementation of a simple Bdisp_PutDisp_DD() in two parts: DoDMANonblockStrip() to start the transfer, and DmaWaitNext() to wait for the transfer to end. If you call the second immediately after the first, you get Bdisp_PutDisp_DD(). If you perform some work before calling DmaWaitNext(), then you've officially exploited hardware parallelism to increase your program's performance.

Now it is important to understand that the DMA is going to make constant accesses to RAM in order to read the pixels that need to be sent to the display. Even though you can in theory save 11 to 20 ms by working while the DMA transfers, only one of you can use the RAM every cycle. You will save 11 to 20 ms only if you don't use the RAM so that the DMA can proceed with the transfer. Every time you make a RAM access the DMA is delayed because it can't fetch pixels from memory during that cycle. Therefore, if you start the DMA and immediately start writing tons of pixels to another VRAM to prepare the next frame, you're not saving anything. To properly use the parallelism, it is better to do heavy computations during that time.

Also you can't ever call Bdisp_PutDisp_DD() while the transfer is running, or start two transfers at once, obviously. ^^
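Here is a rough sketch of that start / work / wait ordering, with a guard against overlapping transfers. Everything in it is a stand-in that runs on a PC: `start_transfer()` and `wait_transfer()` only track state where the real program would call the article's start and wait functions, and `heavy_computation()` and `run_frame()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the hardware: these bodies only record state. */
static bool transfer_running = false;
static int frames_sent = 0;

void start_transfer(void)
{
    assert(!transfer_running);   /* never start two transfers at once */
    transfer_running = true;
}

void wait_transfer(void)
{
    if (transfer_running) {
        transfer_running = false;
        frames_sent++;
    }
}

/* Hypothetical workload: pure arithmetic, intended to stay in registers
   so that it competes as little as possible with the DMA for the RAM. */
unsigned heavy_computation(unsigned seed)
{
    unsigned acc = seed;
    for (unsigned i = 0; i < 1000; i++)
        acc = acc * 31u + i;
    return acc;
}

/* One frame: kick off the transfer, compute, then wait for what is left. */
unsigned run_frame(unsigned state)
{
    start_transfer();
    state = heavy_computation(state);
    wait_transfer();
    return state;
}
```

Calling `wait_transfer()` right after `start_transfer()` degenerates into the blocking behavior; the gain comes entirely from what you manage to fit between the two.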

I apologize if I'm explaining things you already know, I figure it's simpler to go for the detailed story right away and avoid some back-and-forth. I hope it helps. o/
Lephe, your answers are incredible! So good that I updated the non blocking article to link to what you wrote.

That's a really good point about how access to RAM is shared. It's something that I understand now but didn't really realize when I was writing programs for the Prizm.
Thank you! I'm happy if it can help you.

While we're on the topic of bus sharing, note that the DMA has two modes: "burst mode" and "cycle-steal mode". From what I understand, in burst mode the DMA reserves the bus and any other access is delayed until the transfer is finished. In cycle-steal mode, the DMA waits for cycles where no one uses the bus to fetch its data. Since in general whatever you do during the transfer will have some RAM access to the stack, global variables, or other small things, burst mode is not an efficient option. Therefore cycle-steal mode is selected (bit 0x20 clear in CHCR).
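As a small illustration of that mode bit, here is how the selection could look on a CHCR value held in a variable. Only the 0x20 bit and its meaning come from the paragraph above; the macro and function names are made up, and the other CHCR fields are left untouched because they are not described here.

```c
#include <assert.h>
#include <stdint.h>

#define CHCR_BURST_BIT 0x20u   /* set = burst mode, clear = cycle-steal mode */

int chcr_is_cycle_steal(uint32_t chcr)
{
    return (chcr & CHCR_BURST_BIT) == 0;
}

/* Return `chcr` with cycle-steal mode selected, other bits unchanged. */
uint32_t chcr_select_cycle_steal(uint32_t chcr)
{
    uint32_t out = chcr & ~CHCR_BURST_BIT;
    assert(chcr_is_cycle_steal(out));   /* postcondition */
    return out;
}

/* Return `chcr` with burst mode selected, other bits unchanged. */
uint32_t chcr_select_burst(uint32_t chcr)
{
    return chcr | CHCR_BURST_BIT;
}
```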

Since the VRAM is often the bottleneck in simple applications, I figure some numbers could help. The following are for the fx-CG 50 at default overclock (117 MHz): sending to display takes 11 ms, clearing the VRAM by CPU takes 6.1 ms, clearing it with the DMA (you can do that!) takes 2.5 ms, and copying an image from ROM to VRAM takes time roughly proportional to how much data is read and written: about 25 ms for a 16-bit image, 20 ms for an 8-bit palette, 15 ms for a 4-bit palette (output is always 16-bit in the VRAM). The Prizm runs at 59 MHz by default and is traditionally overclocked to 94 MHz. See this topic for some more details and caution.

Given all the 3D work you have to do, video management will likely not be the bottleneck for you, but it cannot hurt to have some rough ideas. I believe it would be possible to have a program run at ~10 FPS if the math works out.
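To put a number on "if the math works out", here is a back-of-the-envelope budget using the fx-CG 50 timings quoted above. It assumes the video steps run one after the other with nothing overlapped, which is the pessimistic case; the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Timings from the text, fx-CG 50 at 117 MHz, in microseconds. */
#define US_SEND_TO_DISPLAY 11000u
#define US_CLEAR_VRAM_CPU   6100u
#define US_CLEAR_VRAM_DMA   2500u

/* Microseconds available per frame at a target frame rate. */
uint32_t frame_budget_us(uint32_t fps)
{
    assert(fps > 0);
    return 1000000u / fps;
}

/* Time left for game logic and rendering once fixed video costs are paid. */
uint32_t remaining_us(uint32_t fps, uint32_t fixed_cost_us)
{
    uint32_t budget = frame_budget_us(fps);
    return fixed_cost_us >= budget ? 0 : budget - fixed_cost_us;
}
```

At 10 FPS the budget is 100 ms per frame; a DMA clear plus a display transfer takes 13.5 ms of that, which leaves 86.5 ms for the actual 3D work.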
The ROM to RAM copy is a lot slower than I had thought. The 8-bit palette seems like a good idea. Oftentimes an 8-bit image looks close to the true color version, especially at the small size.
Yeah lately I have been thinking that the ROM and RAM themselves are pretty slow. I'm reaching trade secrets here (:p), but a DSP copy from XRAM to YRAM is 40 times faster than an optimized 32-bit memcpy in RAM and still 15 times faster than a DMA version. This really makes you think differently of the DSP. I haven't tested the extra wild SPU memory yet but I think you can see how incredible 434 kB of fast to super-duper-fast memory would be.

Edit: Also sorry I confused the two of you in my previous message, I'm too avatar-driven lol.
Wow! It seems like you are discovering some amazing things. 434 kB of very fast memory would be tremendous.

When you say DSP copy, does that just refer to the fact that those two memory areas are linked to a DSP, or do you mean you are actually using the DSP (is there a DSP?) to do the copy?
The processor on the SH7305 is an SH4AL-DSP. It comes with an "integrated" DSP. By integrated I mean it shares its cycles with the CPU so it's more of an instruction set extension, it doesn't operate in parallel with the CPU. The DSP is the intended user of the XRAM and YRAM.

The DSP copy is the copy with the DSP instructions from XRAM to YRAM or the other way. The DSP instructions can perform one XRAM access, one YRAM access and one computation in parallel in a single cycle (although bear in mind that the CPU can execute up to 2 computation instructions in a cycle, if the pipeline is tight enough). Yes the XRAM and YRAM are explicitly linked to the DSP, that's why everything there is so fast.

To clear confusion about the DSP right away: the SH7724 also has a Sound Processing Unit (SPU2) that has two truly-parallel DSPs and 434 kB of weird memory. This memory is called PRAM, XRAM and YRAM in the context of the SPU, and as you can see the PRAM (Program Memory) shows that it really runs in parallel. Because there are two DSPs the areas are split into PRAM0/1, XRAM0/1, YRAM0/1, and I use these names to avoid confusion with the integrated XRAM and YRAM (which are much smaller, 8 kB each). PRAM0/1 is pretty standard but has 32-bit access only, XRAM0/1 and YRAM0/1 are similar but only 24 out of every addressable 32 bits are actual memory, because the native data type of these DSPs is 24-bit fixed point.
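To give an idea of what "24 out of every 32 bits" means for ordinary byte data, here is a sketch that stores 3 bytes per longword using only 32-bit accesses. Which byte of each longword is the hole is NOT stated above: the code assumes the low 24 bits are backed by memory, purely for illustration, and all function names are hypothetical. On a PC the destination is just a normal array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: the low 24 bits of each longword hold data, the top 8 are the hole. */
uint32_t spu_pack(uint8_t b0, uint8_t b1, uint8_t b2)
{
    return ((uint32_t)b0 << 16) | ((uint32_t)b1 << 8) | (uint32_t)b2;
}

/* Number of longwords needed to hold `len` bytes at 3 bytes per longword. */
size_t spu_words_needed(size_t len)
{
    return (len + 2) / 3;
}

/* Store `len` bytes with one longword write per 3 bytes; returns words written. */
size_t spu_store(volatile uint32_t *dst, const uint8_t *src, size_t len)
{
    size_t words = 0;
    for (size_t i = 0; i < len; i += 3, words++) {
        uint8_t b1 = (i + 1 < len) ? src[i + 1] : 0;
        uint8_t b2 = (i + 2 < len) ? src[i + 2] : 0;
        dst[words] = spu_pack(src[i], b1, b2);
    }
    assert(words == spu_words_needed(len));
    return words;
}

/* Read `len` bytes back, ignoring whatever the hole byte contains. */
void spu_load(const volatile uint32_t *src, uint8_t *dst, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        uint32_t word = src[i / 3] & 0x00ffffffu;
        dst[i] = (uint8_t)(word >> (16 - 8 * (i % 3)));
    }
}
```

The repacking cost on every access is a big part of why this memory is a last resort for general data.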

The SH7305 has the SPU memory, and there should be a DMA controller dedicated to that memory (that includes helpers to work around the holes in XRAM0/1 and YRAM0/1), however I have found no evidence of the DSPs themselves and because we probably can't use them blindly I tend to think of the SPU as just additional RAM.
I realize I forgot to mention that the time to copy an image was measured with a full-screen image, i.e. 396x224 (384x216 isn't much of a difference, only 7%). That means about 170 kB of data written and the same amount of data read (half in 8-bit mode, a quarter in 4-bit mode). So yeah, while it's slow, it's generally not something that happens too often. The ROM reads were virtualized too, because determining the physical address of files in ROM is currently still shaky in all situations I know of.
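The sizes involved can be checked with a few lines of arithmetic. This sketch only counts pixel data (a palette, when there is one, is not included), and `image_bytes` is an illustrative name.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of pixel data in a w x h image at `bpp` bits per pixel (4, 8 or 16). */
uint32_t image_bytes(uint32_t w, uint32_t h, uint32_t bpp)
{
    assert(bpp == 4 || bpp == 8 || bpp == 16);
    return w * h * bpp / 8;
}
```

For the 396x224 case this gives 177408 bytes written to VRAM at 16 bits per pixel, against 177408, 88704 or 44352 bytes read from ROM for 16-bit, 8-bit and 4-bit sources. A 384x216 frame is 165888 bytes, about 7% less.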
  