I've been browsing the SH7724 hardware manual with an eye towards possible ways to accelerate games, and I've found at least a few on-chip peripherals that could be useful (provided they're present in our CPU). Some notes on what I've found (and a few miscellaneous other thoughts):

DMAC - not a lot to say, it's a DMA controller. Can be useful for large copy operations, or could be abused to implement something like memset(), which would be handy for clearing the screen. Some care would be needed to ensure that only regions that have already been erased are modified by the CPU, but that's fairly easy just by polling the control registers.
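A minimal sketch of the bookkeeping behind that polling idea, written as plain arithmetic so it can be tried off-device. It assumes the channel's transfer-count register counts down the remaining transfer units and that the destination address ascends; the function names and the 32-byte unit are illustrative, not taken from the manual or any SDK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes the DMA fill has already written. `remaining_units` stands in for
 * a read of the channel's transfer-count register (assumption: it counts
 * down the units still to be transferred). */
size_t dma_bytes_done(size_t total_bytes, uint32_t remaining_units,
                      size_t unit_bytes)
{
    size_t remaining = (size_t)remaining_units * unit_bytes;
    assert(remaining <= total_bytes);
    return total_bytes - remaining;
}

/* Nonzero when [offset, offset + len) lies entirely inside the part of the
 * buffer the fill has already passed (assumption: ascending destination),
 * so the CPU may draw there without the DMA overwriting it later. */
int region_is_cleared(size_t offset, size_t len, size_t bytes_done)
{
    return len <= bytes_done && offset <= bytes_done - len;
}
```

On the calculator, the drawing code would re-read the real register before touching each region and spin while `region_is_cleared` returns zero.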

Bdisp_PutDisp_DD - I've been assuming this function uses DMA, but haven't really seen any proof either way. Certainly it's possible, but it may actually eat CPU time while copying to the display. This could be tested quite easily by calling the syscall in a tight loop and checking the wall time to see what sort of framerate can be achieved.
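A rough sketch of such a test harness. The blit and the clock are passed in as function pointers so the same code can be tried on a PC; on the calculator, `blit` would wrap Bdisp_PutDisp_DD and `now` would wrap a tick counter (I believe libfxcg's RTC_GetTicks counts in 1/128 s, but treat that as an assumption). `fake_blit` and `fake_now` are made-up stand-ins, not real APIs.

```c
#include <assert.h>

/* Calls `blit` `frames` times and returns frames per second, using `now`
 * as a tick source that advances `ticks_per_second` times per second.
 * Returns 0.0 if the clock did not advance. */
double measure_fps(void (*blit)(void), int (*now)(void),
                   int frames, int ticks_per_second)
{
    assert(frames > 0 && ticks_per_second > 0);
    int start = now();
    for (int i = 0; i < frames; i++)
        blit();
    int elapsed = now() - start;
    if (elapsed <= 0)
        return 0.0;
    return (double)frames * ticks_per_second / elapsed;
}

/* Host-side stand-ins (hypothetical): each fake blit costs 4 ticks. */
static int fake_ticks;
void fake_blit(void) { fake_ticks += 4; }
int fake_now(void) { return fake_ticks; }
```

Running the same loop with some dummy computation added between blits would show whether the syscall blocks: if the framerate barely drops, the copy is happening in the background.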

If it does indeed block while copying, that's an optimization opportunity, since the CPU could be doing useful operations while blitting VRAM to the screen. The only important thing to keep in mind would be that VRAM cannot be modified until the blit is complete, otherwise you'll see partial updates. This could greatly improve performance in programs that do heavy computation every frame.

If the syscall doesn't block while copying, that probably means it's internally double-buffered. It's less useful when optimizing for speed, but the secondary buffer could be used as additional RAM space if the double buffering isn't needed.

IL memory - 16k of on-chip memory (base address 0xE5200000), so it's very fast. AHelper confirmed some time ago that it's available for use in our own programs via a bit of experimentation. It's a rather small amount of memory, but still large enough to hold useful data. You might want to put a performance-critical data structure in this memory, or just copy (decompress?) your graphics data here for slightly faster blitting.

2D-DMAC - this peripheral would be perfect for blitting sprites to VRAM.

Simply set a few pointers and size parameters, and it can blit from a spritesheet onto your display buffer without eating CPU time. It can also do inline rotation, inversion, and color conversion, which could be useful for certain programs. This is wonderful for spriting if it's actually in our CPU. Point into the sprite sheet and it does the necessary scattering.

The performance benefits may not be useful for small sprites, though. Would require experimentation to see if it's worth the additional code complexity to use this module.

MERAM - 128k of on-chip memory (base address 0xE8080000), so the basic use case is like IL memory, but it's probably a little bit slower. More interestingly, it can function as a cache for several of the more specialized peripherals. Of those which can connect to the MERAM, though, only one appears to be of use to us on this hardware:

BEU - this module composites up to three different images into a single display. Most interestingly to me, it can do full alpha blending. The obvious option here is to display a HUD in games, but it may be possible to use this module just for blitting sprites with alpha channels (similar to the 2D-DMAC). Another compelling option is to split your game's graphics into fore- and background layers to avoid redrawing the entire display every frame (assuming the background may update less often).

It also appears to be capable of decoding palettes, which would simplify (and probably accelerate) the use of lower-bit-depth images (in order to save memory). I could see some programs rendering to VRAM in a palettized color space, allowing the system VRAM buffer to actually contain two reduced-depth buffers.
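For reference, this is roughly what the palette decode would look like if done in software on the CPU, which is the work the BEU might take over. It assumes the Prizm's 384x216 display with 16-bit pixels (which matches the 165888-byte VRAM size) and an 8-bit indexed format; `expand_palette` is an illustrative name, not an existing routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LCD_W = 384, LCD_H = 216 };  /* display size in pixels (assumed) */

/* Two 8-bit indexed frames take exactly the space of one 16-bit frame. */
_Static_assert((size_t)2 * LCD_W * LCD_H == sizeof(uint16_t) * LCD_W * LCD_H,
               "two 8bpp buffers fit in one 16bpp VRAM");

/* Expands `count` palette indices from `src` into 16-bit colors in `dst`.
 * The buffers must not overlap. */
void expand_palette(uint16_t *dst, const uint8_t *src,
                    const uint16_t palette[256], size_t count)
{
    assert(dst != NULL && src != NULL && palette != NULL);
    for (size_t i = 0; i < count; i++)
        dst[i] = palette[src[i]];
}
```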

In conjunction with MERAM, it may be possible to do efficient composition and buffering for screen rendering. The MERAM has a frame buffer mode which may be useful in its own right, but it can also take input from the BEU. By combining a screen blit operation with the BEU into MERAM with a normal screen blit (probably via DMA), we could achieve very good performance in fairly complex scenes (particularly when alpha blending is involved).

Concluding
There's a lot of speculation here. We know the Prizm CPU has functional IL memory, so that's something easy to use. I assume DMA is functional, but haven't seen any confirmation either way of that. The remaining modules may or may not be present; I have no idea which.

If somebody wants to do some experimentation to try to determine what's available, that would be awesome. As I continue work on pLemmings, I'll probably be attempting to use some of these techniques (if the hardware is available), so there's a good chance of libfxcg support.

So. Thoughts? Anybody want to test whether these things are possible?
The latest version of INSIGHT has an item called LCDDMA which, when run, shows color stripes on the screen. I don't know if that's related to DMA or not.
EDIT: http://ourl.ca/8207;msg=260049

BTW: I'd also like to know about some piece of executable memory that both survives shutdown and isn't used by the OS, so one could copy a timer handler to it, then install the timer on slot 3 and let the handler run outside of an add-in (useful for showing a clock at every point in the OS, for example).
Sounds like VRAM is indeed double-buffered by the syscalls that copy to the display, then. For posterity, here's the linked post.
SimonLothar wrote:
It seems to be not too simple to write to the Prizm's LCD-driver-interface 0xB4000000.

First the LCD-driver-register has to be selected.

If the OS issues an LCD-driver-register-select (syscall 0x01a2), it first clears bit 4 of MPU-port 0xA405013C (it looks as if this bit controls the LCD-driver's RS-bit; refer to the R61509-manual).
Then the SH-4A instruction SYNCO is performed.
Then the register number is written to 0xB4000000.
Then the SH-4A instruction SYNCO is performed.
Then bit 4 of port 0xA405013C is set again.
Then the SH-4A instruction SYNCO is performed.

After a LCD-driver-register has been selected, it can be written to (by writing to 0xB4000000 again)

Every time register 0xB4000000 as well as port 0xA405013C are written to, the SH-4A-specific instruction SYNCO is performed immediately.
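The sequence described above can be sketched as follows. This is an illustration rather than the OS routine: the register widths (8-bit port, 16-bit LCD interface) are assumptions, the function names are made up, and the pointers are passed in so the logic can be exercised off-device. `__SH4A__` is, as far as I know, what GCC defines for SH-4A targets; elsewhere SYNCO becomes a no-op.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SH4A__)
#define SYNCO() __asm__ volatile ("synco" ::: "memory")
#else
#define SYNCO() ((void)0)  /* no-op when built off-device */
#endif

#define LCD_RS_BIT (1u << 4)  /* bit 4 of the port, per the description */

/* Selects an LCD driver register: clear RS, write the register number,
 * set RS again, with SYNCO after every access. */
void lcd_select_register(volatile uint8_t *port, volatile uint16_t *lcd,
                         uint16_t reg)
{
    assert(port != NULL && lcd != NULL);
    *port &= (uint8_t)~LCD_RS_BIT;
    SYNCO();
    *lcd = reg;
    SYNCO();
    *port |= LCD_RS_BIT;
    SYNCO();
}

/* Writes a value to the currently selected register. */
void lcd_write_data(volatile uint16_t *lcd, uint16_t value)
{
    assert(lcd != NULL);
    *lcd = value;
    SYNCO();
}
```

On the calculator the two pointers would be `(volatile uint8_t *)0xA405013C` and `(volatile uint16_t *)0xB4000000`.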

It is doubtful that directly writing to the LCD-driver enhances display-performance.
I observed that writing to VRAM and moving the RAM to the LCD by DMA (syscall 0x025F/syscall 0x0260) is significantly faster than direct LCD-driver access.
Syscalls 0x25F and 0x260 are Bdisp_PutDisp_DD and Bdisp_PutDisp_DD_Stripe, respectively.

I'd still like to see numbers on exactly what the maximum framerate possible with that syscall is, and how much computation can be done while waiting for the operation to complete. I assume it simply stalls until the current DMA transfer is complete if there's one in progress. The performance boost may simply be because the DMA controller is more efficient in its use of memory bandwidth than the CPU is.

If VRAM is indeed double-buffered by these syscalls, reading the DMA control registers after doing such a copy operation should tell us where in memory that actually is (and allow us to use it for our own purposes).

Something I forgot about while typing this up earlier: the UBC. Provides on-device debugging functionality. Haven't read much into it, but may be handy for reverse-engineering or simply debugging our own software if it doesn't require that the CPU be in supervisor mode to configure it.
I would also like to ask about memory areas that can be used by memory-hungry applications (like cgdoom or the JPEG-based cgplayer).
Current memory map, as far as I know:

0x880A2AD5..0x880CB2D5: SaveVRAMBuffer (165888 bytes, unaligned address)
0x880F0000..0x8815FFFF: system stack (512 kB); it seems to be dangerous to use the beginning and end of this area, see the sources of cgdoom and cgplayer for details
0x88160000..0x881DFFFF: add-in stack (512 kB); it is better to use it in the standard way (as stack and static data)
0x881E0000..0x881FFFFF: heap. (128 kB)

Copied from original post:
0xE5200000 IL memory - 16k of on-chip memory
0xE8080000 MERAM - 128k of on-chip memory
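A quick way to sanity-check a map like this is to compute each size from its address range. A small sketch (the helper names are made up):

```c
#include <assert.h>
#include <stdint.h>

/* Size of a range whose second address is the last byte (inclusive). */
uint32_t span_inclusive(uint32_t first, uint32_t last)
{
    assert(last >= first);
    return last - first + 1;
}

/* Size of a range whose second address is one past the last byte. */
uint32_t span_exclusive(uint32_t begin, uint32_t end)
{
    assert(end >= begin);
    return end - begin;
}
```

With the addresses as listed, the SaveVRAMBuffer range only comes out to 165888 bytes if its end address is exclusive, the add-in stack and heap come out to 512 kB and 128 kB as stated, and the system stack range spans 0x70000 bytes (448 kB) rather than 512 kB, so either one of its addresses or its size is presumably off.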
MPoupe wrote:
I would also like to ask about memory areas that can be used by memory-hungry applications (like cgdoom or the JPEG-based cgplayer).
Current memory map, as far as I know:

0x880A2AD5..0x880CB2D5: SaveVRAMBuffer (165888 bytes, unaligned address)
0x880F0000..0x8815FFFF: system stack (512 kB); it seems to be dangerous to use the beginning and end of this area, see the sources of cgdoom and cgplayer for details
0x88160000..0x881DFFFF: add-in stack (512 kB); it is better to use it in the standard way (as stack and static data)
0x881E0000..0x881FFFFF: heap. (128 kB)

Copied from original post:
0xE5200000 IL memory - 16k of on-chip memory
0xE8080000 MERAM - 128k of on-chip memory


I'd like to add (as seen in Simon's docs):
0xA8000000..0xA80287FF: VRAM (165888 bytes)
0xA80D35D0..0xA80E3000: Main memory (64KB)
0xAB800000..0xABD0FFFF: Main memory backup area (64KB)

All summed up (except IL and MERAM), it's about 1612 KB out of the 2MB RAM chip...

EDIT: I see you're using the 0x80000000 addresses, which are cacheable. In that case my post becomes:
0x88000000..0x880287FF: VRAM
0x880D35D0..0x880E3000: Main memory
0x8B800000..0x8BD0FFFF: Main memory backup area
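The conversion used in that edit is just a single address bit: on SH-4, the cacheable P1 area (0x80000000..0x9FFFFFFF) and the uncached P2 area (0xA0000000..0xBFFFFFFF) map the same physical memory. A small sketch, with illustrative function names:

```c
#include <assert.h>
#include <stdint.h>

/* Converts an uncached P2 address to its cacheable P1 alias. */
uint32_t p2_to_p1(uint32_t addr)
{
    assert((addr >> 29) == 5);  /* must lie in 0xA0000000..0xBFFFFFFF */
    return addr & ~UINT32_C(0x20000000);
}

/* Converts a cacheable P1 address to its uncached P2 alias. */
uint32_t p1_to_p2(uint32_t addr)
{
    assert((addr >> 29) == 4);  /* must lie in 0x80000000..0x9FFFFFFF */
    return addr | UINT32_C(0x20000000);
}
```

Keep in mind that data written through a P1 address may still be sitting in the cache when something reads the same memory through P2 (or by DMA), so mixing the two views needs care.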
mpoupe wrote:
that can be used by memory-hungry applications (like cgdoom or the JPEG-based cgplayer).

As far as JPEGs go, the datasheet describes a JPU, JPEG processing unit. It's capable of both encoding and decoding images up to 4096 pixels square; described in chapter 37 of the manual. It has connections to the MERAM unit, so you could conceivably stream uncompressed data into MERAM and do scaling on the fly from there. Once again, I'm not confident that this peripheral is actually available on our chip, so tests are necessary.
Tari wrote:
mpoupe wrote:
that can be used by memory-hungry applications (like cgdoom or the JPEG-based cgplayer).

As far as JPEGs go, the datasheet describes a JPU, JPEG processing unit. It's capable of both encoding and decoding images up to 4096 pixels square; described in chapter 37 of the manual. It has connections to the MERAM unit, so you could conceivably stream uncompressed data into MERAM and do scaling on the fly from there. Once again, I'm not confident that this peripheral is actually available on our chip, so tests are necessary.


If the Prizm really has a JPU, I guess the work I've been doing over the past months trying to speed up the picojpeg library is pretty useless.
At the same time I feel that it's unlikely that the Prizm has a JPU, because if that was the case the g3p picture format would use JPEG compression techniques instead of Zlib (or is Casio really so dumb?).
Just remember: You are looking at the datasheet for the SH7724, not the SH7305. The CPUs don't match hardware peripherals all of the time. Most likely Casio removed all non-essential hardware. It is unlikely that the Prizm has a JPU because why bother paying $$$ for hardware when you can use FOSS? The same goes with the FPU. Why spend money if you can use something like GMP?
Tari wrote:

2D-DMAC - this peripheral would be perfect for blitting sprites to VRAM.

Simply set a few pointers and size parameters, and it can blit from a spritesheet onto your display buffer without eating CPU time. It can also do inline rotation, inversion, and color conversion, which could be useful for certain programs. This is wonderful for spriting if it's actually in our CPU. Point into the sprite sheet and it does the necessary scattering.

The performance benefits may not be useful for small sprites, though. Would require experimentation to see if it's worth the additional code complexity to use this module.


Wow, if this syscall could be implemented in LuaZM, that would be awesome. Especially the rotation part.
AHelper wrote:
Just remember: You are looking at the datasheet for the SH7724, not the SH7305. The CPUs don't match hardware peripherals all of the time. Most likely Casio removed all non-essential hardware. It is unlikely that the Prizm has a JPU because why bother paying $$$ for hardware when you can use FOSS? The same goes with the FPU. Why spend money if you can use something like GMP?
On the other hand, it might be easier to just leave some of the IPs in rather than having to go through the trouble of taking them out, to play devil's advocate. Tari, this is a great piece of documentation, and I hope it will soon either be linked from the wiki or copied there.
KermMartian wrote:
AHelper wrote:
Just remember: You are looking at the datasheet for the SH7724, not the SH7305. The CPUs don't match hardware peripherals all of the time. Most likely Casio removed all non-essential hardware. It is unlikely that the Prizm has a JPU because why bother paying $$$ for hardware when you can use FOSS? The same goes with the FPU. Why spend money if you can use something like GMP?
On the other hand, it might be easier to just leave some of the IPs in rather than having to go through the trouble of taking them out, to play devil's advocate. Tari, this is a great piece of documentation, and I hope it will soon either be linked from the wiki or copied there.
Note that it is likely that they created the core and just chose the peripherals to use. (Either you buy a pre-made CPU or you choose the core and peripherals and such).

Also, how would one execute instructions outside of ROM? I was trying to execute from RAM and recall that this was an issue before. Does anyone know if this issue was resolved?
I don't want to add anything to the wiki until I know the hardware is available. I agree with AHelper that there's a good chance these things aren't present, but they should be tested in any case. I'd try to do so myself, but I won't even be in a position to do anything with a real Prizm for about another two weeks (and even then, I have a lot of other projects that want my time).

AHelper wrote:
Also, how would one execute instructions outside of ROM? I was trying to execute from RAM and recall that this was an issue before. Does anyone know if this issue was resolved?
I don't know of any technical reason it wouldn't be possible, so I'd have to see your code to get any ideas.
INSIGHT runs from RAM if I remember correctly.
gbl08ma wrote:
All summed up (except IL and MERAM), it's about 1612 KB out of the 2MB RAM chip...
And what about the rest? 436 KB is a huge buffer (in our context).
I hope there are safe areas to be used.
gbl08ma wrote:
0xA8000000..0xA80287FF: VRAM (165888 bytes)
This is a nice buffer, but direct DD access (by syscalls) is slow.
I have to catch up with the latest news here; I tried direct LCD access in CGDoom some time ago, but it didn't work for me.
Do we have a working demo project for it?
I mean the routines that define a window on the LCD and set pixels.
gbl08ma wrote:
0xA80D35D0..0xA80E3000: Main memory (64KB)
0xAB800000..0xABD0FFFF: Main memory backup area (64KB)
Overwriting this means that the application must reboot the calculator when it finishes, am I correct?
MPoupe wrote:

gbl08ma wrote:
0xA80D35D0..0xA80E3000: Main memory (64KB)
0xAB800000..0xABD0FFFF: Main memory backup area (64KB)
Overwriting this means that the application must reboot the calculator when it finishes, am I correct?


It must reboot, and I bet it will say "MAIN MEMORIES\n CLEARED!". I once set up a timer that called the Test Mode, then let the timer keep running so that it opened more Test Modes on top of each other. That eventually caused a system error: after enough Test Modes have been opened they no longer fit in memory, so they overlap other areas. I also found out that timers do not stop when there is a system error: the error message kept changing as the timer tried to call the Test Mode once again!
When I rebooted the calculator (and this was on the emulator), it cleared the main memory, because it had been overwritten with the Test Mode code.

EDIT: The Prizm also has some kind of task-switching in the eActivity strips (press Shift plus the arrow above AC/ON to switch between an open strip and the eAct document). I have yet to understand how it works very well, but it seems to use a huge buffer and one of the six MCS backup areas (each strip has its own MCS).
GCC features you should always use if you want the generated code to be as efficient as possible: -flto. If you can cross-compile, you can also profile (-fprofile-generate, -fprofile-use) which will probably yield bigger gains.
  1. Compile with -fprofile-generate
  2. Run the binary through typical usage to generate profile data
  3. Compile with -fprofile-use
Profiling allows it to optimize for real-world control flow patterns, so branches can be rearranged to perform better in the typical cases, hints to the CPU's branch predictor can be inserted, etc.
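For comparison, the kind of hint that profile data lets the compiler derive automatically can also be written by hand with GCC's `__builtin_expect`. A small sketch, where the function, the "transparent key color" scenario, and the file names in the comment are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Typical profile-guided build (GCC flags; file names are placeholders):
 *   gcc -O2 -flto -fprofile-generate game.c -o game   (then run it)
 *   gcc -O2 -flto -fprofile-use game.c -o game
 */
#if defined(__GNUC__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define UNLIKELY(x) (x)
#endif

/* Sums pixel values, skipping the (assumed rare) transparent key color.
 * The hint asks the compiler to lay out the common path as fall-through. */
uint32_t sum_opaque(const uint16_t *px, size_t n, uint16_t key)
{
    assert(px != NULL || n == 0);
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (UNLIKELY(px[i] == key))
            continue;
        sum += px[i];
    }
    return sum;
}
```

Hand-written hints are only as good as the guess behind them, whereas a profile reflects what the test runs actually did.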
Out of curiosity, how well do those hints and rearrangements carry over to non-SH3/4 code? For example, is it worthwhile to perform those profiling steps on critical code in a low-level (for example) networking driver/application, or is there a danger of the profiles being too specific to the test cases and doing more harm than good?

On a side-note, I noticed that this topic gets quite a few hits from Omnimaga.
I'd say it can be useful in all code. It comes down to being sure you have good test cases (fix them if they don't help performance..) and trusting the optimizer to emit correct code.
It'd be awesome if someone could dedicate some of his time to understand how to run code from the IL-RAM, and to find out if the MERAM, the JPU and the BEU are indeed present or not (IMO chances are not, but we'll never know for sure without testing).

Also, how do we know the Prizm only has 2 MB of RAM? Was it from Simon's docs, or from looking up the specs of some chip in the Prizm? If we're just guessing from memory addresses, then the TLB may be misleading us... another interesting thing one could work on would be seeing if there's a way to bypass the TLB and work with real addresses. Unfortunately I don't have enough knowledge or time to work on any of these points.
I'm bumping this. I know most likely no one has had the time or interest to have a closer look at the possibilities of the Prizm's CPU. But I lack the knowledge to do it. So this is staying here as a reminder that once someone has some free time and is interested in poking at Prizm hacking, this is yet to be done.

As for memory areas, Simon has some great information in his docs (it's just all a bit scattered around them), and I have experimented with some regions already. I may do a better compilation of all the memory areas and their descriptions soon.
  