As per Lephe's suggestion, moving this conversation to a new thread as we work out some new base memory routines and also investigate pipelining performance in various contexts.

Lephe, checking the Renesas SH7724 manual, as per page 97, all the mov opcodes in question should be able to stage nicely for pipelining. During the second 4 stages of the 8-stage pipeline (data fetch and writeback), instruction fetch and decode are free to run on the next mov instruction. As a result, these instructions should run at two per cycle.

So, assume that all 19 instructions in my 32-byte loop run fully parallel like this. At an effective 232 MHz due to this parallelism, that is 32 bytes per 19 instructions, or 232 × 32 / 19 ≈ 390 MB/s. The pipeline/cache slowdown for the other instructions knocks it back to 350. :)
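For reference, the estimate above can be reproduced with a few lines of C. This is only a sketch of the arithmetic: the helper name `loop_throughput_mbs` and its parameters are made up for illustration, and it assumes 1 MB = 10^6 bytes and a steady issue rate with no stalls.

```c
/* Estimated copy throughput in MB/s (1 MB = 10^6 bytes).
 * issue_rate_mhz: effective instruction issue rate, in millions per second
 * loop_bytes:     bytes copied per loop iteration
 * loop_insns:     instructions executed per loop iteration
 * Assumes every instruction issues without stalling (hypothetical best case). */
double loop_throughput_mbs(double issue_rate_mhz, int loop_bytes, int loop_insns)
{
    return issue_rate_mhz * (double)loop_bytes / (double)loop_insns;
}
```

Calling `loop_throughput_mbs(232.0, 32, 19)` gives roughly 390.7, which matches the 390 MB/s figure before the cache and pipeline penalties are taken into account.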

Do you mind posting your modified memcpy.S? I'd like to work out with Tari how I can get these integrated as part of this SDK. We can work together on other important routines if you want and get them into both PrizmSDK and gint :)
For other readers' reference, see the beginning of this discussion. The motivation is to provide an optimized memset implementation for libfxcg.
Thanks for creating the topic! Just to clarify for Tari, I use a custom unikernel (called gint) in place of libfxcg and the Prizm SDK, so I might not be able to contribute directly repository-wise. I also care about older SH3 machines. Other than that, no relevant differences. ^^

While I'm on the meta level, may I know which models you're using? In France the Prizm was not successful commercially, so essentially there is only the fx-CG 50 (or rather its French equivalent which is called the Graph 90+E). But I seem to understand that you still have Prizms around? Is this the case?

Quote:
Lephe, checking the Renesas SH7724 manual, as per page 97, all the mov opcodes in question should be able to stage nicely for pipelining.

You reminded me of something important: even though optimal parallelism is achieved when an instruction is executed every pipeline cycle (which is what I was somehow aiming for), this CPU is a 2-ILP architecture so you can't do anything better than one instruction every 3/4 pipeline cycles.

I thought a "cycle" was a pipeline stage, but actually it should be the full 7-stage pipeline. Once you account for 2-ILP, it follows that a nop should take half a cycle, which is what I observed after measuring Iϕ yesterday. It all makes sense now, thanks!

By the way, while we're breaking records:

• Using the DMA on the XRAM/YRAM is much slower than using the CPU, most likely because of the location of the XRAM/YRAM (close to the DSP and CPU, and probably further away from the SHway).
• However, using the parallel DSP instructions makes memset() and memcpy() both reach 450 MB/s. Essentially the full 8 kB can be set or copied in 18 µs by the DSP.
Here are the updated memcpy.s and memset.s.

You will need to replace this bit with something that sets T if and only if you're running on an SH4 machine (as opposed to an SH3 one).


Code:
   /* If unaligned but SH4, use movua.l */
   mov.l   .gint, r0
   mov.l   @r0, r0
   tst   #1, r0
   bt   .unaligned4

If you only care about SH4 machines, you can jump straight to .unaligned4 and remove the .aligned2 section which is used only on SH3 since unaligned longword moves are faster on SH4.
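As a rough illustration of the dispatch described above, here is a C sketch of how a copy path might be chosen. It is not taken from memcpy.s: the enum, the function name and the `is_sh4` flag are hypothetical, and the condition for the aligned path (both pointers sharing the same offset modulo 4) is an assumption.

```c
#include <stdint.h>

/* Hypothetical names for the code paths discussed above. */
enum copy_path {
    COPY_ALIGNED4,     /* pointers share the same offset modulo 4 */
    COPY_UNALIGNED4,   /* SH4 only: unaligned longword loads (movua.l) */
    COPY_SH3_FALLBACK  /* SH3: narrower moves, e.g. the .aligned2 section */
};

/* Pick a path from the relative alignment of the pointers and the CPU type.
 * is_sh4 stands in for whatever test sets T in the assembly version. */
enum copy_path choose_copy_path(const void *dst, const void *src, int is_sh4)
{
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;

    if (((d ^ s) & 3) == 0)
        return COPY_ALIGNED4;
    return is_sh4 ? COPY_UNALIGNED4 : COPY_SH3_FALLBACK;
}
```

On an SH4-only build, `is_sh4` is always true, so the fallback branch disappears, which mirrors removing the .aligned2 section.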
Sweet, I will make the necessary changes so this compiles correctly. Looking over the asm, it's very well put together. Thank you! And yes, libfxcg is only for the SH4 calcs. It supports the CG-10, CG-20, CG-50, and Graph 90+E.

Tari, I'm not entirely sure how to switch things around in the toolchain. Can you point me somewhere? I should be able to figure it out.
I realized I could improve the logic and more importantly the small-region case by using Duff's device. I will probably look at that when I get to time the smaller cases. This might come in handy in a few situations, including moving items around in slab allocators.
It would take a little creativity but you could do the switch with a PC offset using the braf Rm instruction! That would be snazzy. I think that would be easiest if you were willing to clobber more registers.
Yes that's exactly what I have in mind. The switch-formulation is more of a C artifact. It will also probably be faster to align at the start and might help with the end (currently because each level is a do/while loop you have to copy a few bytes at the end no matter the size).
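For readers who have not seen it, this is what the classic C form of Duff's device looks like for a byte copy. It is a generic sketch, not the routine from memcpy.s, and the name `duff_copy` is made up. The `switch` jumps into the middle of an 8x unrolled loop so the first pass handles the leftover bytes; in assembly the same jump would be a computed branch, such as the braf idea mentioned above.

```c
#include <stddef.h>

/* Byte-wise copy using Duff's device. The switch enters the unrolled loop
 * at the case matching n % 8, so the first pass copies the remainder and
 * every later pass copies a full 8 bytes. */
void *duff_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t passes;

    if (n == 0)
        return dst;

    passes = (n + 7) / 8;  /* number of trips through the do/while */
    switch (n % 8) {
    case 0: do { *d++ = *s++;
    case 7:      *d++ = *s++;
    case 6:      *d++ = *s++;
    case 5:      *d++ = *s++;
    case 4:      *d++ = *s++;
    case 3:      *d++ = *s++;
    case 2:      *d++ = *s++;
    case 1:      *d++ = *s++;
            } while (--passes > 0);
    }
    return dst;
}
```

The fallthrough between cases is intentional. A tuned version would move words or longwords once the pointers are aligned instead of single bytes.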
tswilliamson wrote:
Tari, I'm not entirely sure how to switch things around in the toolchain. Can you point me somewhere? I should be able to figure it out.
Create memcpy.S in /libfxcg/misc/ and remove the redefinition to sys_memcpy in fxcg.h.
Unfortunately, after looking at this for the past hour, this is WAY more of a rabbit hole than that.

* The define check around memcpy in fxcg.h is not used.
* There is no memcpy implementation currently in libc or libfxcg.
* It appears the memcpy implementation is inside of libgcc!
* Libgcc is necessary for other builtins, such as division and float simulation functions (though these appear to only be used in weird functions like strtod).
I spoke too soon! _memcpy is actually already defined as a symbol by a syscall, in memcpy.S. This may actually be a good way to set up malloc and free. Rather than relying on any aliasing, the symbols for those could be directly defined as the syscall implementations.

Libgcc appears to reference the symbol, not actually implement it.

Anyways, I have something to work with, sorry for the back and forth.
OK, it all works great! Tari, I confirmed that it fixed the bugs I was seeing from memcpy previously.

See this commit to PrizmSDK:

https://github.com/tswilliamson/PrizmSDK/commit/29a027029a290de8739879c1119f764d0caf9b43

Thanks for the code, Lephe :)
Neat; I've pushed a cherry pick of that to libfxcg.
Super glad it worked out! This was also an improvement on my side. There's not a lot of in-depth discussion like this on Planète Casio (unfortunately), so I'm looking forward to more :)
memset.S now incorporated as well, relevant changes are in this commit:

https://github.com/tswilliamson/PrizmSDK/commit/066a171454655f559ce7db897c5fd93c9625f8b4

Sorry for the messy commit with the other stuff; I was just updating the VS project I use for conveniently building libfxcg/libc.
  