So indeed the contiguous-access method you proposed is the way I was rearranging the pipeline. Doing this in the 32-byte loop shaves another microsecond off the XRAM/YRAM benchmark, at which point I think I will start using the 0.14 µs-level measure because it's getting really small. The take-away is: it's indeed faster, but only slightly.

To understand exactly the reason, I went back to the pipeline documentation. The mov instructions we're using have a 3-stage register fetch + memory access sequence that blocks the CPU's LS region, meaning that one can execute only every 3 cycles at best. If there is no data dependency, this should be the limit. When there is a data dependency, the WB stage that follows the 3-stage register + memory access sequence brings this limit to 4 cycles per instruction. So the improvement of freeing the pipeline cannot be more than 25%.

Now the only thing I still don't understand is how the absolute time can be so small. Eliminating every overhead source, the function absolutely needs 6 pipeline cycles for 4 bytes of data, and I measured only two Iϕ cycles for 4 bytes of data. Maybe my understanding that Iϕ is the frequency of the pipeline is incorrect; do you have any information on this aspect?

Quote:
Hmmm.. like I said there is a terrible little cache in this CPU, I think it is 32 bytes? So we might be "thrashing the cache" by moving back and forth. What happens if we trash more registers and keep access contiguous, something like this? Using whichever registers.. I just adapted the inline code sample I had:

Oh so I was unsure about this bit. The documentation of the SH7724 hints at such a ridiculous cache being used, but it seemed really wrong so I mostly ignored it. Do we have quantified evidence of this? I think I'll try some test programs because it's normally easy to find cache parameters by manipulating memory, and I'd like to be sure.
No, the only thing I have to go on is the SH7724 docs like you, and just experience with various assembly tweaking. There is an instruction that does appear to 'work', pref @Rn, but I haven't been able to get any meaningful performance out of using it. It may just be that the operand cache isn't enabled on this cpu but the instruction cache is.

In regards to clocking, the pipelined architecture should allow these specific instructions to overlap, no? That would get us faster than the absolute instruction clock rate.

I did find this particular memcpy implementation which, though I think it's a tad absurd for our usage, may be able to shed some light on how others have pipelined memcpy for speed.
Ok so let's try to work it out on our own I guess! Maybe we should open a new topic?

In any case, I've checked the µlibc implementation, which is wonderful in many aspects (and also too complicated for us). At line 662, which is the equivalent of your optimized loop, you can see that all instructions are of the LS style, so there are 2 cycles of latency between each, just as I had found in the documentation.

To challenge my assumption of Iϕ being the pipeline cycle frequency I ran a test program that performs 16 iterations of a 256-nop loop, with the assumption that the function call and 16 loop iterations will be negligible against the 4096 nop cycles. To cut to the chase, the TMU counted 18.9 µs but 4096 cycles of Iϕ would need 35.3 µs.

The conclusions are:
• The CPU likely executes pipeline cycles faster than Iϕ
• The 19 µs measurement for your version of memcpy() with my pipeline rearrangement on the XRAM/YRAM is not absurd

This means that this memcpy() is 20 times faster in the XRAM and YRAM than in normal RAM, which is very interesting. After a similar optimisation, memset (which is not any faster in normal RAM) reaches around 350 MB/s. The naive C byte read/write is much less clear-cut, but after checking, GCC did a horrible job, so I'm not surprised. (And I haven't timed DMA transfers nor parallel DSP instructions yet! xD)

Now all that's left is to hope that the SPU's DSP PRAM, XRAM and YRAM, which together make up almost 1M, have the same performance.
Tari, this is the error I get when trying to run clean on libfxcg using the for-loop approach you suggested:



Code:

C:\Projects\Prizm\PrizmSDK\libfxcg>..\bin\make.exe  clean
del "syscalls\APP_EACT_StatusIcon.o"   -del "syscalls\APP_FINANCE.o"   -del "syscalls\APP_LINK_transmit_select_dialog.o"   -del "syscalls\APP_MEMORY.o"   -del "syscalls\APP_Program.o"   -del "syscalls\APP_RUNMAT.o"   -del "syscalls\APP_SYSTEM.o"   -del "syscalls\APP_SYSTEM_BATTERY.o"   -del "syscalls\APP_SYSTEM_DISPLAY.o"   -del "syscalls\APP_SYSTEM_LANGUAGE.o"   -del "syscalls\APP_SYSTEM_POWER.o"   -del "syscalls\APP_SYSTEM_RESET.o"   -del "syscalls\APP_SYSTEM_VERSION.o"   -del "syscalls\AUX_DisplayErrorMessage.o"   -del "syscalls\Alpha_GetData.o"   -del "syscalls\Alpha_SetData.o"   -del "syscalls\App_InitDlgDescriptor.o"   -del "syscalls\App_LINK_GetDeviceInfo.o"   -del "syscalls\App_LINK_GetReceiveTimeout_ms.o"   -del "syscalls\App_LINK_Send_ST9_Packet.o"   -del "syscalls\App_LINK_SetReceiveTimeout_ms.o"   -del "syscalls\App_LINK_SetRemoteBaud.o"   -del "syscalls\App_LINK_Transmit.o"   -del "syscalls\App_LINK_TransmitInit.o"   -del "syscalls\App_Optimize.o"   -del "syscalls\BCDtoInternal.o"   -del "syscalls\BatteryIcon.o"   -del "syscalls\Bdisp_AllClr_VRAM.o"   -del "syscalls\Bdisp_AllCr_VRAM.o"


This is using my -del version, but the same thing happens with the if exist version you wrote :-/
I found that this works for me for $(call rm,$(OBJECTS)):


Code:

# Windows: del and rmdir need backslash-separated paths, so convert each argument.
ifeq ($(OS),Windows_NT)
define rm
  del /S /Q $(foreach deleteFile,$(1),$(subst /,\,$(deleteFile)))
endef
define rmdir
  rmdir /S /Q $(foreach deleteDir,$(1),$(subst /,\,$(deleteDir)))
endef
endif


While this doesn't work with quotes and/or spaces in file names, pretty much all of my makefiles are path relative with no spaces, so maybe this works best?
Yeah, this worked for me in every context. In my version at least, I have set up a base_rules that all the makefiles now derive from, including the libfxcg and libc makefiles. These rules include the MACHDEP define, toolchain setup, and these platform defines for deletion, so everything can be uniform.
Cool, thanks. I misunderstood what foreach would do, but you got it sorted. I've pushed a change that does what you provided.
  