That sounds great Lephe! Yeah, I think it is mostly just a very small community already, so it would be good to have everyone working together.

Tari, this commit, specifically the printc change, fixes a significant bug in the SDK

https://github.com/tswilliamson/PrizmSDK/commit/c81d88a7aaef64cef91014f905818654f82633a5#diff-08bd32f030b7be764fa342f18d5431cd
Cool, I've upstreamed those changes in 9d2a4b681da4077971e87143d5d843fce9ed2ae3 and ea23b9abf91017f2a2c2d07ba84e420cd95eddc9.
I'm considering how I want to move forward with my SDK tweaking. Obviously I've made a lot of changes. I have your version working over here and am now getting good results with the latest. NESizm is 50 kB lighter (from 300 kB to 250 kB) and runs about 5% faster! I still have some wrinkles to smooth out on my end: for some reason clean is not working, and a rebuild without any changed files tells me there is no action for the first .o file.

If I fork right now, it'll be a pretty severe fork, as my project makefiles and utility set up are a bit different, but you'll be able to more easily perform pull requests. What do you think?

Things I'm considering trying out in the future:
1) Inlining most sys calls into assembly macros. There's no need for separate symbols most of the time and I feel inline assembly will be more lightweight and straightforward. Won't need all those separate .S files!

2) Incorporating much stronger commenting in the fxcg headers, based on the WikiPrizm data. It's sort of a pain to constantly reference an outside resource for parameter description, etc.

3) Having better crash handling tools integrated. Right now my best approach to odd crashes is to output the linker map and find where the PC indicated in the dialog lies. Would be nice to incorporate a better crash handler that walks an output map for us, possibly scanning the stack for program area pointers to indicate as much as possible.
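
To make idea 3 a bit more concrete, here is a minimal sketch of the lookup step only: given a crash PC, find the function that contains it in a table derived from the linker map. The `MapSymbol` type, the `findSymbol` helper, and the table entries are all hypothetical (the names and addresses are made up), and generating the table from the real map file is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>

// Hypothetical symbol record. In practice the table would be generated
// from the linker map at build time, sorted by ascending address.
struct MapSymbol {
    std::uint32_t address;  // start address of the function
    const char*   name;     // symbol name as it appears in the map
};

// Illustrative entries only; these names and addresses are made up.
static const MapSymbol kSymbols[] = {
    {0x00300000u, "_main"},
    {0x00300120u, "_renderFrame"},
    {0x00300480u, "_cpuStep"},
};

// Return the symbol with the greatest start address <= pc,
// or nullptr when pc lies below the first symbol in the table.
inline const MapSymbol* findSymbol(std::uint32_t pc) {
    const MapSymbol* first = std::begin(kSymbols);
    const MapSymbol* last = std::end(kSymbols);
    assert(std::is_sorted(first, last,
        [](const MapSymbol& a, const MapSymbol& b) { return a.address < b.address; }));

    // upper_bound finds the first symbol starting strictly after pc;
    // the entry just before it is the one that contains pc.
    const MapSymbol* it = std::upper_bound(first, last, pc,
        [](std::uint32_t value, const MapSymbol& s) { return value < s.address; });
    return it == first ? nullptr : it - 1;
}
```

The same lookup could be run over every word on the stack that falls inside the program area, which would give a rough call trace in the spirit of the stack-scanning idea.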
Why can't we work on the build system so it's compatible with what you want? Then you don't need a fork at all (which is the goal).

The main difference I see is your utils directory containing extra libraries; for that I think a common makefile interface would do fine. For instance if you want to use the zx7 library, define UTILS=zx7 and the makefile attempts to build and link $PROJECT/utils/zx7 and $SDK/utils/zx7 (in that order) by running make staticlib or something in each directory, then adding a lib subdirectory to the linker search path.

Some of the libraries can reasonably be provided with the SDK, but looking inside the project is the more generally useful feature: then projects can include whatever libraries they need and don't have undocumented dependencies on external code.



Thinking more: there's also no reason a project can't use its own makefile that includes libraries with the project and just include prizm_rules as necessary.

I've gotten a start on a version of nesizm that includes the "util" libraries as submodules rather than using an SDK fork: https://github.com/tari/nesizm

The trick is just in structuring the libraries in a way that they can easily be included, and using a submodule.
I have mixed feelings.

I originally had a submodule-type setup for this, and included these separately in the project folders. I moved to the SDK/utils folder because I found so much common usage between Prizoop and NESizm and realized it could be useful for other people as well (though so far not very many other people are working on anything significant with the SDK anyway).

The other reason I started pushing them all into a single higher-level repo was my Windows simulator, which doesn't have separate libraries and instead just rolls all the utils I've written and added to my version of the SDK along with the fxcg emulation. Each project easily links against that when building for the simulator. This would have to be reworked as a separate project that also submodules the same way, which would still need to be linked externally from the main app repos, and I would have 3 clones of each submodule in my SDK folder tree just for NESizm and Prizoop. It feels messy.
Hi Tari! In the interim, I've pushed my commits to PrizmSDK using your latest, with NESizm and Prizoop working with it. :)

I did encounter a number of issues I had to work around:

1) There is a pretty big bug related to the memcpy builtin I think. If you revert my change to imageDraw.cpp in my latest commit to NESizm, this image blit function that uses memcpy fails at O2 and Os but does not fail at O3. I prefer O2 for the rest of the application performance so I committed this work around until we can figure it out. There's another memcpy fixed in that commit but it was a bug with loading settings that was much more subtle, appears to be the same issue though.

2) In the Prizoop makefile for my latest commit, if I don't add -std=c++11 to my C++ compiler flags, it outputs a linker error on the delete() intrinsic (I'm not entirely sure, just a hunch) due to a simple struct delete I am doing in cgbCleanup() in cgb.cpp. So I'm guessing the default C++ std needs some additional library or defined function to always work properly.

3) I'm still not clear on why malloc() and free() do not work with the new SDK. Both projects are defining malloc as sys_malloc and free as sys_free. This isn't ideal and fairly non-portable IMO. Am I missing something? See either project's change to platform.h

4) With the way RM is being defined per platform, rm and rmdir would cause make to fail pretty easily on a second build attempt, because these calls toss an error if the file doesn't exist and halt the makefile. See my changes to those defined functions in prizm_rules for something that works, but that I don't think is ideal.

5) The Bfile_GetBlockAddress sys call is vital to file system performance in real time applications and should be included in the SDK.

6) This commit is a simple CG-20 crash fix for sprite routines for any project built against the SDK. 0xA8000000 hard coding should be highly discouraged.
tswilliamson wrote:
1) There is a pretty big bug related to the memcpy builtin I think. If you revert my change to imageDraw.cpp in my latest commit to NESizm, this image blit function that uses memcpy fails at O2 and Os but does not fail at O3. I prefer O2 for the rest of the application performance so I committed this work around until we can figure it out. There's another memcpy fixed in that commit but it was a bug with loading settings that was much more subtle, appears to be the same issue though.
memcpy is delegated to sys_memcpy so I can believe that might not work correctly. Might be helpful to compare the object code to see what it's doing differently to check that it's actually the syscall being weird. Filed a bug to investigate.

Quote:
2) In the Prizoop makefile for my latest commit, if I don't add -std=c++11 to my C++ compiler flags, it outputs a linker error on the delete() intrinsic (I'm not entirely sure, just a hunch) due to a simple struct delete I am doing in cgbCleanup() in cgb.cpp. So I'm guessing the default C++ std needs some additional library or defined function to always work properly.
new/delete requires library support, yes. You need to implement that yourself by defining an overload or (as you're doing) depend on optimization to elide calls to a library that we don't provide.

Quote:
3) I'm still not clear on why malloc() and free() do not work with the new SDK. Both projects are defining malloc as sys_malloc and free as sys_free. This isn't ideal and fairly non-portable IMO. Am I missing something? See either project's change to platform.h
I assume the existing aliasing issue would help with this, but I haven't explored the actual issue.

Quote:
4) With the way RM is being defined per platform, rm and rmdir would cause make to fail pretty easily on a second build attempt, because these calls toss an error if the file doesn't exist and halt the makefile. See my changes to those defined functions in prizm_rules for something that works, but that I don't think is ideal.
Huh? It's specifically avoiding trying to delete files that don't exist, I don't see the difference. Can you provide the output of it doing the wrong thing?

Quote:
5) The Bfile_GetBlockAddress sys call is vital to file system performance in real time applications and should be included in the SDK.

6) This commit is a simple CG-20 crash fix for sprite routines for any project built against the SDK. 0xA8000000 hard coding should be highly discouraged.
Cherry-picked these.
Quote:
5) The Bfile_GetBlockAddress sys call is vital to file system performance in real time applications and should be included in the SDK.

How does this syscall work? Do you know how it was discovered?

BFile performance is a real pain to deal with on the fx-CG 50 so if there are known ways to make it faster I'd really like to know. ^^
Quote:
Huh? It's specifically avoiding trying to delete files that don't exist, I don't see the difference. Can you provide the output of it doing the wrong thing?


There's no specific error code until it's too late. Specifically, if I do an rm call with multiple files, like this:

$(call rm,$(OUTPUT).bin $(OUTPUT_FINAL).g3a $(OUTPUT_FINAL)_cg10.g3a)

then "if exist" receives multiple arguments, which just silently fails on Windows: the files are not deleted, regardless of whether they exist or not.

Btw, I fixed the delete issue by implementing the sized versions of delete() and delete[]() operators in my project, so it's back on C++14 again. I don't think that this is good behavior. Base implementations of these operators should be expected to be part of the base SDK, especially for new programmers using C++ for the first time.
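
For reference, a minimal sketch of what such project-side operators can look like. This is not the code from either project: `std::malloc` and `std::free` are stand-ins for whatever allocator the build actually uses (the thread mentions sys_malloc and sys_free), and aborting on exhaustion is an assumption made because no exception support is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Assumption: std::malloc/std::free stand in for the platform allocator.
void* operator new(std::size_t size) {
    void* p = std::malloc(size != 0 ? size : 1);
    assert(p != nullptr);             // debug builds: flag exhaustion loudly
    if (p == nullptr) std::abort();   // release builds: never return null
    return p;
}
void* operator new[](std::size_t size) { return operator new(size); }

// Unsized forms, used by C++11 and by C++14 when the size is unknown.
void operator delete(void* p) noexcept { std::free(p); }
void operator delete[](void* p) noexcept { std::free(p); }

// Sized forms, which a C++14 compiler may call instead of the unsized
// ones. The size is ignored here because free() does not need it.
void operator delete(void* p, std::size_t) noexcept { std::free(p); }
void operator delete[](void* p, std::size_t) noexcept { std::free(p); }
```

With all six defined, a plain `delete` of a struct should link whether or not the compiler chooses the sized form.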

Regarding sys_memcpy: is this different from 0.3? I haven't had any memcpy issues before. That being said, I have rolled my own a couple of times because the existing memcpy is very slow. I'd be happy to volunteer a new implementation in assembly if you can get it incorporated to the base SDK.
Quote:
$(call rm,$(OUTPUT).bin $(OUTPUT_FINAL).g3a $(OUTPUT_FINAL)_cg10.g3a)
Oh, sure. It's not meant to take more than one file. Try looping over them:
Code:
$(foreach f,file1 file2 file3,$(call rm,$(f)))
(this could easily be made another function I guess)

Quote:
Regarding sys_memcpy: is this different from 0.3? I haven't had any memcpy issues before. That being said, I have rolled my own a couple of times because the existing memcpy is very slow. I'd be happy to volunteer a new implementation in assembly if you can get it incorporated to the base SDK.
Blame says it hasn't changed in 9 years, looks like it was always that way. But the compiler might be doing something different now. I'm happy to include a new implementation though.
@Lephe

I found out about Bfile_GetBlockAddress via Simon Lothar on Casiopeia.

It returns direct pointers to the 4 KB blocks of the given open file handle. A direct memory copy from those blocks then works and is typically much faster than a file read, which probably does a bit of buffering and copying around. Very often these block pointers are sequential, so you can copy even larger spans at a time if desired, especially via DMA in some applications (not mine). It has two very strict limitations:

1) Read only
2) If ANY changes to the file system happen in the meantime, the block addresses become invalid. So you have to be careful with your syscalls while they are in flight. Even FindFile can cause some file reorganization, I've found.
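
A rough sketch of the read pattern this implies, copying across block boundaries one 4 KB block at a time. The real syscall's signature is not given in this thread, so `BlockLookup` below is a hypothetical stand-in that maps a block-aligned file offset to the address of the block holding it; `readViaBlocks` and `kBlockSize` are likewise names invented for the illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

constexpr std::size_t kBlockSize = 4096;  // 4 KB blocks, as described above

// Hypothetical stand-in for the syscall: returns the address of the block
// that starts at the given block-aligned file offset, or nullptr on failure.
using BlockLookup = const unsigned char* (*)(int handle, std::size_t blockOffset);

// Copy `size` bytes starting at file offset `offset` into `dest`.
// Returns the number of bytes copied (fewer than `size` if a lookup fails).
inline std::size_t readViaBlocks(BlockLookup lookup, int handle,
                                 std::size_t offset, void* dest, std::size_t size) {
    assert(lookup != nullptr && dest != nullptr);
    unsigned char* out = static_cast<unsigned char*>(dest);
    std::size_t copied = 0;
    while (copied < size) {
        const std::size_t pos = offset + copied;
        const std::size_t inBlock = pos % kBlockSize;
        // Blocks are not guaranteed to be contiguous, so look up each one.
        const unsigned char* block = lookup(handle, pos - inBlock);
        if (block == nullptr) break;
        const std::size_t n = std::min(kBlockSize - inBlock, size - copied);
        std::memcpy(out + copied, block + inBlock, n);
        copied += n;
    }
    return copied;
}
```

Because of limitation 2, nothing that touches the file system should run between the lookup and the copy.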
Ok cool Tari, I will work on a new implementation and smoke test it with NESizm and Prizoop first.

Specifically, the OS implementation does a very naive single-byte copy and doesn't bother to check alignment to see whether the copy can be done by word or dword. I already have my own memcpy routine that copies 32 bytes at a time and is at least 4x as fast.
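
To illustrate the alignment check in portable terms, here is a minimal C++ sketch (not the assembly routine mentioned above). `copyAligned` and `word_t` are names made up for this example, and `may_alias` is a GCC/Clang extension used so that the 32-bit accesses are allowed to alias the byte buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// GCC/Clang extension: a 32-bit type that is allowed to alias other types.
typedef std::uint32_t __attribute__((may_alias)) word_t;

inline void* copyAligned(void* dest, const void* src, std::size_t n) {
    assert(n == 0 || (dest != nullptr && src != nullptr));
    unsigned char* d = static_cast<unsigned char*>(dest);
    const unsigned char* s = static_cast<const unsigned char*>(src);

    const std::uintptr_t da = reinterpret_cast<std::uintptr_t>(d);
    const std::uintptr_t sa = reinterpret_cast<std::uintptr_t>(s);

    // Word copies are only possible when both pointers share the same
    // misalignment, so that they reach a 4-byte boundary together.
    if (((da ^ sa) & 3u) == 0) {
        // Byte-copy the head until both pointers are 4-byte aligned.
        while (n > 0 && (reinterpret_cast<std::uintptr_t>(d) & 3u) != 0) {
            *d++ = *s++;
            --n;
        }
        // Copy the bulk one 32-bit word at a time.
        word_t* dw = reinterpret_cast<word_t*>(d);
        const word_t* sw = reinterpret_cast<const word_t*>(s);
        for (; n >= 4; n -= 4) *dw++ = *sw++;
        d = reinterpret_cast<unsigned char*>(dw);
        s = reinterpret_cast<const unsigned char*>(sw);
    }
    // Byte-copy the tail, or everything when the alignments differ.
    for (; n > 0; --n) *d++ = *s++;
    return dest;
}
```

When source and destination are misaligned relative to each other, this version simply falls back to the byte loop; a tuned routine would handle that case with shifted or unaligned word moves instead.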
I see, so it's basically the information returned by BFile_FindFirst() except that it works for an open file. Very interesting, thanks!

-

Not sure if it can help, but I have written and thoroughly validated an efficient memcpy that can use 4-aligned moves, unaligned longword moves, and 2-aligned moves for all (including unaligned) inputs. On SH4 it uses longword accesses even without alignment. (The test is here; it checks various buffer sizes for every combination of source and destination alignment, and a few other parameters.) There's also memmove, memcmp and memset in the same style.

By the way, if this is possible within the PrizmSDK, the DMA provides a significant speed boost in 32-aligned memcpy and memset operations. For instance a full VRAM memset (177k) takes only 2.5ms using a small source buffer in ILRAM, as opposed to 6.1ms with a hand-written + unrolled assembly loop.
@Lephe that's a great start for the generic memcpy. Can I suggest an improvement and then we can all use the same one?

For larger aligned copies this memcpy routine is still fairly slow. Before starting the dword loop, it could copy 32 bytes at a time until fewer than 32 bytes are left. A 32-byte chunk copy with an unrolled routine using indexed loads, similar to my fast 32-byte block copy here, would be more than twice as fast.
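
In C++ terms the suggestion looks roughly like the sketch below; the actual routines in question are SH4 assembly, so this only shows the shape of the loop. `copyBlocks32` and `word_t` are made-up names, the alignment preconditions are assumptions of the sketch, and `may_alias` is a GCC/Clang extension.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// GCC/Clang extension: a 32-bit type that is allowed to alias other types.
typedef std::uint32_t __attribute__((may_alias)) word_t;

// Assumed preconditions: dest and src are 4-byte aligned, n is a multiple of 4.
inline void copyBlocks32(void* dest, const void* src, std::size_t n) {
    assert((reinterpret_cast<std::uintptr_t>(dest) & 3u) == 0);
    assert((reinterpret_cast<std::uintptr_t>(src) & 3u) == 0);
    assert((n & 3u) == 0);

    word_t* d = static_cast<word_t*>(dest);
    const word_t* s = static_cast<const word_t*>(src);

    // Unrolled 32-byte chunks: eight indexed loads, then eight indexed
    // stores, with a single pointer update per chunk.
    for (; n >= 32; n -= 32, d += 8, s += 8) {
        const word_t w0 = s[0], w1 = s[1], w2 = s[2], w3 = s[3];
        const word_t w4 = s[4], w5 = s[5], w6 = s[6], w7 = s[7];
        d[0] = w0; d[1] = w1; d[2] = w2; d[3] = w3;
        d[4] = w4; d[5] = w5; d[6] = w6; d[7] = w7;
    }

    // The ordinary word loop handles the remaining 0 to 28 bytes.
    for (; n >= 4; n -= 4) *d++ = *s++;
}
```

The compiler is free to reorder these loads and stores, so any real gain has to be measured on the target rather than assumed from the source layout.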

Re: Bfile_GetBlockAddress: the biggest difference from FindFirst is that this returns block-by-block pointers, since the file isn't necessarily contiguous in ROM.

Re: DMA. I use ILRAM in a similar way in both NESizm and Prizoop to obtain high frame rates, you should look at it. Specifically, I found that using DMA direct to the display from the ILRAM appears to circumvent the main bus altogether. It's how I got to 60 FPS on the CG-10 with a full screen refresh.
To clarify, by DMA direct to device I mean direct to the hardware interface, NOT to VRAM.

Specifically here:
https://github.com/tswilliamson/nesizm/blob/master/src/scanline_dma.cpp

The scan buffer pointer is mapped directly to ILRAM, and you can see the DMA doing the display interface put.
Yes sure, I definitely want the fastest version in my programs! ;D

That's an interesting option! It does seem much faster, though did you time it? I have found more than once that the RAM bandwidth is not enough to match the CPU's processing speed, for instance my image rendering routine goes at the same speed regardless of whether I do an extra comparison for transparency at each pixel (which even includes a branch!).

I'll be sure to try it and time it either way.

Quote:
Re: DMA. I use ILRAM in a similar way in both NESizm and Prizoop to obtain high frame rates, you should look at it. Specifically, I found that using DMA direct to the display from the ILRAM appears to circumvent the main bus altogether. It's how I got to 60 FPS on the CG-10 with a full screen refresh.

I do use the DMA a lot. :) However, the ILRAM is way too small to hold a complete VRAM (4k). Do you stream your frames?

(Also your scanGroup array points to X and Y memory (8k each), it's not on the same bus as the ILRAM as it can be efficiently used by the DSP. Maybe this is just a naming difference.)
No, you are right, my bad: it's the X/Y memory that circumvents the bus. It is a very small memory area, but that suits an emulator, since the emulation is rendered in individual scan lines. I do think there are ways to use the same approach for other rendering techniques or special effects!

I did time the memcpy, though that was compared to sys_memcpy or a naive C loop, against which it is definitely much faster :) The CPU has limited caching, so I wouldn't think a memory wait would hit every loop iteration. I wrote this routine specifically to copy from the BFile blocks into RAM for my NES cart data. Let me know if it doesn't show any improvement on your end.
So I implemented and timed the version with your change. As I half-expected, it's not any faster in standard RAM; the memcpy() for 16 kB of VRAM into another 16 kB of VRAM takes 1605 µs with both versions. That's 10.2 MB/s, for reference.

It seems pretty clear to me that memory bandwidth is the limiting factor. For instance, notice the RAW dependency between the load and the write. Because of this dependency, the pipeline has to wait for the WB stage of the read before executing the ID stage of the write, which is a 3-cycle loss. But rearranging the order of instructions to alternate operations on the memory and on the size counter, which improves the pipeline, does not make any difference, with a consistent 1605 µs.

However, that's another story in the faster memory areas. My version copies at 33.0 MB/s in the ILRAM and yours reaches 36.6 MB/s. I did not foresee this speed difference in memory access, and your optimization uses it elegantly!

Measures for the XRAM and YRAM are even more extreme. Your version consistently takes 20 µs to copy 4 kB (the test program copies the first half of each area into the second half). (That's 200 MB/s; my version only reached 90 MB/s.) This is a dubious measure at first, because at the normal CPU frequency of 113 MHz, that's only 2260 CPU cycles, which is barely enough to read 2*1024 longwords plus 212 cycles of overhead. Since the RAM bus is free for code access, this would technically be feasible, and the overhead of your 32-byte loop is very small. But I don't understand yet how the pipeline can be tight enough for this level of speed.

If you have any other insight to share, that'd be appreciated. The performance measurement method is just the TCNT register of a TMU set at Pϕ/4, which is a resolution of 0.14 µs in my testing conditions; it has been very accurate and consistent for the few months I've been using it.
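
For anyone who wants to check the figures, the arithmetic can be written out as below. This reads the timings as microseconds, takes 1 kB as 1024 bytes and 1 MB as 10^6 bytes (the convention that reproduces the 10.2 MB/s figure), and uses helper names invented for the sketch.

```cpp
#include <cassert>
#include <cmath>

// Throughput in MB/s (1 MB = 10^6 bytes) for `bytes` copied in `microseconds`.
// Bytes per microsecond is numerically the same as 10^6 bytes per second.
constexpr double megabytesPerSecond(double bytes, double microseconds) {
    return bytes / microseconds;
}

// CPU cycles elapsed during `microseconds` at a clock of `mhz` MHz.
constexpr double cyclesElapsed(double microseconds, double mhz) {
    return microseconds * mhz;
}

// Re-derive the numbers quoted in the post above.
inline void checkFigures() {
    // 16 kB in 1605 us is about 10.2 MB/s.
    assert(std::fabs(megabytesPerSecond(16 * 1024, 1605) - 10.2) < 0.05);
    // 4 kB in 20 us is 204.8 MB/s, quoted as roughly 200 MB/s.
    assert(std::fabs(megabytesPerSecond(4 * 1024, 20) - 204.8) < 1e-9);
    // 20 us at 113 MHz is 2260 cycles: 2*1024 accesses plus 212 of overhead.
    assert(cyclesElapsed(20, 113) == 2 * 1024 + 212);
}
```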
This sounds right to me. I only compared against a C-implemented memcpy and sys_memcpy, so it's unsurprising that your method is fine for a standard RAM-only use case but slower in the other memory areas.

Hmmm.. like I said there is a terrible little cache in this CPU, I think it is 32 bytes? So we might be "thrashing the cache" by moving back and forth. What happens if we trash more registers and keep access contiguous, something like this? Using whichever registers.. I just adapted the inline code sample I had:


Code:
mov.l @%[src]+,r0
mov.l @%[src]+,r2
mov.l @%[src]+,r3
mov.l @%[src]+,r4
mov.l r0,@(0,%[dest])
mov.l r2,@(4,%[dest])
mov.l r3,@(8,%[dest])
mov.l r4,@(12,%[dest])
Oh sorry, I see what you were asking.

Don't forget the instruction read itself. There is a separate, also very small, cache for CPU instructions, which is invalidated immediately on a branch. So long continuous functions, like an unrolled loop that operates on the X and Y memory, actually spend most of their time running on the chip itself with no outside access!
  