Hello!
I've been making this simple 3D racing game where you can connect another calculator to play together.
I made it for the fx-CG50, but it should work on other fx-CG calculators.



More information and the source code are available at https://github.com/duarteapcoelho/prizm_racing
You can get it from the GitHub page or from here: http://ceme.tech/DL2319
Great work on this, duartec! You were receiving quite a number of kudos in SAX yesterday when this was accepted into the archives, and you got us talking about 3D or pseudo-3D racing games like Lotus Turbo Challenge for the TI-86. Good luck with any of the improvements you mention in the Github repo that you decide to pursue.
Thanks!
I've been looking into the improvements you mentioned, but it looks like they don't have as much impact on performance as I thought (in the best case, from 21 to 25 FPS), and I don't even know if it's possible to clear the whole screen using DMA, so I probably won't be changing much, if anything.

I'd like to point out that it's possible to make games much better than this for a calculator.
- The graphics are drawn using triangles, just like modern 3D games, so it's possible to have more things above the ground.
- It's also possible to have lots of objects (you can see the cones in the screenshots), as long as they aren't too complex (>100 triangles) or big. A single complex or big object is fine though (the car is made of 230 triangles). You could also use other techniques like culling to improve performance for more complex scenes.
- The time spent on game logic is insignificant, so it's possible to make it much more complex.
You can look at my code, but there's a lot to be improved, especially for more complex games.

If anyone has a suggestion for this game, I'd like to hear it here or in a GitHub issue. I won't add anything too big like a track editor though.
Have you considered compromising the visuals a bit by using billboards instead of 3D models for the cones and any other trackside props you'd like the game to have?
Disabling the cones doesn't change the framerate measurably, so using billboards wouldn't improve performance much either. Also, replacing the car with images wouldn't work for multiplayer and wouldn't allow some effects like having the camera follow the car smoothly.
Looks amazing, do you have any video of it in action?
I am very impressed by this game, although I haven't gotten the chance to test its multiplayer feature yet. I like the shading especially.

On the fx-CG10 I prefer using the numpad to accelerate. For some reason, pressing up on that calculator often presses left or right inadvertently because of the poor d-pad design, so the car keeps turning by itself when accelerating.
tr1p1ea wrote:
Looks amazing, do you have any video of it in action?

I added a gif.
Also, I'm trying to port it to the TI-84 Plus CE, so that more people can play it. I don't know if multiplayer will work on it though.
Brilliant! Your 3D graphics look awesome.

Quote:
(...) and I don't even know if it's possible to clear the whole screen using DMA

It is, and it should take about 2.5 ms.

I'd be happy to contribute some performance insights! I helped Heath get this lovingly similar prototype to 60 FPS and I'm sure there are possible improvements here too!

I would love a more detailed explanation of the technical stuff going on, too.
That would be great!

Is using DMA to clear the depth buffer possible too? It's a char array defined in main.
Also, do you know anything about the indexed color mode mentioned in the prizm wiki, or about whether it's possible to draw directly to the screen, to avoid having to copy the VRAM?
What about gint? Could it improve performance?

There is some information about the rendering in the GitHub, but if there's anything else you'd like to know, feel free to ask.
Quote:
Is using DMA to clear the depth buffer possible too? It's a char array defined in main.

Yup, assuming it's 32-aligned for best results.

Quote:
Also, do you know anything about the indexed color mode mentioned in the prizm wiki, or about whether it's possible to draw directly to the screen, to avoid having to copy the VRAM?

The display has this "indexed color mode" which is basically an 8 color mode. I believe its primary purpose is to reduce power consumption, because there is way less data to transfer. Most of the OS uses it. I never played with it, as 8 colors is too limited for my taste...

Drawing directly to the screen would be terribly slow. It's much faster to render to VRAM first and send that VRAM!

Quote:
What about gint? Could it improve performance?

I think a fair assessment would go like this:
- In the short term it helps by giving you helpful APIs. For instance you get a dma_memset() function and a fast image renderer transparently. (Though you need to set up the SDK and everything.)
- In the medium term it won't help because you'll be optimizing your computations, memory access patterns, and 3D renderer, and these are just raw programming.
- If you want to scale to the highest degree, it will become most valuable by giving you custom drivers, cycle-accurate performance metrics, malloc() on fast memory, things like this.

The best I got for close-to-3D rendering is a screen full of triangles rendering at 30 FPS (no overclock) using no-VRAM no-DMA wizardry that gint essentially enabled. (With just a couple platforms it'd go up to 100 FPS.)

Your rasterizer looks quite clean, I'll certainly be looking for inspiration... I'd really like to get my triangle shader to render depth so I can get both a depth buffer and texturing, but I'm not super comfortable with the fine details yet, so having a reference will certainly help a ton!

Anyway, do you have any profiling info on your program so far? What's taking the longest time? I'm suspecting filling the road and grass areas might take longer than what we'd reasonably expect compared to the complicated car model.
Lephe wrote:
Anyway, do you have any profiling info on your program so far?

The RTC doesn't have enough accuracy to properly profile it, but there are some things I could notice:
- Clearing the screen takes a lot of time
- Almost all of the time is spent on rendering
I'll try using gint to profile better.

Here are a few optimizations I'm going to try:
- Use your dma_memset function for clearing
- Clear only a portion of the screen (the lower region of the sky and upper region of the grass)
- Draw grass without using the 3D renderer
- Use a better algorithm for rasterizing triangles (I'm looking into one which is based on Bresenham's line algorithm, and does all the rendering using only integers)

The triangle rendering loop looks something like this (in pseudocode):

Code:
for each y:
  calculate minimum and maximum X
  for x in range(min, max):
    if z < getDepthBuffer(x, y):
      setDepthBuffer(x, y, z)
      drawPixel(x, y)

You can see that for every pixel it draws, it reads from and writes to the depth buffer.
I'll try improving this too.
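
For reference, the inner loop of the pseudocode above could look roughly like this in C. This is only a sketch, not the game's actual code: the name fill_span, the 384x216 render size, the 16-bit color type and the 32-bit depth type are all assumptions made for illustration. It uses one depth value for the whole span and computes the row offset once per span instead of once per pixel.

```c
#include <assert.h>
#include <stdint.h>

#define RENDER_WIDTH  384   /* assumed render size, not taken from the game */
#define RENDER_HEIGHT 216

/* Fill one horizontal span [x_min, x_max) of a triangle with a flat color
   and a single depth value z, skipping pixels that already hold something
   nearer. fill_span is a hypothetical name used for this sketch only. */
static void fill_span(uint16_t *vram, int32_t *depth,
                      int y, int x_min, int x_max,
                      int32_t z, uint16_t color)
{
    assert(vram != 0 && depth != 0);

    /* Clip the span to the screen. */
    if (y < 0 || y >= RENDER_HEIGHT) return;
    if (x_min < 0) x_min = 0;
    if (x_max > RENDER_WIDTH) x_max = RENDER_WIDTH;
    if (x_min >= x_max) return;

    /* Compute the row offset once instead of once per pixel. */
    uint16_t *px = vram  + y * RENDER_WIDTH + x_min;
    int32_t  *dz = depth + y * RENDER_WIDTH + x_min;

    for (int x = x_min; x < x_max; x++, px++, dz++) {
        if (z < *dz) {      /* one depth-buffer read */
            *dz = z;        /* one depth-buffer write */
            *px = color;    /* one VRAM write */
        }
    }
}
```

Both pointers advance through memory in the same order, so each drawn pixel costs one read and two sequential writes.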

Lephe wrote:
I'd really like to get my triangle shader to render depth so I can get both a depth buffer and texturing

Note that I'm not calculating the correct depth for each pixel, as that would require perspective-correct interpolation. I use the average instead (which isn't perfect, but it's good enough for me).
For perspective-correct interpolation, which is needed for this and for anything where colors change per pixel, like texturing and point/spot/specular lighting, I read that you need to use barycentric coordinates. I tried implementing it, but in the end, I couldn't get it to work well because of fixed point numbers, and it was too slow. There were lots of calculations being done per pixel.
There were still some things that could be optimized, so maybe I'll try again soon.
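
To illustrate what perspective-correct interpolation means for depth, here is a minimal sketch of the two-endpoint case. It relies on the standard result that 1/z, not z itself, varies linearly in screen space. It uses floats for readability, which is an assumption: the game uses fixed point, and the function names are made up for this example. The barycentric version applies the same idea with three weights instead of two.

```c
#include <assert.h>

/* Depth at screen-space fraction t (0..1) between two points whose
   view-space depths are z0 and z1 (both must be positive).
   1/z is linear in screen space, so interpolate 1/z and invert. */
static float depth_perspective(float z0, float z1, float t)
{
    assert(z0 > 0.0f && z1 > 0.0f);
    float inv_z = (1.0f - t) / z0 + t / z1;
    return 1.0f / inv_z;
}

/* Naive screen-space interpolation of z, shown for comparison. */
static float depth_linear(float z0, float z1, float t)
{
    return (1.0f - t) * z0 + t * z1;
}
```

Halfway between depths 1 and 3, the naive version gives 2 while the perspective-correct one gives 1.5. The division per pixel is part of what makes this expensive.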
This is amazing! I would love to see this ported to the TI-84 Plus CE.
Quote:
The RTC doesn't have enough accuracy to properly profile it, but there are some things I could notice:
- Clearing the screen takes a lot of time
- Almost all of the time is spent on rendering
I'll try using gint to profile better.

Well 128 Hz is a start I guess. If you're trying gint you can start a high-precision timer, determine its clock speed, and do the same as with the RTC. There's a library that implements that.

Quote:
Here are a few optimizations I'm going to try:
- Use your dma_memset function for clearing
- Clear only a portion of the screen (the lower region of the sky and upper region of the grass)
- Draw grass without using the 3D renderer
- Use a better algorithm for rasterizing triangles (I'm looking into one which is based on Bresenham's line algorithm, and does all the rendering using only integers)

I'm not surprised rendering is the bottleneck (it always is tbh). Clearing the sky and ground to an opaque color with dma_memset() will be quite a bit faster than doing it with the CPU (between 1.5x and 3x faster) and obviously even faster than using the rasterizer. Note that dma_memset() needs the buffer to be 32-aligned and its size to be a multiple of 32, which means that for the VRAM you must align on groups of 4 lines. (You can always fill in the rest by hand.)
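
The "fill in the rest by hand" part can be sketched as follows. This is a portable illustration, not gint code: block_fill is a hypothetical stand-in for the DMA fill so the example runs anywhere, fill32 is a made-up name, and reading "32-aligned" as 32-byte alignment is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK 32u   /* assumed alignment and size granularity, in bytes */

/* Hypothetical stand-in for the DMA fill. On the calculator, the actual
   DMA call would go here; this version only enforces the same constraints. */
static void block_fill(void *dst, uint32_t pattern, size_t size)
{
    assert(((uintptr_t)dst % BLOCK) == 0 && (size % BLOCK) == 0);
    uint32_t *p = dst;
    for (size_t i = 0; i < size / 4; i++) p[i] = pattern;
}

/* Fill `size` bytes at `dst` (4-byte aligned, size a multiple of 4) with a
   32-bit pattern: plain CPU writes for the edges, block fill for the middle. */
static void fill32(void *dst, uint32_t pattern, size_t size)
{
    uint8_t *start = dst, *end = start + size;
    assert(((uintptr_t)start % 4) == 0 && (size % 4) == 0);

    /* Round the start up and the end down to BLOCK boundaries. */
    uint8_t *body = start + ((BLOCK - (uintptr_t)start % BLOCK) % BLOCK);
    uint8_t *tail = end - ((uintptr_t)end % BLOCK);

    if (body > end || body >= tail)   /* no full block fits: CPU only */
        body = tail = end;

    for (uint32_t *p = (uint32_t *)start; p < (uint32_t *)body; p++) *p = pattern;
    if (tail > body) block_fill(body, pattern, (size_t)(tail - body));
    for (uint32_t *p = (uint32_t *)tail; p < (uint32_t *)end; p++) *p = pattern;
}
```

If the buffer is declared with suitable alignment and its size is already a multiple of 32, the two edge loops do nothing and the whole fill goes through the block path.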

Quote:
You can see that for every pixel it draws, it reads from and writes to the depth buffer.
I'll try improving this too.

I'm not sure you can do much about that. Note that writes are about 10x slower than reads due to bad cache behavior. I think you can just try and improve your memory access patterns first. Obviously eliminating redundant index computations is going to help, especially with multiplications.

At default overclock level the RAM can process a write every 13-14 clock cycles. Reads are fast. So if you can streamline iterations on both the z-buffer and the VRAM you'll likely improve the pixel output rate of the rasterizer. Looking at your code it is clearly intended to do that already, but I'd rather check the assembler code than blindly trust the optimizer to get everything correct. (That particular rabbit hole ends with this kind of code xP).

Quote:
For perspective-correct interpolation, which is needed for this and for anything where colors change per pixel, like texturing and point/spot/specular lighting, I read that you need to use barycentric coordinates. I tried implementing it, but in the end, I couldn't get it to work well because of fixed point numbers, and it was too slow. There were lots of calculations being done per pixel.

My triangle renderer (linked above) does use barycentric coordinates with fixed-point already. It doesn't calculate the depth yet though, and perspective-correct depth will indeed be required. This will definitely lose some performance, but I think it's manageable especially on half-resolution. I've also played with sampling based on fixed-point values in some image rotation code, so I'm hoping it'll work out.
I only optimized a bit (removed the index computations you mentioned), but here are some performance measurements:
- Clearing the screen and depth buffer (no DMA yet): 4 ticks
- Drawing the car: 1 tick
- Drawing the track: 1 tick
- Drawing cones: 1 tick

Rendering the whole frame takes 8 ticks, and half the time is spent clearing the screen!
If dma_memset is 2x faster, it will take 6 ticks to render, which is 21 fps.
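
The arithmetic behind these numbers, as a small sketch. It assumes one tick is 1/128 s (the RTC rate mentioned earlier in the thread); the function names are made up for illustration.

```c
/* Assumption: one tick is 1/128 s, the RTC rate quoted in the thread. */
#define TICKS_PER_SECOND 128.0

/* Frame time in ticks if only the clearing step gets faster. */
static double frame_ticks(double clear_ticks, double other_ticks,
                          double clear_speedup)
{
    return other_ticks + clear_ticks / clear_speedup;
}

/* Frames per second for a given frame time in ticks. */
static double fps_from_ticks(double ticks_per_frame)
{
    return TICKS_PER_SECOND / ticks_per_frame;
}
```

With 4 ticks of clearing and 4 ticks for everything else, frame_ticks(4, 4, 1) is 8 ticks, or 16 fps, and frame_ticks(4, 4, 2) is 6 ticks, or about 21.3 fps.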

Is there any dma_memset function that I can use without gint?
Aha! Interesting that it takes so long. You seem to be clearing the depth buffer manually with 8-bit accesses? According to my benchmarks that is about 8.24 MB/s whereas a long-based clear would be close to 30 MB/s. Make sure to do that or use memset() which is likely to do it properly for you.
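
A minimal sketch of the difference between the two access patterns, in portable C. The function names are made up for illustration, and the throughput figures above are not something this code measures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Slow pattern: one store per byte. */
static void clear_bytes(void *buf, uint8_t value, size_t size)
{
    uint8_t *p = buf;
    for (size_t i = 0; i < size; i++) p[i] = value;
}

/* Faster pattern: one store per 32-bit word, so a quarter as many writes.
   Requires a 4-byte aligned buffer whose size is a multiple of 4. */
static void clear_words(void *buf, uint8_t value, size_t size)
{
    assert(((uintptr_t)buf % 4) == 0 && (size % 4) == 0);
    uint32_t word = value * 0x01010101u;   /* replicate the byte 4 times */
    uint32_t *p = buf;
    for (size_t i = 0; i < size / 4; i++) p[i] = word;
}
```

Both functions leave the buffer in the same state. Calling the standard memset(buf, value, size) is the one-line alternative.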

As for the DMA version, it's a bit unclear. The only DMA code I am aware of in libfxcg is SimLo's background display update. You can use the DMA in libfxcg (provided you don't use files at the same time) but you'd have to drive it yourself. If you want to try that you can reuse gint's driver because dma_memset() is a basic use case so you really only need two non-trivial functions, dma_transfer_atomic() and dma_setup(). The <gint/mpu/dma.h> header can be copied verbatim. Also compile with -fstrict-volatile-bitfields to avoid any compiler optimization mishaps.

I forgot to mention earlier that gint doesn't have a serial driver yet so if you try to port you'll have to world switch back to the OS for serial communication.
Lephe wrote:
You seem to be clearing the depth buffer manually with 8-bit accesses?

I said the depth buffer was 8 bit, but it's actually an array of ints (32 bit).

Now I'm trying to use gint but, when I allocate the depth buffer, I get an error. I tried three ways of allocating it:
- In the stack, it doesn't work unless I decrease the resolution to half.
- With malloc and kmalloc, it doesn't work even if I decrease the resolution.

Maybe I could use the memory normally used for double/triple buffering?

Or maybe I'm missing something about kmalloc. What I'm doing is something like this:

Code:
int *depthBuffer = (int*) kmalloc(RENDER_WIDTH*RENDER_HEIGHT*sizeof(int), NULL);
Aha, an 8-bit z-buffer sounded too good to be true for such a clean render, I guess that's why. xD

Ok, so that is either 384×216×4 (331 kB) or 396×224×4 (355 kB). It's large enough that you might run into some limits.

Just so we're clear, the memory you get with libfxcg is essentially:
- 512 kB for your data segment and stack (unused section lost)
- 128 kB for malloc()

The way gint is set up gives you:
- 346 kB for your data segment (unused section goes to malloc() as _uram)
- 16 kB for stack
- 358 kB (_ostk) + 128 kB for malloc()

So you might run out of memory... although with a smaller resolution or with a static allocation, it should just work. What kind of error do you get with a static allocation? Linker error? Runtime error?

Maybe try to malloc() this block as early as possible. We can check how much memory you have left at runtime if needed (see below).

I also pushed a commit (use the dev branch) to rebalance the memory into the following configuration:
- 491 kB for your data segment (unused section goes to malloc() as _uram)
- 16 kB for stack
- 180 kB (_ostk) + 128 kB for malloc()

Which should make it easier to keep a large enough block to get your buffer. Here is how to check how much of the 491 kB malloc() got, and how much is available at any given time:

Code:
#include <gint/kmalloc.h>

kmalloc_arena_t *arena = kmalloc_get_arena("_uram");
kmalloc_gint_stats_t *stats = kmalloc_get_gint_stats(arena);

int total_size = arena->end - arena->start; /* bytes */
int free_memory = stats->free_memory; /* bytes */

You can check the other main arena by substituting "_uram" for "_ostk" in the arena name.
After some testing, I noticed that there is enough memory for the depth buffer, but there isn't enough for the models.
I tested _uram and _ostk, and kmalloc was only using _ostk.

I fixed it by making both the depth buffer and the models static, and now they use memory from both segments.
Now, how do I actually use dma_memset to clear the depth buffer?

The depth buffer is defined like this:

Code:
static fp depthBuffer[RENDER_WIDTH*RENDER_HEIGHT];

"fp" is the fixed point class which only has an int.

If I just do this:

Code:
dma_memset((uint32_t*)depthBuffer, *((uint32_t*)&value), RENDER_WIDTH*RENDER_HEIGHT);

then it crashes and resets the calculator.

If I define it as "static GALIGNED(32) fp ...", then it doesn't crash, but nothing is set.
  