I'm creating a textured ray-cast engine for the Prizm and I think this point-plot routine is a bottleneck on the speed. Would there be a significant speed boost to drawing if the plot routine below was written in assembly?


Code:

inline void plot(int x0, int y0, int color) {
   char* VRAM = (char*)0xA8000000;
   VRAM += 2*(y0*LCD_WIDTH_PX + x0);
   *(VRAM++) = (color&0x0000FF00)>>8;
   *(VRAM++) = (color&0x000000FF);
   return;
}

// this is used to draw scaled stripes of an image on the screen
inline void drawStripe(int x, int y1, int y2, int lineHeight, int texX) {
   int i;
   for(i = y1; i<y2; i++)
   {
      int d = i * 256 - h * 128 + lineHeight * 128;
      int texY = ((d * texHeight) / lineHeight) / 256;
      color_t color = tex1[texHeight * texY + texX];
      plot(x,i,color);
   }
}


Is there any documentation about Prizm assembly or online resources I could use to try and re-write this? I don't have any knowledge about assembly, so I'm unsure where I would start with this.
Actually, you can make your C code much better. Try this:

Code:

inline void plot(int x0, int y0, unsigned short color) {
   unsigned short* VRAM = (unsigned short*)0xA8000000;
   VRAM += (y0*LCD_WIDTH_PX + x0);
   *VRAM = color;
   return;
}

I see the plot code you posted used very widely, but I never understood where it came from. The code I posted does the same thing, but faster.
ProgrammerNerd wrote:
I see the plot code you posted used very widely, but I never understood where it came from.

Probably from the "useful routines" thread.

ProgrammerNerd wrote:
The code I posted does the same thing but faster.


I have replaced the old code with this one in my Utilities add-in, even if drawing, in this add-in, doesn't need to be really fast (hence I'm not making it inline as that would result in a binary size increase).
Thanks for the better optimized routine ProgrammerNerd. I'm still interested if anyone has information on Prizm assembly, however. Would the speed increase make it worth learning?
If you're plotting a crapload of pixels, just inline the memory operation instead of using the function. That'll save more cycles than converting to Assembly.
Ashbad wrote:
If you're plotting a crapload of pixels, just inline the memory operation instead of using the function. That'll save more cycles than converting to Assembly.

See GCC 6.39 An Inline Function is As Fast As a Macro

ProgrammerNerd wrote:
Actually, you can make your C code much better. Try this:

Code:

inline void plot(int x0, int y0, unsigned short color) {
   unsigned short* VRAM = (unsigned short*)0xA8000000;
   VRAM += (y0*LCD_WIDTH_PX + x0);
   *VRAM = color;
   return;
}

I see the plot code you posted used very widely, but I never understood where it came from. The code I posted does the same thing, but faster.


GCC 6.39 wrote:
When an inline function is not static, then the compiler must assume that there may be calls from other source files; since a global symbol can be defined only once in any program, the function must not be defined in the other source files, so the calls therein cannot be integrated. Therefore, a non-static inline function is always compiled on its own in the usual fashion.


From the above, use inline static void plot(int x0, int y0, unsigned short color) {/* ... */} instead.
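For illustration, here is a minimal sketch of that static inline form that can be tried on a PC. The fake_vram array, the VRAM_BASE macro and the fillRow helper are assumptions made for this sketch only: on the Prizm the base would be (unsigned short*)0xA8000000, as in the posts above, and LCD_WIDTH_PX would come from the SDK headers rather than being defined here.

```c
#define LCD_WIDTH_PX  384   /* Prizm screen width; normally provided by the SDK */
#define LCD_HEIGHT_PX 216   /* Prizm screen height; normally provided by the SDK */

/* Stand-in for the real VRAM so the sketch runs on a PC (assumption). */
static unsigned short fake_vram[LCD_WIDTH_PX * LCD_HEIGHT_PX];
#define VRAM_BASE fake_vram

/* static + inline: the compiler may integrate every call in this file
   and does not have to keep a separate global copy of the function. */
static inline void plot(int x0, int y0, unsigned short color) {
   VRAM_BASE[y0 * LCD_WIDTH_PX + x0] = color;
}

/* Hypothetical caller: fills one row, calling plot() once per pixel. */
static void fillRow(int y, unsigned short color) {
   int x;
   for (x = 0; x < LCD_WIDTH_PX; x++)
      plot(x, y, color);
}
```

If plot() is shared between several source files, one option is to put the static inline definition in a header so that each file gets its own inlinable copy.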
Thank you for the clarification; there was a notable speed increase with that change.
Hi, I'm new here, and I just registered to reply.
Just to say that, with some Planète-Casio members, we've been working on an asm pixel plot routine.
Seeing some of the posts above, I don't know whether it would be faster, but I just wanted to say it exists.


Code:
void pixel_asm(unsigned short x, unsigned short y, unsigned short color)
{
    // Relies on the SH calling convention: x in r4, y in r5, color in r6.
    __asm__("mov.w   .width,r1\n"    // width -> r1
            "mulu.w   r1,r5\n"        // width * y -> macl    : Get the y pixel offset
            "mov.l   .vram,r1\n"     // vram -> r1           : Set the VRAM address in r1
            "sts   macl,r0\n"      // macl -> r0           : Get macl in r0
            "add   r4,r0\n"        // r0 + x -> r0         : Get the pixel offset in the VRAM
            "shll   r0\n"           // r0 = r0 << 1         : Multiply by two to get the offset in bytes
            "mov.w   r6,@(r0,r1)\n"  // *(r0+r1) = color     : Set the pixel to the desired color
            "rts\n"
            "clrmac\n"               // executed in the delay slot of rts
            ".align 1\n"
            ".width:\n"
            "    .short 384\n"
            "     .align 2\n"
            ".vram:\n"
            "    .long -1476395008\n" // 0xA8000000 as a signed 32-bit value
            "    .align 1\n");
}


Perhaps this isn't what you wanted to know, but we can gain almost 25% in time by using it instead of the first routine you posted (the non-inlined one), depending on the purpose.
For this purpose it would perhaps not be better than the inlined version, but in some other cases, I don't know.

So, that's all for this post. I just wanted to show you this and, why not, get some feedback.

PS: sorry for my potentially bad English; I tried to do my best, but I had some hesitation.
Could this be inlined? That might give an even better performance boost.
merthsoft wrote:
Could this be inlined? That might give an even better performance boost.
I'd actually expect simply inlining the original routine to give at least a 25% performance boost. Also, I was the author of that original one; it was written before we had a color_t type. ProgrammerNerd, I believe GCC optimizes my two-byte write pair to be equivalent to a single short write anyway.
Yes, they already said inlining the original gave a performance boost... I was wondering if inlining the ASM would then give an even better performance boost.
KermMartian wrote:
ProgrammerNerd, I believe GCC optimizes my two-byte write pair to be equivalent to a single short write anyway.


After I replaced your routine with his, the CPU usage of the Prizm emulator dropped significantly on the main screen of my Utilities add-in, where the analog clock calls plot() many times per draw and is redrawn several times per second. (The emulator's CPU usage reflects how much of the time the emulated SH CPU spends processing rather than sitting in the simple loop that waits for a key in GetKey.) The fan of my computer used to kick in after a while on that screen; with the new routine it no longer kicks in, or does so at a lower speed.
I have reverted to the old routine again, only to see the fan get to work sooner again. It's back at ProgrammerNerd's one. So there's definitely a difference between the two versions, even after compiler optimizations.
The obvious way to do a scientific comparison is to have GCC emit assembly for each option (-S), and compare them.

I suspect a pure-C version with correct inlining and optimization will perform better than an inline-assembly version, since when inlined the constant folding pass should be able to generate addresses more efficiently rather than recomputing on every call. In degenerate cases where you write to a fixed location, it should optimize to a store to the appropriate address.
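As a starting point for that comparison, here is a minimal sketch with both store variants in one file. The names plot_bytes, plot_short, test_bytes and test_short are made up for this sketch, and the VRAM pointer is passed in as a parameter only so that the code can also run on a PC. Note that the two variants only produce the same bytes in memory on a big-endian target such as the Prizm's SH CPU.

```c
#define LCD_WIDTH_PX 384   /* normally provided by the SDK headers */

/* Byte-pair version: high byte first, then low byte. */
static inline void plot_bytes(unsigned char *vram, int x0, int y0, int color) {
   unsigned char *p = vram + 2 * (y0 * LCD_WIDTH_PX + x0);
   *(p++) = (unsigned char)((color & 0x0000FF00) >> 8);
   *(p++) = (unsigned char)(color & 0x000000FF);
}

/* Single 16-bit store version. */
static inline void plot_short(unsigned short *vram, int x0, int y0,
                              unsigned short color) {
   vram[y0 * LCD_WIDTH_PX + x0] = color;
}

/* Non-inline wrappers, so each variant appears as its own function
   when the file is compiled with optimization and -S. */
void test_bytes(unsigned char *vram, int x, int y, int color) {
   plot_bytes(vram, x, y, color);
}

void test_short(unsigned short *vram, int x, int y, unsigned short color) {
   plot_short(vram, x, y, color);
}
```

Compiling this file with -S at the optimization level used for the add-in and reading the bodies of test_bytes and test_short side by side should show whether the byte pair is really merged into one 16-bit store.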
I'd imagine that this will be even faster. It is untested but should work.

Code:

// this is used to draw scaled stripes of an image on the screen
inline static void drawStripe(int x, int y1, int y2, int lineHeight, int texX) {
   unsigned short * V=(unsigned short*)0xA8000000;
   V+=x;
   V+=y1*384;
   int i;
   for(i = y1; i<y2; i++)
   {
      int d = i * 256 - h * 128 + lineHeight * 128;
      int texY = ((d * texHeight) / lineHeight) / 256;
      unsigned short color = tex1[texHeight * texY + texX];
      *V=color;
      V+=384;
   }
}
I have four different functions which draw to the VRAM; would it be a good idea (e.g. more efficient) to make a global variable that stores the VRAM's memory address instead of recreating it every function call?

Something like this possibly?

Code:

static unsigned short * const VRAM = (unsigned short*) 0xA8000000;
It won't matter; any decent optimizing compiler will replace the usage of the constant with an immediate value in the final code.

Edit: Just make sure it's a const.
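On the placement of const in a declaration like the one above, a small sketch may help. The fake_vram array, the reader pointer and the two helper functions are assumptions for illustration, so that the code can run on a PC instead of writing to 0xA8000000.

```c
/* Stand-in for the real VRAM at 0xA8000000 (assumption for this sketch). */
static unsigned short fake_vram[384 * 216];

/* const after the '*': the pointer itself can never be changed,
   but the pixels it points at stay writable. */
static unsigned short * const VRAM = fake_vram;

/* const before the type: the pixels are read-only through this pointer,
   but the pointer itself may be moved. */
static const unsigned short *reader = fake_vram;

static void setFirstPixel(unsigned short color) {
   VRAM[0] = color;          /* fine: the pointed-to data is not const */
   /* VRAM = VRAM + 1; */    /* would not compile: VRAM is a const pointer */
}

static unsigned short getPixel(int offset) {
   reader = fake_vram + offset;   /* fine: the pointer is not const */
   /* *reader = 0; */             /* would not compile: data is const here */
   return *reader;
}
```

The first form is the one that lets the compiler treat the address as a known constant while still allowing drawing through it.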
When it is created in the drawing functions? Also, did I put the 'const' in the right place there?

Edit:
Never mind about that; I just realized that what I was suggesting doesn't really make any sense in the first place.
Any progress on getting some solid optimizations in, Ygyax? Did you compare the output assembly generated from the two varieties, as Tari suggested?
Yes, there has been some rather significant optimization. I opted to go with the static inlined C version of the pixel plotting routine instead of the assembly one, and I have converted all of the floating point math to fixed point. I also created a sine table for Sine and Cosine calculations. I'd like to release a demo of it soon, as it runs surprisingly well without overclocking, but there is still one major bug I need to fix. I already went into detail about it here though, so I'll leave the discussion about it there.
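For readers wondering what the fixed-point and sine-table changes might look like, here is a minimal sketch. It is not the code from the engine: the 8.8 format, the 16-step table and the names FX_ONE, SINE_STEPS, fx_mul, fx_sin and fx_cos are all assumptions chosen to keep the example short. A real ray-caster would likely want a much finer table.

```c
#define FX_ONE     256   /* 1.0 in 8.8 fixed point (assumed format) */
#define SINE_STEPS 16    /* table entries per full turn; must be a power of two */

/* sin(k * 22.5 degrees) * 256, rounded to the nearest integer. */
static const short sine_table[SINE_STEPS] = {
      0,   98,  181,  237,  256,  237,  181,   98,
      0,  -98, -181, -237, -256, -237, -181,  -98
};

/* Multiply two 8.8 fixed-point values, giving an 8.8 result. */
static inline int fx_mul(int a, int b) {
   return (a * b) / FX_ONE;
}

/* Sine of an angle given in table steps (16 steps = one full turn). */
static inline int fx_sin(int step) {
   return sine_table[step & (SINE_STEPS - 1)];
}

/* Cosine is the sine shifted by a quarter turn. */
static inline int fx_cos(int step) {
   return sine_table[(step + SINE_STEPS / 4) & (SINE_STEPS - 1)];
}
```

For example, fx_mul(fx_cos(2), 2 * FX_ONE) scales a length of 2.0 by cos(45 degrees) using only integer operations.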
  