I'm happy to announce Beta 4! This beta includes initial TI-89 support, remote terminal support (using two calculators and "uterm"), and user login.

Download it now, or read the release notes.

BTW, thanks for the suggestion, Progbeard. I might redesign the font completely at some point, but that is currently a low priority.
I was going to mention something about the font as well, but I'll keep it to myself if that's going to be a future improvement. Nice job on terminal emulation and login.
KermMartian wrote:
I was going to mention something about the font as well, but I'll keep it to myself if that's going to be a future improvement. Nice job on terminal emulation and login.

Thanks. I am really only thinking about changing the font. I don't have any specific plans to do so at this time. What exactly did you have in mind? I'm open to suggestions if they lead to a more readable font (and as long as they fit in with the rest of the glyph set).
I dunno, have you considered either a (second optional, or replacement) 3x5 font? Or having a readability mode where your terminal displays in nice readable 3x4 all-caps?
KermMartian wrote:
I dunno, have you considered either a (second optional, or replacement) 3x5 font? Or having a readability mode where your terminal displays in nice readable 3x4 all-caps?

Do you mean replaceable at runtime from userspace, or by re-compiling? Right now the kernel would have to be recompiled to use a different font. I might add the ability to load fonts dynamically, but that isn't very high priority.

As for a readability mode, I probably won't do that. I'd like to get loadable fonts (see above) working, so the user can load any font they want from userspace. Loadable fonts will be much more flexible than a readability mode. I'll have to work out how to load glyphs for all of the character sets (VT100 supports a few different sets (eg, US, UK, graphics, technical), and so does my terminal emulator). If I choose to support double-width and double-height characters as the VT100 does, that will add more complexity. :/
Christopher, sounds good, I think replaceable fonts would be fine, although needless to say much more complex than compiled-in fonts. I definitely would support you adding that when you get to it.
I recently wrote a 64-bit multiplication routine which takes two 64-bit unsigned integers and outputs a 128-bit unsigned integer. It currently executes in 2326 cycles, about half of which are taken by the sixteen 16x16 bit multiply instructions. I was wondering if anyone on this forum knows where I could find a faster 64-bit multiplication routine for 68k. If anyone could direct me to another forum that is more geared toward the 68k than this site or Omnimaga, that would be great too.
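For reference, the schoolbook scheme such a routine implements can be modeled in portable C like this. This is a host-side sketch with a made-up name (mulu64_ref), not the 68k code itself: sixteen 16x16-to-32 partial products, accumulated by 16-bit word position.

```c
#include <assert.h>
#include <stdint.h>

/* Reference 64x64 -> 128 unsigned multiply built the same way as a
 * sixteen-mulu 68k routine: each 16x16 -> 32 partial product is split
 * into its low and high words and accumulated at the right word offset.
 * Hypothetical host-side model, not the assembly routine itself. */
void mulu64_ref(uint64_t x, uint64_t y, uint64_t *hi, uint64_t *lo)
{
    uint64_t acc[8] = {0};              /* acc[k] holds terms of weight 2^(16k) */

    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            uint32_t p = (uint32_t)(uint16_t)(x >> 16 * i)
                       * (uint32_t)(uint16_t)(y >> 16 * j);
            acc[i + j]     += p & 0xFFFF;   /* low word of the product  */
            acc[i + j + 1] += p >> 16;      /* high word, one slot up   */
        }
    }

    /* propagate carries and repack into two 64-bit halves */
    uint64_t w[8], carry = 0;
    for (int k = 0; k < 8; k++) {
        uint64_t t = acc[k] + carry;
        w[k] = t & 0xFFFF;
        carry = t >> 16;
    }
    *lo = w[0] | w[1] << 16 | w[2] << 32 | w[3] << 48;
    *hi = w[4] | w[5] << 16 | w[6] << 32 | w[7] << 48;
}
```

Using 64-bit accumulators per word slot means none of the sixteen additions can overflow, so carries only have to be resolved once at the end.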

I'd ask on the TICTHQ forum, but it's oriented mostly toward TIGCC, and it looks mostly dead to me. Smile

FWIW, I also posted this on Omnimaga but didn't receive any helpful replies yet: http://ourl.ca/12359
Yeah, the TICT/TIGCC forum is mostly dead. Nowadays, the TI-68k action happens on Omnimaga or #ti.

Please post the code, so that I can have a look, and contact Samuel Stearley to point him to the Omnimaga topic or this one: he's more of a 68000 ASM wizard than I am Smile
I am currently writing another version that uses %d4-%d7 for the accumulated result. It'll be tricky since %d0-%d3 hold the operands themselves, and I need another register or two for temporary calculations. I'm planning on first calculating all of the partial products that need %d3 so that it can be freed up first. I still might have to unload a register into an address register so I have enough data registers available for the second half of the partial products. Also, I won't be able to do as many staggered long additions, but the speed gained by using registers instead of the stack should make up for that.

Anyway, here's the code, which includes the C interface mul64:

Code:
.if 0
HAHAHA! Sweet optimization note: any two longs (as a result of multiplying two words) can be added without carry if they are staggered. The largest product of two 16-bit unsigned values is 0xFFFE0001, so staggering the adds like so will never carry:

FFFE00010000
0000FFFE0001
------------
FFFEFFFF0001

These can be chained like so without carry.
.endif

| this routine currently uses the following partial product layout:
|exploiting the no-carry trick noted above:
|first partial product:
|0055ddff +
|.44ccee.
|second partial product:
|.1166bb. +
|..88aa..
|third partial product:
|..2277.. +
|...99...
|fourth partial product:
|...33...

| multiply unsigned 64-bit integers
| return unsigned 128-bit integer
| input:
|  x = %d0:%d1
|  y = %d2:%d3
| output:
|  x*y = %d0:%d1:%d2:%d3
| 2326 cycles (74 cycles per mulu)
mulu64:
   | save registers
   movem.l   %d4/%a0-%a1,-(%sp)

|// first partial product
|// 0055ddff
|push:
|ff << 0  dh
|dd << 2  df
|55 << 4  bf
|00 << 6  ae

   | ab:cd d0:d1
   | ef:gh d2:d3
   
   | dh
   move.w   %d1,%d4
   mulu.w   %d3,%d4      | dh
   move.l   %d4,-(%sp)

   | df
   move.w   %d1,%d4
   mulu.w   %d2,%d4      | df
   move.l   %d4,-(%sp)

   | bf
   move.w   %d0,%d4
   mulu.w   %d2,%d4      | bf
   move.l   %d4,-(%sp)

   | ae
   swap   %d2
   | ab:cd d0:d1
   | fe:gh d2:d3
   move.l   %d0,%d4
   swap   %d4
   mulu.w   %d2,%d4      | ae
   move.l   %d4,-(%sp)

|
|// .44ccee.
|add without carry:
|ee << 1  dg
|cc << 3  de
|44 << 5  be

   | dg
   move.w   %d1,%d4
   swap   %d3
   | ab:cd d0:d1
   | fe:hg d2:d3
   mulu.w   %d3,%d4      | dg
   add.l   %d4,5*2(%sp)

   | de
   move.w   %d1,%d4
   mulu.w   %d2,%d4      | de
   add.l   %d4,3*2(%sp)

   | be
   move.w   %d0,%d4
   mulu.w   %d2,%d4      | be
   add.l   %d4,1*2(%sp)


|
|// second partial product
|// .1166bb.
|push:
| 0x0 << 0  // don't really need this
|bb << 1  ch
|66 << 3  bg
|11 << 5  af
| 0x0 << 7

   | ch
   swap   %d1
   | ab:dc d0:d1
   | fe:hg d2:d3
   move.l   %d3,%d4
   swap   %d4
   mulu.w   %d1,%d4      | ch
   move.l   %d4,-(%sp)

   | bg
   move.w   %d0,%d4
   mulu.w   %d3,%d4      | bg
   move.l   %d4,-(%sp)

   | af
   swap   %d0
   | ba:dc d0:d1
   | fe:hg d2:d3
   move.l   %d2,%d4
   swap   %d4
   mulu.w   %d0,%d4      | af
   move.l   %d4,-(%sp)

   clr.w   -(%sp)

|
|// ..88aa..
|add without carry:
|aa << 2  cg
|88 << 4  ce

   | cg
   move.w   %d1,%d4
   mulu.w   %d3,%d4      | cg
   add.l   %d4,4*2(%sp)

   | ce
   move.w   %d1,%d4
   mulu.w   %d2,%d4      | ce
   add.l   %d4,2*2(%sp)
|
|add with carry (upper 7 words or 3.5 longs)

   lea   7*2(%sp),%a1      | second partial product
   lea   7*2+7*2(%sp),%a0   | result
   addx.l   -(%a1),-(%a0)
   addx.l   -(%a1),-(%a0)
   addx.l   -(%a1),-(%a0)
   addx.w   -(%a1),-(%a0)

| pop off our second partial product
   lea   7*2(%sp),%sp

|
|// third partial product
|// ..2277..
|push:
|0x00 << 0  // don't really need this
|77 << 2  bh
|22 << 4  ag
|0x00 << 6

   | bh
   move.l   %d0,%d4
   swap   %d4
   swap   %d3
   | ba:dc d0:d1
   | fe:gh d2:d3
   mulu.w   %d3,%d4      | bh
   move.l   %d4,-(%sp)

   | ag
   swap   %d3
   | ba:dc d0:d1
   | fe:hg d2:d3
   move.w   %d0,%d4
   mulu.w   %d3,%d4      | ag
   move.l   %d4,-(%sp)

   clr.l   -(%sp)

|
|// ...99...
|add without carry:
|99 << 3  cf

   | cf
   move.w   %d1,%d4
   swap   %d2
   | ba:dc d0:d1
   | ef:hg d2:d3
   mulu.w   %d2,%d4      | cf
   add.l   %d4,3*2(%sp)


|
|add with carry (upper 6 words or 3 longs)
   lea   6*2(%sp),%a1      | third partial product
   lea   6*2+6*2(%sp),%a0   | result
   addx.l   -(%a1),-(%a0)
   addx.l   -(%a1),-(%a0)
   addx.l   -(%a1),-(%a0)

| pop off the third partial product
   lea   6*2(%sp),%sp

|
|// fourth partial product
|// ...33...
|push:
|0x00 << 0  // don't really need this
|0x0 << 2   // or this
|33 << 3  ah
|0x0 << 5
|0x00 << 6

   | ah
   move.w   %d0,%d4
   swap   %d3
   | ba:dc d0:d1
   | ef:gh d2:d3
   mulu.w   %d3,%d4      | ah
   move.l   %d4,-(%sp)

   clr.w   -(%sp)
   clr.l   -(%sp)

|
|add with carry (upper 5 words or 2.5 longs)
   lea   5*2(%sp),%a1      | fourth partial product
   lea   5*2+5*2(%sp),%a0   | result
   addx.l   -(%a1),-(%a0)
   addx.l   -(%a1),-(%a0)
   addx.w   -(%a1),-(%a0)

| pop off fourth partial product
   lea   5*2(%sp),%sp

   | pop off result and saved registers
   movem.l   (%sp)+,%d0-%d4/%a0-%a1
   rts

.global mul64
| void mul64(int64 *a, int64 *b, int128 *result);
mul64:
   | save
   move.l   %d3,-(%sp)

   move.l   4+4(%sp),%a0
   move.l   4+4+4(%sp),%a1
   movem.l   (%a0)+,%d0-%d1
   movem.l   (%a1)+,%d2-%d3

   jbsr   mulu64

   move.l   4+4+4+4(%sp),%a0
   lea   4*4(%a0),%a0
   movem.l   %d0-%d3,-(%a0)

   | restore
   move.l   (%sp)+,%d3
   rts
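The ".if 0" note at the top of the listing can be sanity-checked on a host compiler. This sketch (relying on GCC/Clang's unsigned __int128 extension; the function name is made up) verifies the worst case: summing chains of maximum 16x16 products, each staggered 16 bits from the last, never carries past the top of the widest term.

```c
#include <assert.h>
#include <stdint.h>

/* Worst-case check of the staggered no-carry trick: the largest 16x16
 * product is 0xFFFF * 0xFFFF = 0xFFFE0001, so a sum of n such products,
 * each shifted 16 bits up from the previous one, must stay below
 * 2^(16*(n-1) + 32), i.e. never carry out of the topmost term.
 * Uses GCC/Clang's unsigned __int128 extension. */
static int staggered_sum_fits(int n)    /* valid for 1 <= n <= 6 in this sketch */
{
    unsigned __int128 sum = 0;
    for (int i = 0; i < n; i++)
        sum += (unsigned __int128)0xFFFE0001u << (16 * i);
    return sum < ((unsigned __int128)1 << (16 * (n - 1) + 32));
}
```

Algebraically the worst case sums to (2^16 - 1) * (2^(16n) - 1), which is always short of 2^(16n + 16), so the trick holds for a chain of any length.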
Quote:
I am currently writing another version that uses %d4-%d7 for the accumulated result. It'll be tricky since %d0-%d3 hold the operands themselves, and I need another register or two for temporary calculations. I'm planning on first calculating all of the partial products that need %d3 so that it can be freed up first. I still might have to unload a register into an address register so I have enough data registers available for the second half of the partial products. Also, I won't be able to do as many staggered long additions, but the speed gained by using registers instead of the stack should make up for that.

Completely agreed Smile
Either kick d3 or d0 out of the way first, so as to have a spare data register.

To avoid operations on "unaligned" doublewords (16 bits in one register, 16 bits in another)... is it possible (untested, wild idea) to perform the computations in the following order?
* on the one side:
|0055ddff +
|..88aa.. +
|..2277..
* on the other side:
|.44ccee. +
|.1166bb. +
|...99...
|...33...
* at the end, add both parts together?
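A host-side model of this split, using GCC's unsigned __int128 as a stand-in for the stacked 128-bit result (the function name is made up for illustration): partial products at even word offsets (the 0055ddff, ..88aa.., and ..2277.. rows) form one accumulator, the odd-offset rows form the other, and there is one carry-propagating add at the end.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the aligned/unaligned split: products whose word offset
 * (i + j) is even go into one accumulator, odd offsets into the other,
 * and the two halves are added once at the end. Hypothetical names;
 * unsigned __int128 is a GCC/Clang extension standing in for the
 * 128-bit result kept on the stack or in registers. */
void mulu64_split(uint64_t x, uint64_t y, uint64_t *hi, uint64_t *lo)
{
    unsigned __int128 aligned = 0, unaligned = 0;

    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            unsigned __int128 p = (uint32_t)(uint16_t)(x >> 16 * i)
                                * (uint32_t)(uint16_t)(y >> 16 * j);
            if (((i + j) & 1) == 0)
                aligned   += p << (16 * (i + j));   /* 0055ddff / 88aa / 2277 side */
            else
                unaligned += p << (16 * (i + j));   /* 44ccee / 1166bb / 99 / 33 side */
        }
    }

    unsigned __int128 r = aligned + unaligned;      /* one final add */
    *hi = (uint64_t)(r >> 64);
    *lo = (uint64_t)r;
}
```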



Two notes on the current routine:
* clr reads the destination operand before clearing it... but with only four occurrences of clr.w/.l, I'm not sure whether saving, clearing with moveq #0, and restoring an additional dn register (so as to use move.w/.l %dn,-(%sp)) would end up being faster than clr.w/.l -(%sp)...
* movem.l from registers to memory is allowed with a (%an) operand, and according to the 68000 Programmer's Reference Manual,
Quote:
Selecting the addressing mode also selects the mode of operation of the MOVEM instruction, and only the control modes, the predecrement mode, and the postincrement mode are valid. If the effective address is specified by one of the control modes, the registers are transferred starting at the specified address, and the address is incremented by the operand length (2 or 4) following each transfer. The order of the registers is from D0 to D7, then from A0 to A7.

So I think that lea 4*4(%a0),%a0; movem.l %d0-%d3,-(%a0) could be movem.l %d0-%d3,(%a0).
Lionel Debroux wrote:
To avoid operations on "unaligned" doublewords (16 bits in one register, 16 bits in another)... is it possible (untested, wild idea) to perform the computations in the following order?
* on the one side:
|0055ddff +
|..88aa.. +
|..2277..
* on the other side:
|.44ccee. +
|.1166bb. +
|...99...
|...33...
* at the end, add both parts together?

That's a great idea. I can calculate the unaligned side (the one starting with .44ccee.) first, push it on the stack with the proper alignment, calculate the aligned side, pop the values off the stack (so they are now aligned), and then add them up. Using the stack once like this shouldn't add too many cycles (unless you had a different idea than using the stack to align the parts), and it should reduce the cycles my current "new" routine spends adding up the unaligned products.


Lionel Debroux wrote:
Two notes on the current routine:
* clr reads the destination operand before clearing it... but with only four occurrences of clr.w/.l, I'm not sure whether saving, clearing with moveq #0, and restoring an additional dn register (so as to use move.w/.l %dn,-(%sp)) would end up being faster than clr.w/.l -(%sp)...

I'm not worried about it since I'm re-writing the whole routine anyway.
Lionel Debroux wrote:
* movem.l from registers to memory is allowed with a (%an) operand, and according to the 68000 Programmer's Reference Manual,
Quote:
Selecting the addressing mode also selects the mode of operation of the MOVEM instruction, and only the control modes, the predecrement mode, and the postincrement mode are valid. If the effective address is specified by one of the control modes, the registers are transferred starting at the specified address, and the address is incremented by the operand length (2 or 4) following each transfer. The order of the registers is from D0 to D7, then from A0 to A7.

So I think that lea 4*4(%a0),%a0; movem.l %d0-%d3,-(%a0) could be movem.l %d0-%d3,(%a0).

Hm, so it does work. Thanks for pointing that out. I tried writing the registers to (%a0) with post-increment, but the assembler said that's an error, so I changed it to pre-decrement after adding an offset to %a0. This isn't in a performance-critical path (it's run once inside the wrapper function so I can test it from C), so it's not a big deal.


UPDATE: I got a working third version using Lionel's aligned/unaligned idea. My second version performed unaligned additions in the registers.

The third version runs at 1824 cycles (counting the bsr) and gives the correct answer, at least for my test inputs. The second version runs at 2010 cycles and does not give the correct answer. I stopped working on the second version when Lionel posted the alignment optimization idea, so I didn't stomp out the bugs.

I think 1824 cycles is pretty good. It's only 640 cycles on top of the 1184 cycles for the multiply instructions (I'm fairly certain that I can't get away with not using 16 mulu instructions, so 1184 cycles is the absolute minimum, not counting the call/return instructions). I can still fine-tune some register usage and trim off a few more cycles here and there, but I can't get it much lower without using yet another optimization trick (which would mean a fourth version of the routine).


UPDATE: I tested the third version with a few more test cases, and it passes with flying colors! If there were any bugs in it, I'm sure they would show up in all of the new tests since they touch every 16-bit word in the 64-bit operands.

Here's one of my tests:
B504F333F9DE6484 * B504F333F9DE6484 =
7FFFFFFFFFFFFFFF8171055344676410

The value 0xB504F333F9DE6484 is the square root of 2 in 1.63 fixed-point, and the product is just a sliver below 2.0 in 2.126 fixed-point (it's below 2 because the operands are rounded down). Normalizing and rounding the product to 1.63 fixed-point leaves all 64 bits set (1.FFFF...), which is 1.9999... in decimal. Rounding this to any lower precision (such as double- or single-precision floating point, or even 8-bit integer) results in exactly 2.0.
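This test vector is easy to reproduce on a host compiler that has GCC/Clang's unsigned __int128 extension (the function name here is made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Reproduces the sqrt(2) test vector above on the host. */
void sqrt2_check(uint64_t *hi, uint64_t *lo, uint64_t *norm)
{
    uint64_t x = 0xB504F333F9DE6484ULL;             /* floor(sqrt(2) * 2^63), 1.63 fixed point */
    unsigned __int128 p = (unsigned __int128)x * x; /* 2.126 fixed point, just under 2.0 */

    *hi = (uint64_t)(p >> 64);
    *lo = (uint64_t)p;
    /* normalize 2.126 -> 1.63: shift left one bit and keep the top 64
     * bits; they all come out set (1.FFFF...), as described above */
    *norm = (uint64_t)((p << 1) >> 64);
}
```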


EDIT: here's a screenshot of my test program:
Good that this wild idea of mine happened to work - wild ideas don't always work Very Happy

Quote:
Using the stack once like this shouldn't add too many cycles (unless you had a different idea than using the stack to align the parts)

Not really Smile

That said... what if you used 3 or 4 address registers to store the first partial product until the final add?
This idea may not work well, though: I can't tell in advance whether it will yield a smaller and faster routine in the end:
* at least one additional register would have to be saved and restored, assuming you're using a0-a2 (a3?), which only takes more time;
* the moves between dn and an (since addx works only on dn,dn or -(an),-(an)) cost size and speed that the current routine spends on memory operations instead.
Lionel Debroux wrote:
Good that this wild idea of mine happened to work - wild ideas don't always work Very Happy

Quote:
Using the stack once like this shouldn't add too many cycles (unless you had a different idea than using the stack to align the parts)

Not really Smile

That said... what if you used 3 or 4 address registers to store the first partial product until the final add?
This idea may not work well, though: I can't tell in advance whether it will yield a smaller and faster routine in the end:
* at least one additional register would have to be saved and restored, assuming you're using a0-a2 (a3?), which only takes more time;
* the moves between dn and an (since addx works only on dn,dn or -(an),-(an)) cost size and speed that the current routine spends on memory operations instead.

You know, I was thinking about putting one of the intermediate values in address registers but realized that I would have to save and restore those registers on the stack anyway, in addition to the extra overhead of shifting/swapping words to align them, plus moving them back to data registers before performing addx at the end.

However, I realized more recently that fpuemu already saves all 16 registers on the stack upon entry, which leaves most of the registers available for subroutines like this 64-bit multiplication routine.

The alignment of the unaligned value (in %d1-%d3) should be as simple as this:

Code:
    sub.l %d0,%d0
    swap %d1
    swap %d2
    swap %d3
    move.w %d1,%d0
    move.w %d2,%d1
    move.w %d3,%d2
    sub.w %d3,%d3

(Not tested, and all that jazz...)

Since I would be changing only the parts that save and restore the intermediate value, I probably won't have to change anything else in the routine either. I'll get on this fourth version, eventually.... Razz
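The swap/move.w sequence above can be modeled in C to check the word movement (a host-side sketch with a made-up name, not a replacement for the assembly):

```c
#include <assert.h>
#include <stdint.h>

/* Host model of the swap sequence above: the 96-bit value held in
 * d1:d2:d3 ends up moved down by one 16-bit word across d0:d1:d2:d3,
 * i.e. aligned 16 bits higher within the 128-bit frame. */
void align16(uint32_t regs[4])   /* regs[0..3] model %d0..%d3; regs[0] is overwritten */
{
    uint32_t d1 = regs[1], d2 = regs[2], d3 = regs[3];
    regs[0] = d1 >> 16;                 /* sub.l %d0,%d0; move.w %d1,%d0 */
    regs[1] = (d1 << 16) | (d2 >> 16);  /* swap %d1; move.w %d2,%d1      */
    regs[2] = (d2 << 16) | (d3 >> 16);  /* swap %d2; move.w %d3,%d2      */
    regs[3] = d3 << 16;                 /* swap %d3; sub.w %d3,%d3       */
}
```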


One last thing regarding the M68881 that I'm emulating: I confirmed that it actually performs calculations at a higher precision than 64 bits. The M68000 Programmer's Reference Manual says that calculations use 1 integer bit, 63 mantissa bits, 1 guard bit, 1 round bit, and 1 sticky bit, so it basically calculates with 67 bits of precision. Unfortunately, implementing this aspect would add more complexity to my emulator, and it would make calculations even slower, for little benefit. I expect that most applications will use only double (53 bit) or single (24 bit) precision anyway, which nullifies the benefits of using higher-precision intermediate values. The 64-bit values still have about 3 decimal digits more precision than the TI-AMS floating-point format. (I'm guessing that TI-AMS doesn't care that much about rounding modes either.)

Now that I think about it, I'm multiplying with 128 bits of precision, so I could implement the extra precision for multiplication, but it's not likely that I can do so for most other operations.
On an unrelated note, I just fixed my stuttering audio problem! The audio interrupt runs 8192 times per second, so if interrupts are blocked for more than 1464 CPU cycles, an audio interrupt is skipped. Even shorter periods of blocking are noticeable as jitter, but that's not as bad as skipping interrupts entirely.

It turns out the biggest latency offender was my showstatus() routine, which updates the status line at the bottom of the screen. This routine blocked all interrupts for a few thousand cycles (yes, it can probably be optimized a bit), and it was called at least twice every time the system transitioned from idle to busy and back to idle, which happened every time the audio interrupt noticed that the audio buffer was getting empty and woke up the process that fills it. I changed showstatus() so that it blocks soft interrupts instead of hardware interrupts.

I added soft interrupts to Punix somewhat recently to reduce hardware interrupt latency, especially for audio. Soft interrupts run in a non-interrupt context inside the kernel. Typically a hardware interrupt handler will do the minimum work needed inside the interrupt handler itself and then defer extra processing to run inside a soft interrupt. This way, other hardware interrupts can run while the soft interrupt is running.
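The pattern reads roughly like this minimal host-side model (hypothetical names, not the actual Punix source): the hardware handler does only the time-critical work with interrupts masked, sets a flag, and the deferred work runs later with hardware interrupts re-enabled.

```c
#include <assert.h>

/* Minimal model of the hard/soft interrupt split described above.
 * Names are made up for illustration. */
static volatile int audio_soft_pending;
static int filler_woken;

static void audio_hard_irq(void)    /* runs with interrupts masked: stay short */
{
    /* ...clock the next audio sample out to the hardware here... */
    audio_soft_pending = 1;         /* defer everything else */
}

static void run_soft_irqs(void)     /* runs in kernel context, hardware interrupts open */
{
    if (audio_soft_pending) {
        audio_soft_pending = 0;
        filler_woken = 1;           /* e.g. wake the process refilling the buffer */
    }
}
```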

Now I have smooth, (mostly) non-jittery audio. Woot! Cool
Very impressive! Are we talking legitimate audio here, or mod/chiptune-style music?
KermMartian wrote:
Very impressive! Are we talking legitimate audio here, or mod/chiptune-style music?

Chiptune-style audio would certainly be the cleanest and have the highest subjective quality, but the audio driver supports arbitrary stereo audio output at 1 bit/8192 Hz resolution. A program can output raw audio data (eg, directly from a wave file), or it can synthesize audio data (eg, from a MIDI or chiptune file, or perhaps synthesized speech), as long as it can produce the audio data fast enough. In fact, I already have a test program in Punix which generates a few seconds of square-wave audio data for each channel (left/right/both), and then it simply calls the write() system call once for each buffer using the file descriptor from opening /dev/audio. Just think of the possibilities! From a shell you could run "cat file.au >/dev/audio" to play a file, or "cat /dev/random >/dev/audio" to play noise. Smile

Over the years I've developed simple programs to convert 16-bit audio to 1-bit with decent quality (considering 1 bit's high noise floor).
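As an illustration of "produce the audio data fast enough", a square-wave generator for a 1-bit, 8192 Hz stream might look like this. The packing order and framing here are my assumptions for the sketch, not the documented /dev/audio format:

```c
#include <stddef.h>
#include <stdint.h>

/* Fill buf with a 1-bit square wave at the given frequency, assuming
 * 8192 samples/s packed LSB-first, one channel. The packing and sample
 * framing are assumptions, not the actual driver interface. A program
 * would then open("/dev/audio", O_WRONLY) and write() the buffer. */
void synth_square(uint8_t *buf, size_t nbytes, unsigned freq_hz)
{
    for (size_t byte = 0; byte < nbytes; byte++) {
        uint8_t b = 0;
        for (unsigned bit = 0; bit < 8; bit++) {
            size_t i = byte * 8 + bit;               /* sample index */
            if ((i * 2 * freq_hz / 8192) % 2 == 0)   /* first half of each period */
                b |= (uint8_t)(1u << bit);
        }
        buf[byte] = b;
    }
}
```

At 1024 Hz each period is exactly 8 samples (4 high, 4 low), so every output byte repeats the same pattern.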

EDIT: here's a video I put together showing the audio in action:
That is very cool indeed =D. This is a pretty impressive project.
elfprince13 wrote:
That is very cool indeed =D. This is a pretty impressive project.
Definitely seconded, excellent work. And do I see some linking near the top of that screenshot?
KermMartian wrote:
elfprince13 wrote:
That is very cool indeed =D. This is a pretty impressive project.
Definitely seconded, excellent work. And do I see some linking near the top of that screenshot?

Yep, it's a linking demo/test. It receives a variable (any variable) and displays its contents to the screen. Obviously this works best with image variables. Smile

Unlike TI-AMS and TI-OS, Punix does not have a linking program built into the kernel. This demo is a regular user program that opens the /dev/link file and reads/writes using the normal read() and write() system calls. The getcalc routine is running completely in user-space (besides the read()/write() system calls, of course). Later on I plan on writing a set of linking programs that can link with other types of calculators, much like TiLP.

For transferring Punix files (programs, data files, etc), I'll probably write a simple program to encapsulate a tar file inside a .9xt (or similar) file so it can be transferred using any TI linking software. This program would run on both Punix and the computer (hopefully without any source code modifications) to wrap and unwrap the files on both sides. I don't really feel like writing Punix-only linking software for the computer when existing linking software can be used.
You should definitely write a /dev/net node that hooks into CALCnet! Although the interface would be quite kludgy, since it's not a point-to-point protocol. Your plans sound great to me thus far.
  