EDIT: I found a bug and I managed to optimize out an average of roughly 600 cycles (I got that number from 500 simulations of randomized inputs and multiplying the eZ80 timings by 6. It might be faster.) The following code has been replaced.

I was optimising some algorithms and code in Grammer and I came across a pretty big speed optimisation for computing 16-bit GCD:

Code:

gcdHL_DE:
;gcd(HL,DE)->HL
;Output:
;   B=0
;   HL is the GCD of the inputs
;Destroys:
;   A,DE
;     DE is guaranteed 0 unless the output is 0 (which only happens if one of the inputs is 0).
;Uses the binary GCD algorithm
;65 bytes

;B is our cofactor-of-2 counter
    ld b,0

;If HL=0, return 0
    ld a,h \ or l \ ret z

;If DE=0, return 0
    ex de,hl
    ld a,h \ or l \ jr nz,gcd_test_cofactor_of_2
    ret

gcd_cofactor_2_loop:
    inc b
    srl h \ rr l
    srl d \ rr e
gcd_test_cofactor_of_2:
    inc b
    ld a,e
    or l
    rra
    jr nc,gcd_cofactor_2_loop

gcd_remove_factors_of_2_op2:
    srl h \ rr l \ jr nc,gcd_remove_factors_of_2_op2
    adc hl,hl
    jr gcd_swap_ops

gcd_swap_ops_negate:
;At this point, HL needs to be negated and swapped with DE
    xor a \ sub l \ ld l,a \ sbc a,a \ sub h \ ld h,a
gcd_swap_ops:
    ex de,hl
gcd_remove_factors_of_2_op1:
    srl h \ rr l \ jr nc,gcd_remove_factors_of_2_op1
    adc hl,hl
    sbc hl,de
    jr c,gcd_swap_ops_negate
    jp nz,gcd_remove_factors_of_2_op1

;DE is the GCD, need to shift it left B-1 times.
    ex de,hl
    dec b
    ret z
    add hl,hl \ djnz $-1
    ret

I was using a slightly more naive approach before, computing the mod of two 16-bit values over and over (the Euclidean Algorithm). This code is pretty hefty, though, at 80 bytes. You can save two bytes by turning the JP instructions into JR, but there will be a tiny speed loss of a few cycles Razz

EDIT Made it 15 bytes smaller Smile Also added comments and labels since the codeflow is a little confusing.
Since the weekend, I have been trying to really optimise my square root algorithms for application in a 24-bit and 80-bit float library. During the process, I saw some major failings in my old routines (and those were still pretty fast). The biggest problem was my huge over estimate of the bit-size of certain intermediate values. After taking that into account, I can offer a new routine that is still being rigorously optimised, but is already quite fast:

Code:

sqrtHL:
;Input: HL
;Output: A
;63 bytes
;749 t-states worst case
   ld bc,600h
   ld d,c
   ld e,c
sqrt16loop:
   add hl,hl \ rl e
   add hl,hl \ rl e
   rl c
   ld a,c
   rla
   sub e
   jr nc,$+5
   inc c
   cpl
   ld e,a
   djnz sqrt16loop

   sla c \ ld a,c \ add a,a
   rl h \ rl e
   rl h \ rl e
   jr nc,$+6
   sub e \ jp $+6
   sub e
   jr nc,$+5
   inc c
   cpl
   ld e,a
   
   ld l,c
      ld a,l
   add hl,hl \ rl e \ rl d
   add hl,hl \ rl e \ rl d
   sbc hl,de
   rla
   ret

The biggest drawback is the size, since the last two iterations of the algorithm are handled outside the loop.
However, if you really want speed, this unrolled+optimized version is 31 bytes larger and pretty darn fast at 391cc worst case:

Code:

sqrtHL:
;returns A as the sqrt, HL as the remainder, D = 0
;min: 352cc
;max: 391cc
;avg: 371.5cc
;94 bytes


  ld de,05040h  ; 10
  ld a,h        ; 4
  sub e         ; 4
  jr nc,sq7     ;\
  add a,e       ; | branch 1: 12cc
  ld d,16       ; | branch 2: 18cc
sq7:            ;/

; ----------

  cp d          ; 4
  jr c,sq6      ;\
  sub d         ; | branch 1: 12cc
  set 5,d       ; | branch 2: 19cc
sq6:            ;/

; ----------
  res 4,d       ; 8
  srl d         ; 8
  set 2,d       ; 8
  cp d          ; 4
  jr c,sq5      ;\
  sub d         ; | branch 1: 12cc
  set 3,d       ; | branch 2: 19cc
sq5:            ;/
  srl d         ; 8

; ----------

  inc a         ; 4
  sub d         ; 4
  jr nc,sq4     ;\
  dec d         ; | branch 1: 12cc
  add a,d       ; | branch 2: 19cc
  dec d         ; | <-- this resets the low bit of D, so `srl d` resets carry.
sq4:            ;/
  srl d         ; 8
  ld h,a        ; 4

; ----------

  ld a,e        ; 4
  sbc hl,de     ; 15
  jr nc,sq3     ;\
  add hl,de     ; | 12cc or 18cc
sq3:            ;/
  ccf           ; 4
  rra           ; 4
  srl d         ; 8
  rra           ; 4

; ----------

  ld e,a        ; 4
  sbc hl,de     ; 15
  jr c,sq2      ;\
  or 20h        ; | branch 1: 23cc
  db 254        ; |   <-- start of `cp *` which is 7cc to skip the next byte.
sq2:            ; | branch 2: 21cc
  add hl,de     ;/

  xor 18h       ; 7
  srl d         ; 8
  rra           ; 4

; ----------

  ld e,a        ; 4
  sbc hl,de     ; 15
  jr c,sq1      ;\
  or 8          ; | branch 1: 23cc
  db 254        ; |   <-- start of `cp *` which is 7cc to skip the next byte.
sq1:            ; | branch 2: 21cc
  add hl,de     ;/

  xor 6         ; 7
  srl d         ; 8
  rra           ; 4

; ----------

  ld e,a        ; 4
  sbc hl,de     ; 15
  jr nc,+_      ;    \
  add hl,de     ; 15  |
  srl d         ; 8   |
  rra           ; 4   | branch 1: 38cc
  ret           ; 10  | branch 2: 40cc
_:              ;     |
  inc a         ; 4   |
  srl d         ; 8   |
  rra           ; 4   |
  ret           ; 10 /


For comparison, the first is comparable to the more elegant/famous version I keep running into as I look for 'the best 16-bit square root routine', and the unrolled version is faster than any that I have run into yet.

Feel free to optimise because I am confident that there are still enough parts of code to change to improve the speed and size.
EDIT: updated the unrolled code to be over 25% faster and about 12 bytes smaller!
EDIT 9/9/19 : Optimized the unrolled version even more!
Not sure where it goes, but since the 84+C thread was filled with discussion about hardware, I didn't want to go off-topic.

So, here's a rather complete sprite drawing routine, with custom size (up to 256*256), custom coordinates, custom palette, no clipping, and that works with the regular TI-OS-set LCD windowing settings :
Code:
; BC:X coordinate
; DE:Y coordinate
; H:width
; L:height
; IX:sprite
; palette:label must refer to your custom up-to-256-bytes palette
drawSprite:
   push   de
   push   bc
   ld   a, l
   push   af

   ld   a, 20h
   out   (10h), a
   out   (10h), a
   ld   a, d
   out   (11h), a
   ld   a, e
   out   (11h), a
   ld   a, 21h
   out   (10h), a
   out   (10h), a
   ld   a, b
   out   (11h), a
   ld   a, c
   out   (11h), a
   ld   a, 22h
   out   (10h), a
   out   (10h), a
   
   ld   a, h
drawSpriteXloop:
   push   af
   push   hl
   ld   l, (ix + 0)
   inc   ix
   ld   h, 0
   add   hl, hl
   push   de
   ld   de, palette
   add   hl, de
   ld   a, (hl)
   inc   hl
   ld   h, (hl)
   ld   l, a
   pop   de
   ld a, h
   out (11h), a
   ld a, l
   out (11h), a
   pop   hl
   pop   af
   dec   a
   jr   nz, drawSpriteXloop
   
   inc   de
   pop   af
   dec   a
   jr   nz, drawSprite + 3
   
   pop   bc
   pop   de
   ret

Pretty sure it can be heavily optimized. I'm not that good at ASM.
That's pretty nice, matrefeytontias; good work. Smile If you're using Doors CSE, you can also use our highly-optimized 1-bit, 2-bit, 4-bit, 4-bit with 2x enlarging, or 8-bit routines. Runer112, who helped me frame these routines and figure out how to structure the arguments, has an eventual goal of finishing his clipping versions of these routines.
For my next trick, je vous presentez, logarithme naturel (in fixed point 8.8 format):

Code:

lognat:
;Input:  H.L needs to be on [1,2]
;Output: H.L if z flag is set, else if nz, no result
;avg speed: 677 t-states
   dec h
      dec h
   jr nz,$+8
   inc l \ dec l
   ret nz
   ld l,177
   ret
   inc h
   ret nz
   ld b,h
   ld c,l
   ld e,l
   ld d,8
   add hl,hl
   add hl,hl
   add hl,de
   ex de,hl

   add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
   add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
   add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
   add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
   add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
   add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
   add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
   add hl,hl \ sbc hl,de \ adc a,a

   ld h,a \ ld l,b
   sla h \ jr c,$+3 \ ld l,c
   add hl,hl \ jr c,$+3 \ add hl,bc
   add hl,hl \ jr c,$+3 \ add hl,bc
   add hl,hl \ jr c,$+3 \ add hl,bc
   add hl,hl \ jr c,$+3 \ add hl,bc
   add hl,hl \ jr c,$+3 \ add hl,bc
   add hl,hl \ jr c,$+3 \ add hl,bc
   add hl,hl \ jr c,$+3 \ add hl,bc
   rl l
   ld a,h
   adc a,b
   ld l,a
   ld h,b
   cp a
   ret

It only works accurately on [1,2], where I designed the algorithm. It is pretty fast, returning the result in 677 t-states or less, but its range is fairly limited (to 257 values).

I have been working on another routine that actually works on (0,128), but it takes upwards of about 1500 t-states and there are some accuracy issues I want to fix before posting.
EDIT:
I have the routine with a little better speed outlook and much better accuracy outside the [1,2] range than before (worst case is 2/256).

Code:

lognat:
;Input:  H.L needs to be on (0,128.0)
;Output: H.L if c flag set
;    returns nc if input is negative (HL not modified)
;Error:
;   The error on the outputs is as follows:
;      20592 inputs are exact
;      12075 inputs are off by 1/256
;      100 inputs are off by 2/256
;   So all 32767 inputs are within 2/256, with average error being <1/683 which is smaller than 1/256.
;Size: 177 bytes
;Speed: average speed is less than 1250 t-states

   ld a,h \ or l \ jr nz,$+5
   ld h,80h \ ret
   dec h
   dec h
   jr nz,$+9
   inc l \ dec l
   jr nz,normalizeln-1
   ld l,177
   ret
   inc h
   jr nz,normalizeln
   ld b,h
   ld c,l
   ld e,l
   ld d,8
   add hl,hl
   add hl,hl
   add hl,de
   ex de,hl
   call HL_Div_DE
   ld h,a \ ld l,b
   sla h \ jr c,$+3 \ ld l,c
   add hl,hl \ jr c,$+3 \ add hl,bc
   add hl,hl \ jr c,$+3 \ add hl,bc
   add hl,hl \ jr c,$+3 \ add hl,bc
   add hl,hl \ jr c,$+3 \ add hl,bc
   add hl,hl \ jr c,$+3 \ add hl,bc
   add hl,hl \ jr c,$+3 \ add hl,bc
   add hl,hl \ jr c,$+3 \ add hl,bc
   rl l
   ld a,h
   adc a,b
   ld h,b
   ld l,a
   scf
   ret
HL_Div_DE:
   add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
   add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
   add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
   add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
   add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
   add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
   add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
   add hl,hl \ sbc hl,de \ adc a,a \ ret

   inc h
normalizeln:
   xor a
   inc h \ ret m
   ld d,a \ ld e,a
   ld a,l
   jr z,toosmall
   inc e \ srl h \ rra \ jr nz,$-4
   rla \ rl h
   dec e
stepin:
   ld l,a
   push de
   call lognat
   pop de
   ;now multiply DE by 355, then divide by 2 (rounding)
   ld b,d \ ld c,e \ ld a,d
   ex de,hl
   add hl,hl
   add hl,hl   ;4
   add hl,bc   ;5
   add hl,hl   ;10
   add hl,bc   ;11
   add hl,hl   ;22
   add hl,hl
   add hl,hl
   add hl,hl
   add hl,bc
   add hl,hl
   add hl,bc
   sra h \ rr l
   adc hl,de
   scf
   ret
toosmall:
   dec d
   dec e \ add a,a \ jr nc,$-2
   inc h
   jp stepin

EDIT2:
Here are a bunch of fixed point 8.8 routines, mostly optimised for speed.

B.C*D.E→H.L

Code:

BC_Times_DE:
;  BHLA is the 32-bit result
    ld a,b
    or a
    ld hl,0
    ld b,h
;1
    add a,a
    jr nc,$+4
    ld h,d
    ld l,e
;2
    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,b
;227+10b-7p
    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,b

    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,b

    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,b

    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,b

    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,b

    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,b

;===
;AHL is the result of B*DE*256
    push hl
    ld h,b
    ld l,b
    ld b,a
    ld a,c
    ld c,h
;1
    add a,a
    jr nc,$+4
    ld h,d
    ld l,e
;2
    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,c
;227+10b-7p
    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,c

    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,c

    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,c

    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,c

    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,c

    add hl,hl
    rla
    jr nc,$+4
    add hl,de
    adc a,c

    pop de
;Now BDE*256+AHL
    ld c,a
    ld a,l
    ld l,h
    ld h,c
    add hl,de
    ret nc
    inc b
;BHLA is the 32-bit result
    ret


D.E/B.C→D.E

Code:

DE_Div_BC_88:
;Inputs:
;     DE,BC are 8.8 Fixed Point numbers
;Outputs:
;     DE is the 8.8 Fixed Point result (rounded to the least significant bit)
     ld a,8
     ld hl,0
Loop1:
     rl d
     adc hl,hl
     sbc hl,bc
     jr nc,$+3
    add hl,bc
     dec a
     jr nz,Loop1

     ld d,e
     ld e,a
     ld a,16
    jp $+6
DivLoop:
     add hl,bc
     dec a
     ret z

     sla e
     rl d
     adc hl,hl
    jr c,overflow
     sbc hl,bc
     jr c,DivLoop
     inc e
     jp DivLoop+1
overflow:
    or a
    sbc hl,bc
    inc e
    jp DivLoop

sqrt(H.L)→H.L

Code:

SqrtHL_prec12:
;input: HL
;Output: HL
;183 bytes
    xor a
    ld b,a

    ld e,l
    ld l,h
    ld h,a

    add hl,hl
    add hl,hl
    cp h
    jr nc,$+5
    dec h
    ld a,4

    add hl,hl
    add hl,hl
    ld c,a
    sub h
    jr nc,$+6
    cpl
    ld h,a
    inc c
    inc c

    ld a,c
    add hl,hl
    add hl,hl
    add a,a
    ld c,a
    sub h
    jr nc,$+6
    cpl
    ld h,a
    inc c
    inc c

    ld a,c
    add hl,hl
    add hl,hl
    add a,a
    ld c,a
    sub h
    jr nc,$+6
    cpl
    ld h,a
    inc c
    inc c

    ld a,c
    ld l,e

    add hl,hl
    add hl,hl
    add a,a
    ld c,a
    sub h
    jr nc,$+6
    cpl
    ld h,a
    inc c
    inc c

    ld a,c
    add hl,hl
    add hl,hl
    add a,a
    ld c,a
    sub h
    jr nc,$+6
    cpl
    ld h,a
    inc c
    inc c

    ld a,c
    add a,a \ ld c,a
    add hl,hl
    add hl,hl
    jr nc,$+6
    sub h \ jp $+6
    sub h
    jr nc,$+6
    inc c \ inc c
    cpl
    ld h,a

    ld a,l
    ld l,h
    add a,a
    ld h,a
    adc hl,hl
    adc hl,hl
    sll c \ rl b
    sbc hl,bc
    jr nc,$+3
    add hl,bc
    sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a

;iteration 9
    add hl,hl \ add hl,hl
    sll c \ rl b
    sbc hl,bc
    jr nc,$+3
    add hl,bc
    sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a

    add hl,hl \ add hl,hl
    sll c \ rl b
    sbc hl,bc
    jr nc,$+3
    add hl,bc
    sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a

    add hl,hl \ add hl,hl
    sll c \ rl b
    sbc hl,bc
    jr nc,$+3
    add hl,bc
    sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a

    add hl,hl \ add hl,hl
    sll c \ rl b
    sbc hl,bc
    jr nc,$+3
    add hl,bc
    sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a
;12th iteration completed
; output in BC
    srl b \ rr c
    ld h,b
    ld l,c
    ret

2H.L→DEH.L

Code:

pow2:
;Inputs:
;     HL is the 8.8 fixed point number 'x' for 2^x
;Outputs:
;     DEHL is the 24.8 fixed point result. If there was overflow exceeding 2^24, then this value is set to the max.
     ld a,l
     or a
     push hl     ;save H for later, H is the integer part of the power
     ld hl,1
     jr z,integerpow
     scf      ;set the carry flag so that a bit is rotated into a. This will act as our counter.
;wait until we come across the lowest bit. Also note that we
     rra
     jr nc,$-1
     ld hl,2*256
powloop:
     push af
     call FPSqrtHL    ;returns in HL
     pop af
     srl a
     jr z,integerpow
     jr nc,powloop
     add hl,hl
     jp powloop
integerpow:
     pop bc
;Now b is the integer part for 2^x that we need to multiply HL by.
     ld de,0
     ld a,b
     or a
     ret z

     add hl,hl
     rl e \ rl d \ jr c,wayoverflow
     djnz $-7
     ret
wayoverflow:
     ld hl,-1
     ld d,h
     ld e,l
     ret

log2(H.L)→H.L

Code:

Log_2_88_size:
;Inputs:
;     HL is an unsigned 8.8 fixed point number.
;Outputs:
;     HL is the signed 8.8 fixed point value of log base 2 of the input.
;Example:
;     pass HL = 3.0, returns 1.58203125 (actual is ~1.584962501...)
;averages about 39 t-states slower than original
;62 bytes
     ex de,hl
     ld hl,0
     ld a,d
     ld c,8
     or a
     jr z,DE_lessthan_1
     srl d
     jr z,logloop-1
     inc l
     rr e
     jr $-7
DE_lessthan_1:
     ld a,e
     dec hl
     or a
     ret z
     inc l
     dec l
     add a,a
     jr nc,$-2
     ld e,a

     inc d
logloop:
     add hl,hl
     push hl
     ld h,d
     ld l,e
     ld a,e
     ld b,8

     add hl,hl
     rla
     jr nc,$+5
       add hl,de
       adc a,0
     djnz $-7

     ld e,h
     ld d,a
     pop hl
     rr a         ;this is NOT supposed to be rra, we need the z flag affected
     jr z,$+7
       srl d
       rr e
       inc l
     dec c
     jr nz,logloop
     ret

sin(D.E)→H.L (uses the BC_Times_DE routine above)

Code:

ine_88:
;Inputs: de
    push de
    sra d \ rr e
    ld b,d \ ld c,e
    call BC_Times_DE
    push hl     ;x^2/4
    sra h \ rr l
    ex de,hl
    ld b,d \ ld c,e
    call BC_Times_DE
    sra h \ rr l
    inc h
    ex (sp),hl    ;x^4/128+1 is on stack, HL=x^2/4
    xor a
    ld d,a
    ld b,h
    ld c,l
    add hl,hl \ rla
    add hl,hl \ rla
    add hl,bc \ adc a,d
    ld b,h
    ld c,l
    add hl,hl \ rla
    add hl,hl \ rla
    add hl,hl \ rla
    add hl,hl \ rla
    add hl,bc \ adc a,d
    ld e,l
    ld l,h
    ld h,a
    rl e
    adc hl,hl
    rl e
    jr nc,$+3
    inc hl

    pop de
    ex hl,de
    or a
    sbc hl,de
    ex de,hl
    pop bc
    jp BC_Times_DE

EDIT3: If you are willing to use a 256 byte LUT, this is a much, much faster fixed point lg() routine that I thought of on my walk home:

Code:

lg_88:
;Input: HL is a fixed point number
;Output: lg(H.L)->H.L
;Speed: Avg: 340
   ld de,lgLUT
   ld b,0
   ld a,h
   or a
   ret m
   ld a,l
   jr z,$+8
   inc b \ srl h \ rra \ jr nz,$-4
   or a \ jr nz,$+6
   ld hl,8000h \ ret
   rra \ inc b \ jr nc,$-2
;A is the element to look up in the LUT
   ld l,a
    ld c,h
    dec b
   add hl,hl
   add hl,de
   ld e,(hl)
   inc hl
   ld d,(hl)
    ex de,hl
   add hl,bc
   ret
lglut:
.dw $F800
.dw $F996
.dw $FA52
.dw $FACF
.dw $FB2C
.dw $FB76
.dw $FBB3
.dw $FBE8
.dw $FC16
.dw $FC3F
.dw $FC64
.dw $FC86
.dw $FCA5
.dw $FCC1
.dw $FCDC
.dw $FCF4
.dw $FD0B
.dw $FD21
.dw $FD36
.dw $FD49
.dw $FD5C
.dw $FD6D
.dw $FD7E
.dw $FD8E
.dw $FD9D
.dw $FDAC
.dw $FDBA
.dw $FDC8
.dw $FDD5
.dw $FDE2
.dw $FDEE
.dw $FDFA
.dw $FE06
.dw $FE11
.dw $FE1C
.dw $FE26
.dw $FE31
.dw $FE3B
.dw $FE44
.dw $FE4E
.dw $FE57
.dw $FE60
.dw $FE69
.dw $FE71
.dw $FE7A
.dw $FE82
.dw $FE8A
.dw $FE92
.dw $FE9A
.dw $FEA1
.dw $FEA9
.dw $FEB0
.dw $FEB7
.dw $FEBE
.dw $FEC5
.dw $FECB
.dw $FED2
.dw $FED8
.dw $FEDF
.dw $FEE5
.dw $FEEB
.dw $FEF1
.dw $FEF7
.dw $FEFD
.dw $FF03
.dw $FF09
.dw $FF0E
.dw $FF14
.dw $FF19
.dw $FF1E
.dw $FF24
.dw $FF29
.dw $FF2E
.dw $FF33
.dw $FF38
.dw $FF3D
.dw $FF42
.dw $FF47
.dw $FF4B
.dw $FF50
.dw $FF55
.dw $FF59
.dw $FF5E
.dw $FF62
.dw $FF67
.dw $FF6B
.dw $FF6F
.dw $FF74
.dw $FF78
.dw $FF7C
.dw $FF80
.dw $FF84
.dw $FF88
.dw $FF8C
.dw $FF90
.dw $FF94
.dw $FF98
.dw $FF9B
.dw $FF9F
.dw $FFA3
.dw $FFA7
.dw $FFAA
.dw $FFAE
.dw $FFB2
.dw $FFB5
.dw $FFB9
.dw $FFBC
.dw $FFC0
.dw $FFC3
.dw $FFC6
.dw $FFCA
.dw $FFCD
.dw $FFD0
.dw $FFD4
.dw $FFD7
.dw $FFDA
.dw $FFDD
.dw $FFE0
.dw $FFE4
.dw $FFE7
.dw $FFEA
.dw $FFED
.dw $FFF0
.dw $FFF3
.dw $FFF6
.dw $FFF9
.dw $FFFC
.dw $FFFF

It averages about 340 t-states and should be accurate to every bit. Being this fast, it might be better to combine this with the very fast multiplication routine above to get log base x Wink Or for example, you can incorporate a natural log routine by simply multiplying lg(x) by 1/lg(e) (which is approximately 355/512):

Code:

ln_88:
;Input: HL is a fixed point number
;Output: ln(H.L)->H.L
;Speed: Avg: 340+(325 worst case)
   call lg_88
   ;now signed multiply HL by 355, then divide by 2 (rounding)
    ld de,0
    bit 7,h
    jr z,$+9
    dec e \ xor a \ sub l \ ld l,a
    sbc a,a \ sub h \ ld h,a
    ld b,h
    ld c,l
    xor a
   add hl,hl
      add hl,hl \ rla
   add hl,bc \ adc a,d
   add hl,hl \ rla
   add hl,bc \ adc a,d
   add hl,hl \ rla
   add hl,hl \ rla
   add hl,hl \ rla
   add hl,hl \ rla
   add hl,bc \ adc a,d
   add hl,hl \ rla
   add hl,bc \ adc a,d
    sra a \ rr h
    ld l,h
    ld h,a
    inc e
    ret nz
    xor a \ sub l \ ld l,a
    sbc a,a \ sub h \ ld h,a
    ret

The overhead being at most 325 t-states, bringing it to an average of 665 t-states for ln() on (0,128)
To add to the table-driven routines, this is a much faster arctan routine, using a 158-byte table (it is <800 t-states on values >1, and about 350 for values on [0,1]) It is also accurate in every bit for most values, but I imagine there are a few that are off in the last bit:

Code:

arctan_88:
;Input:
;   D.E
;Output: atan(D.E)->D.E
   push de
   ld a,d
   or a
   jp p,$+5
   neg
   ld d,a
   dec a
   jr nz,checkneedinv
   inc e \ dec e \ jr nz,checkneedinv
   pop af \ rla \ ld de,201 \ ret nc \ ld de,-201 \ ret
checkneedinv:
   inc a
   call nz,DEgt1_Inv
;0.E is the value to atan
   ld hl,adjustatan
   push hl
   ld a,e
   cp 46 \ ret c
   dec a \ cp 42h \ ret c
   dec a \ cp 4Eh \ ret c
   dec a \ cp 57h \ ret c
   dec a \ cp 5Eh \ ret c
   dec a \ cp 64h \ ret c
   dec a \ cp 6Ah \ ret c
   dec a \ cp 6Fh \ ret c
   sub 6Fh \ ld e,a
   ld hl,atanlut
   add hl,de
   ld a,(hl)
   ret
adjustatan:
   ld e,a
   pop bc
   ld a,b
   or a
   jp p,$+5
   neg
   jr z,$+9
   ld hl,402
   or a
   sbc hl,de
   ex de,hl
   rl b
   ret nc
   xor a
   sub e
   ld e,a
   sbc a,a
   sub d
   ld d,a
   ret


DEgt1_Inv:
;Works if DE>1
   ld hl,256
   ld b,8
InvLoop:
   add hl,hl
   sbc hl,de
   jr nc,$+3
   add hl,de
   adc a,a
   djnz InvLoop
    cpl
   ld e,a
    ld d,b
    ret
atanlut:
.db $6F
.db $6F
.db $70
.db $71
.db $72
.db $73
.db $73
.db $74
.db $75
.db $76
.db $77
.db $77
.db $78
.db $79
.db $7A
.db $7B
.db $7B
.db $7C
.db $7D
.db $7E
.db $7F
.db $7F
.db $80
.db $81
.db $82
.db $82
.db $83
.db $84
.db $85
.db $85
.db $86
.db $87
.db $88
.db $88
.db $89
.db $8A
.db $8B
.db $8B
.db $8C
.db $8D
.db $8E
.db $8E
.db $8F
.db $90
.db $90
.db $91
.db $92
.db $93
.db $93
.db $94
.db $95
.db $95
.db $96
.db $97
.db $97
.db $98
.db $99
.db $9A
.db $9A
.db $9B
.db $9C
.db $9C
.db $9D
.db $9E
.db $9E
.db $9F
.db $A0
.db $A0
.db $A1
.db $A2
.db $A2
.db $A3
.db $A3
.db $A4
.db $A5
.db $A5
.db $A6
.db $A7
.db $A7
.db $A8
.db $A9
.db $A9
.db $AA
.db $AA
.db $AB
.db $AC
.db $AC
.db $AD
.db $AD
.db $AE
.db $AF
.db $AF
.db $B0
.db $B0
.db $B1
.db $B2
.db $B2
.db $B3
.db $B3
.db $B4
.db $B5
.db $B5
.db $B6
.db $B6
.db $B7
.db $B7
.db $B8
.db $B9
.db $B9
.db $BA
.db $BA
.db $BB
.db $BB
.db $BC
.db $BC
.db $BD
.db $BE
.db $BE
.db $BF
.db $BF
.db $C0
.db $C0
.db $C1
.db $C1
.db $C2
.db $C2
.db $C3
.db $C3
.db $C4
.db $C4
.db $C5
.db $C6
.db $C6
.db $C7
.db $C7
.db $C8
.db $C8
.db $C9

It is looking like table methods are the best option for 8.8 fixed point, especially if you can reduce the range of needed values to be on [0,1] Razz (see the above post, last edit for very fast ln() and lg() using a table)

Similarly, sine and cosine can be generated from a single 201 byte array.
I wrote a pseudo-random number generator with period 4292870399 today that I was hoping would be comparable to the version used by the OS (and it appears to be so). It takes approximately 1592 t-states making it much faster than using the OS bcalls and generates a 3-byte random integer:

Code:

Rand24:
;Inputs: seed1,seed2
;Outputs:
;   AHL is the pseudo-random number
;   seed1,seed2 incremented accordingly
;Destroys: BC,DE
;Notes:
; seed1*243+83 mod 65519 -> seed1
; seed2*251+43 mod 65521 -> seed2
; Output seed1*seed2
   ld hl,(seed1)
   xor a
   ld b,h \ ld c,l
   ld de,83
   add hl,hl \ rla      ;2
   add hl,bc \ adc a,d   ;3
   add hl,hl \ rla      ;6
   add hl,bc \ adc a,d   ;7
   add hl,hl \ rla      ;14
   add hl,bc \ adc a,d   ;15
   add hl,hl \ rla      ;30
   add hl,hl \ rla      ;60
   add hl,hl \ rla      ;120
   add hl,bc \ adc a,d   ;121
   add hl,hl \ rla      ;242
   add hl,bc \ adc a,d   ;243
   add hl,de \ adc a,d   ;243x+83
;now AHL mod 65519
; Essentially, we can at least subtract A*65519=A(65536-17), so add A*17 to HL
   ex de,hl
   ld l,a
   ld b,h
   ld c,l
   add hl,hl
   add hl,hl
   add hl,hl
   add hl,hl
   add hl,bc
   add hl,de
   ld de,65519
   jr nc,$+5
   or a \ sbc hl,de
   or a \ sbc hl,de
   jr nc,$+3
   add hl,de
   ld (seed1),hl
;seed1 done, now we need to do seed2
   ld hl,(seed2)
; seed1*243+83 mod 65519 -> seed1
; seed2*251+43 mod 65521 -> seed2
; Output seed1*seed2
   xor a
   ld b,h \ ld c,l
   ld de,43
   add hl,hl \ rla      ;2
   add hl,bc \ adc a,d   ;3
   add hl,hl \ rla      ;6
   add hl,bc \ adc a,d   ;7
   add hl,hl \ rla      ;14
   add hl,bc \ adc a,d   ;15
   add hl,hl \ rla      ;30
   add hl,bc \ adc a,d   ;31
   add hl,hl \ rla      ;62
   add hl,hl \ rla      ;124
   add hl,bc \ adc a,d   ;125
   add hl,hl \ rla      ;250
   add hl,bc \ adc a,d   ;251
   add hl,de \ adc a,d   ;251x+83
;now AHL mod 65521
; Essentially, we can at least subtract A*65521=A(65536-15), so add A*15 to HL
   ex de,hl
   ld l,a
   ld b,h
   ld c,l
   add hl,hl
   add hl,hl
   add hl,hl
   add hl,hl
   sbc hl,bc
   add hl,de
   ld de,65521
   jr nc,$+5
   or a \ sbc hl,de
   or a \ sbc hl,de
   jr nc,$+3
   add hl,de
   ld (seed2),hl
;now seed1 and seed 2 are computed
   ld bc,(seed1)
   ex de,hl
;now do BC_times_DE
   call BC_Times_DE
      ex de,hl
   ld l,b
   ld h,0
   ld b,h
   ld c,l
   add hl,hl
   add hl,hl
   add hl,bc
   add hl,hl
   add hl,hl
   add hl,bc
   add hl,hl
   add hl,bc
   ld c,d
   ld d,e
   ld e,a
   ld a,c
   sbc hl,bc
   sbc a,b
   ret nc
   ld c,43
   add hl,bc
   ret
BC_Times_DE:
;BHLA is the result
   ld a,b
   or a
   ld hl,0
   ld b,h
;1
   add a,a
   jr nc,$+4
   ld h,d
   ld l,e
;2
   add hl,hl
   rla
   jr nc,$+4
   add hl,de
   adc a,b
;227+10b-7p
   add hl,hl
   rla
   jr nc,$+4
   add hl,de
   adc a,b

   add hl,hl
   rla
   jr nc,$+4
   add hl,de
   adc a,b

   add hl,hl
   rla
   jr nc,$+4
   add hl,de
   adc a,b

   add hl,hl
   rla
   jr nc,$+4
   add hl,de
   adc a,b

   add hl,hl
   rla
   jr nc,$+4
   add hl,de
   adc a,b

   add hl,hl
   rla
   jr nc,$+4
   add hl,de
   adc a,b

;===
;AHL is the result of B*DE*256
   push hl
   ld h,b
   ld l,b
   ld b,a
   ld a,c
   ld c,h
;1
   add a,a
   jr nc,$+4
   ld h,d
   ld l,e
;2
   add hl,hl
   rla
   jr nc,$+4
   add hl,de
   adc a,c
;227+10b-7p
   add hl,hl
   rla
   jr nc,$+4
   add hl,de
   adc a,c

   add hl,hl
   rla
   jr nc,$+4
   add hl,de
   adc a,c

   add hl,hl
   rla
   jr nc,$+4
   add hl,de
   adc a,c

   add hl,hl
   rla
   jr nc,$+4
   add hl,de
   adc a,c

   add hl,hl
   rla
   jr nc,$+4
   add hl,de
   adc a,c

   add hl,hl
   rla
   jr nc,$+4
   add hl,de
   adc a,c

   pop de
;Now BDE*256+AHL
   ld c,a
   ld a,l
   ld l,h
   ld h,c
   add hl,de
   ret nc
   inc b
;BHLA is the 32-bit result
   ret
seed1:
   .dw 0
seed2:
   .dw 0

Naturally, I plan to use this in my math libraries. To quickly describe the algorithm, I use 2 LCGs with relatively prime period lengths, generate the next terms for each of them, and multiply the results for the output (so if you really wanted to, you could get 31 bits out of BHLA, but the upper bit shouldn't be very reliable).

EDIT: Modified to perform a final mod 16777259 to provide better results.
What does your distribution look like? Seems as though you'll only be generating random composites.
Yes, that is what would be happening if you used the full 32-bit output of the multiplication in BHLA. That is one reason for why I suggested using bits from the lower 24 bits. Using bits 8 to 23 (in HL), I have generated some primes, such as 9767, just now.
Can you run it a bunch of times and show us the results? Maybe a nice pretty graph!
Yeah, a histogram would be great Smile
I made a quick graph:

I know nothing is labeled. I made it so the range was 0-499. X axis is the value returned by the random numbers, Y axis is the number of that number returned. 50000 numbers were generated.

As a comparison, here's the same with .net's random:
Yeah, I noticed some problems, too. Instead of just taking the lower 24 bits, it seems that it is better to mod the final result by an odd number (I tried the prime 16777259 and it yields better results). Currently, it is about 3 times more likely to return an even number compared to an odd number, but moding the final result by an odd number balances the output. It helps that large mods are relatively easy to perform.
Here it is with that:
Much better! Although since division is so expensive, what if you just drop the lowest 3 bits of that original output? What happens then?
Wouldn't that make it so the number 1 is never returned?
Actually, there would be no division in the sense you may be thinking about. The routine that I posted does two non-trivial modulos on 24-bit numbers and a 16x16->32 multiplication all in under 1500 t-states. Since it is constants we are using, there are some tricks that can be employed. For example, to do AHL mod 65529, I noted that 65529=65536-17, so if I subtracted A*65536, I would need to add A*17 which is much easier to do and I would only need to add that to the HL register. What is more, if the addition sets the carry flag, I can just subtract 65529 or add 17 to HL (which reminds me that I posted that unoptimised version that subtracts). Then you just add 17 again, if that sets the carry flag, it means HL was >=65529, so the addition was good, otherwise, subtract the 17.

This costs about 162 more cycles to do modulo 16777259 on BHLA and I will soon edit the post to reflect the new change.
Xeda112358 wrote:
Yeah, I noticed some problems, too. Instead of just taking the lower 24 bits, it seems that it is better to mod the final result by an odd number (I tried the prime 16777259 and it yields better results).


It's probably important that it's prime, not just odd, since otherwise you're not working on a field.
If that's of any use to anyone, here is a signed AHL = AHL / DE routine - I think it's pretty optimized (70 bytes) :

Code:
sDiv2416:
 xor d
 push af
 xor d
 ld b, a
 jp p, divdPos
 xor a
 sub l
 ld l, a
 ld a, 0
 sbc a, h
 ld h, a
 ld a, 0
 sbc a, b
 ld b, a
divdPos:
 bit 7, d
 jr z, divrPos
 xor a
 sub e
 ld e, a
 ld a, 0
 sbc a, d
 ld d, a
 ld a, b
divrPos:

 push hl
 pop ix
 ld hl, 0
 ld b, 24
divLoop:
 add ix, ix
 rla
 adc hl, hl
 sbc hl, de
 jr nc, $ + 4
 add hl, de
 ; jp nc, xxxx
 ; it works because the carry can not be set and "inc ix" is 2 bytes
 .db $D2
 inc ix
 djnz divLoop

 push ix
 pop hl
 ld b, a
 pop af
 ld a, b
 ret p
 xor a
 sub l
 ld l, a
 ld a, 0
 sbc a, h
 ld h, a
 ld a, 0
 sbc a, b
 ret
Been writing code recently
https://www.cemetech.net/forum/viewtopic.php?t=10848 - Binary to BCD conversion/display for large values
https://www.cemetech.net/forum/viewtopic.php?t=10678 - Enabling and using half-res mode on the TI-84+ C Silver Edition
  
Register to Join the Conversation
Have your own thoughts to add to this or any other topic? Want to ask a question, offer a suggestion, share your own programs and projects, upload a file to the file archives, get help with calculator and computer programming, or simply chat with like-minded coders and tech and calculator enthusiasts via the site-wide AJAX SAX widget? Registration for a free Cemetech account only takes a minute.

» Go to Registration page
» Goto page Previous  1, 2, 3, 4, 5, 6, 7, 8  Next
» View previous topic :: View next topic  
Page 6 of 8
» All times are UTC - 5 Hours
 
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum

 

Advertisement