» Goto page Previous  1, 2, 3, 4, 5, 6, 7, 8  Next
EDIT: I found a bug and I managed to optimize out an average of roughly 600 cycles (I got that number from 500 simulations of randomized inputs and multiplying the eZ80 timings by 6. It might be faster.) The following code has been replaced.

I was optimising some algorithms and code in Grammer and I came across a pretty big speed optimisation for computing 16-bit GCD:

Code:
``` gcdHL_DE: ;gcd(HL,DE)->HL ;Output: ;   B=0 ;   HL is the GCD of the inputs ;Destroys: ;   A,DE ;     DE is guaranteed 0 unless the output is 0 (which only happens if one of the inputs is 0). ;Uses the binary GCD algorithm ;65 bytes ;B is our cofactor-of-2 counter     ld b,0 ;If HL=0, return 0     ld a,h \ or l \ ret z ;If DE=0, return 0     ex de,hl     ld a,h \ or l \ jr nz,gcd_test_cofactor_of_2     ret gcd_cofactor_2_loop:     inc b     srl h \ rr l     srl d \ rr e gcd_test_cofactor_of_2:     inc b     ld a,e     or l     rra     jr nc,gcd_cofactor_2_loop gcd_remove_factors_of_2_op2:     srl h \ rr l \ jr nc,gcd_remove_factors_of_2_op2     adc hl,hl     jr gcd_swap_ops gcd_swap_ops_negate: ;At this point, HL needs to be negated and swapped with DE     xor a \ sub l \ ld l,a \ sbc a,a \ sub h \ ld h,a gcd_swap_ops:     ex de,hl gcd_remove_factors_of_2_op1:     srl h \ rr l \ jr nc,gcd_remove_factors_of_2_op1     adc hl,hl     sbc hl,de     jr c,gcd_swap_ops_negate     jp nz,gcd_remove_factors_of_2_op1 ;DE is the GCD, need to shift it left B-1 times.     ex de,hl     dec b     ret z     add hl,hl \ djnz \$-1     ret ```

I was using a slightly more naive approach before, computing the mod of two 16-bit values over and over (the Euclidean Algorithm). This code is pretty hefty, though, at 80 bytes. You can save two bytes by turning the JP instructions into JR, but there will be a tiny speed loss of a few cycles

EDIT Made it 15 bytes smaller Also added comments and labels since the codeflow is a little confusing.
Since the weekend, I have been trying to really optimise my square root algorithms for application in a 24-bit and 80-bit float library. During the process, I saw some major failings in my old routines (and those were still pretty fast). The biggest problem was my huge over estimate of the bit-size of certain intermediate values. After taking that into account, I can offer a new routine that is still being rigorously optimised, but is already quite fast:

Code:
``` sqrtHL: ;Input: HL ;Output: A ;63 bytes ;749 t-states worst case    ld bc,600h    ld d,c    ld e,c sqrt16loop:    add hl,hl \ rl e    add hl,hl \ rl e    rl c    ld a,c    rla    sub e    jr nc,\$+5    inc c    cpl    ld e,a    djnz sqrt16loop    sla c \ ld a,c \ add a,a    rl h \ rl e    rl h \ rl e    jr nc,\$+6    sub e \ jp \$+6    sub e    jr nc,\$+5    inc c    cpl    ld e,a        ld l,c       ld a,l    add hl,hl \ rl e \ rl d    add hl,hl \ rl e \ rl d    sbc hl,de    rla    ret ```

The biggest drawback is the size, since the last two iterations of the algorithm are handled outside the loop.
However, if you really want speed, this unrolled+optimized version is 31 bytes larger and pretty darn fast at 391cc worst case:

Code:
``` sqrtHL: ;returns A as the sqrt, HL as the remainder, D = 0 ;min: 352cc ;max: 391cc ;avg: 371.5cc ;94 bytes   ld de,05040h  ; 10   ld a,h        ; 4   sub e         ; 4   jr nc,sq7     ;\   add a,e       ; | branch 1: 12cc   ld d,16       ; | branch 2: 18cc sq7:            ;/ ; ----------   cp d          ; 4   jr c,sq6      ;\   sub d         ; | branch 1: 12cc   set 5,d       ; | branch 2: 19cc sq6:            ;/ ; ----------   res 4,d       ; 8   srl d         ; 8   set 2,d       ; 8   cp d          ; 4   jr c,sq5      ;\   sub d         ; | branch 1: 12cc   set 3,d       ; | branch 2: 19cc sq5:            ;/   srl d         ; 8 ; ----------   inc a         ; 4   sub d         ; 4   jr nc,sq4     ;\   dec d         ; | branch 1: 12cc   add a,d       ; | branch 2: 19cc   dec d         ; | <-- this resets the low bit of D, so `srl d` resets carry. sq4:            ;/   srl d         ; 8   ld h,a        ; 4 ; ----------   ld a,e        ; 4   sbc hl,de     ; 15   jr nc,sq3     ;\   add hl,de     ; | 12cc or 18cc sq3:            ;/   ccf           ; 4   rra           ; 4   srl d         ; 8   rra           ; 4 ; ----------   ld e,a        ; 4   sbc hl,de     ; 15   jr c,sq2      ;\   or 20h        ; | branch 1: 23cc   db 254        ; |   <-- start of `cp *` which is 7cc to skip the next byte. sq2:            ; | branch 2: 21cc   add hl,de     ;/   xor 18h       ; 7   srl d         ; 8   rra           ; 4 ; ----------   ld e,a        ; 4   sbc hl,de     ; 15   jr c,sq1      ;\   or 8          ; | branch 1: 23cc   db 254        ; |   <-- start of `cp *` which is 7cc to skip the next byte. sq1:            ; | branch 2: 21cc   add hl,de     ;/   xor 6         ; 7   srl d         ; 8   rra           ; 4 ; ----------   ld e,a        ; 4   sbc hl,de     ; 15   jr nc,+_      ;    \   add hl,de     ; 15  |   srl d         ; 8   |   rra           ; 4   | branch 1: 38cc   ret           ; 10  | branch 2: 40cc _:              ;     |   inc a         ; 4   |   srl d         ; 8   |   rra           ; 4   |   ret           ; 10 / ```

For comparison, the first is comparable to the more elegant/famous version I keep running into as I look for 'the best 16-bit square root routine', and the unrolled version is faster than any that I have run into yet.

Feel free to optimise because I am confident that there are still enough parts of code to change to improve the speed and size.
EDIT: updated the unrolled code to be over 25% faster and about 12 bytes smaller!
EDIT 9/9/19 : Optimized the unrolled version even more!
Not sure where it goes, but since the 84+C thread was filled with discussion about hardware, I didn't want to go off-topic.

So, here's a rather complete sprite drawing routine, with custom size (up to 256*256), custom coordinates, custom palette, no clipping, and that works with the regular TI-OS-set LCD windowing settings :
Code:
```; BC:X coordinate ; DE:Y coordinate ; H:width ; L:height ; IX:sprite ; palette:label must refer to your custom up-to-256-bytes palette drawSprite:    push   de    push   bc    ld   a, l    push   af    ld   a, 20h    out   (10h), a    out   (10h), a    ld   a, d    out   (11h), a    ld   a, e    out   (11h), a    ld   a, 21h    out   (10h), a    out   (10h), a    ld   a, b    out   (11h), a    ld   a, c    out   (11h), a    ld   a, 22h    out   (10h), a    out   (10h), a        ld   a, h drawSpriteXloop:    push   af    push   hl    ld   l, (ix + 0)    inc   ix    ld   h, 0    add   hl, hl    push   de    ld   de, palette    add   hl, de    ld   a, (hl)    inc   hl    ld   h, (hl)    ld   l, a    pop   de    ld a, h    out (11h), a    ld a, l    out (11h), a    pop   hl    pop   af    dec   a    jr   nz, drawSpriteXloop        inc   de    pop   af    dec   a    jr   nz, drawSprite + 3        pop   bc    pop   de    ret```

Pretty sure it can be heavily optimized. I'm not that good at ASM.
That's pretty nice, matrefeytontias; good work. If you're using Doors CSE, you can also use our highly-optimized 1-bit, 2-bit, 4-bit, 4-bit with 2x enlarging, or 8-bit routines. Runer112, who helped me frame these routines and figure out how to structure the arguments, has an eventual goal of finishing his clipping versions of these routines.
For my next trick, je vous presentez, logarithme naturel (in fixed point 8.8 format):

Code:
``` lognat: ;Input:  H.L needs to be on [1,2] ;Output: H.L if z flag is set, else if nz, no result ;avg speed: 677 t-states    dec h       dec h    jr nz,\$+8    inc l \ dec l    ret nz    ld l,177    ret    inc h    ret nz    ld b,h    ld c,l    ld e,l    ld d,8    add hl,hl    add hl,hl    add hl,de    ex de,hl    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ adc a,a    ld h,a \ ld l,b    sla h \ jr c,\$+3 \ ld l,c    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    rl l    ld a,h    adc a,b    ld l,a    ld h,b    cp a    ret ```

It only works accurately on [1,2], where I designed the algorithm. It is pretty fast, returning the result in 677 t-states or less, but its range is fairly limited (to 257 values).

I have been working on another routine that actually works on (0,128), but it takes upwards of about 1500 t-states and there are some accuracy issues I want to fix before posting.
EDIT:
I have the routine with a little better speed outlook and much better accuracy outside the [1,2] range than before (worst case is 2/256).

Code:
``` lognat: ;Input:  H.L needs to be on (0,128.0) ;Output: H.L if c flag set ;    returns nc if input is negative (HL not modified) ;Error: ;   The error on the outputs is as follows: ;      20592 inputs are exact ;      12075 inputs are off by 1/256 ;      100 inputs are off by 2/256 ;   So all 32767 inputs are within 2/256, with average error being <1/683 which is smaller than 1/256. ;Size: 177 bytes ;Speed: average speed is less than 1250 t-states    ld a,h \ or l \ jr nz,\$+5    ld h,80h \ ret    dec h    dec h    jr nz,\$+9    inc l \ dec l    jr nz,normalizeln-1    ld l,177    ret    inc h    jr nz,normalizeln    ld b,h    ld c,l    ld e,l    ld d,8    add hl,hl    add hl,hl    add hl,de    ex de,hl    call HL_Div_DE    ld h,a \ ld l,b    sla h \ jr c,\$+3 \ ld l,c    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    rl l    ld a,h    adc a,b    ld h,b    ld l,a    scf    ret HL_Div_DE:    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ adc a,a \ ret    inc h normalizeln:    xor a    inc h \ ret m    ld d,a \ ld e,a    ld a,l    jr z,toosmall    inc e \ srl h \ rra \ jr nz,\$-4    rla \ rl h    dec e stepin:    ld l,a    push de    call lognat    pop de    ;now multiply DE by 355, then divide by 2 (rounding)    ld b,d \ ld c,e \ ld a,d    ex de,hl    add hl,hl    add hl,hl   ;4    add hl,bc   ;5    add hl,hl   ;10    add hl,bc   ;11    add hl,hl   ;22    add hl,hl    add hl,hl    add hl,hl    add hl,bc    add hl,hl    add hl,bc    sra h \ rr l    adc hl,de    scf    ret toosmall:    dec d    dec e \ add a,a \ jr nc,\$-2    inc h    jp stepin ```

EDIT2:
Here are a bunch of fixed point 8.8 routines, mostly optimised for speed.

B.C*D.E→H.L

Code:
``` BC_Times_DE: ;  BHLA is the 32-bit result     ld a,b     or a     ld hl,0     ld b,h ;1     add a,a     jr nc,\$+4     ld h,d     ld l,e ;2     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,b ;227+10b-7p     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,b     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,b     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,b     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,b     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,b     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,b ;=== ;AHL is the result of B*DE*256     push hl     ld h,b     ld l,b     ld b,a     ld a,c     ld c,h ;1     add a,a     jr nc,\$+4     ld h,d     ld l,e ;2     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,c ;227+10b-7p     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,c     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,c     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,c     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,c     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,c     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,c     pop de ;Now BDE*256+AHL     ld c,a     ld a,l     ld l,h     ld h,c     add hl,de     ret nc     inc b ;BHLA is the 32-bit result     ret ```

D.E/B.C→D.E

Code:
``` DE_Div_BC_88: ;Inputs: ;     DE,BC are 8.8 Fixed Point numbers ;Outputs: ;     DE is the 8.8 Fixed Point result (rounded to the least significant bit)      ld a,8      ld hl,0 Loop1:      rl d      adc hl,hl      sbc hl,bc      jr nc,\$+3     add hl,bc      dec a      jr nz,Loop1      ld d,e      ld e,a      ld a,16     jp \$+6 DivLoop:      add hl,bc      dec a      ret z      sla e      rl d      adc hl,hl     jr c,overflow      sbc hl,bc      jr c,DivLoop      inc e      jp DivLoop+1 overflow:     or a     sbc hl,bc     inc e     jp DivLoop ```

sqrt(H.L)→H.L

Code:
``` SqrtHL_prec12: ;input: HL ;Output: HL ;183 bytes     xor a     ld b,a     ld e,l     ld l,h     ld h,a     add hl,hl     add hl,hl     cp h     jr nc,\$+5     dec h     ld a,4     add hl,hl     add hl,hl     ld c,a     sub h     jr nc,\$+6     cpl     ld h,a     inc c     inc c     ld a,c     add hl,hl     add hl,hl     add a,a     ld c,a     sub h     jr nc,\$+6     cpl     ld h,a     inc c     inc c     ld a,c     add hl,hl     add hl,hl     add a,a     ld c,a     sub h     jr nc,\$+6     cpl     ld h,a     inc c     inc c     ld a,c     ld l,e     add hl,hl     add hl,hl     add a,a     ld c,a     sub h     jr nc,\$+6     cpl     ld h,a     inc c     inc c     ld a,c     add hl,hl     add hl,hl     add a,a     ld c,a     sub h     jr nc,\$+6     cpl     ld h,a     inc c     inc c     ld a,c     add a,a \ ld c,a     add hl,hl     add hl,hl     jr nc,\$+6     sub h \ jp \$+6     sub h     jr nc,\$+6     inc c \ inc c     cpl     ld h,a     ld a,l     ld l,h     add a,a     ld h,a     adc hl,hl     adc hl,hl     sll c \ rl b     sbc hl,bc     jr nc,\$+3     add hl,bc     sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a ;iteration 9     add hl,hl \ add hl,hl     sll c \ rl b     sbc hl,bc     jr nc,\$+3     add hl,bc     sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a     add hl,hl \ add hl,hl     sll c \ rl b     sbc hl,bc     jr nc,\$+3     add hl,bc     sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a     add hl,hl \ add hl,hl     sll c \ rl b     sbc hl,bc     jr nc,\$+3     add hl,bc     sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a     add hl,hl \ add hl,hl     sll c \ rl b     sbc hl,bc     jr nc,\$+3     add hl,bc     sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a ;12th iteration completed ; output in BC     srl b \ rr c     ld h,b     ld l,c     ret ```

2H.L→DEH.L

Code:
``` pow2: ;Inputs: ;     HL is the 8.8 fixed point number 'x' for 2^x ;Outputs: ;     DEHL is the 24.8 fixed point result. If there was overflow exceeding 2^24, then this value is set to the max.      ld a,l      or a      push hl     ;save H for later, H is the integer part of the power      ld hl,1      jr z,integerpow      scf      ;set the carry flag so that a bit is rotated into a. This will act as our counter. ;wait until we come across the lowest bit. Also note that we      rra      jr nc,\$-1      ld hl,2*256 powloop:      push af      call FPSqrtHL    ;returns in HL      pop af      srl a      jr z,integerpow      jr nc,powloop      add hl,hl      jp powloop integerpow:      pop bc ;Now b is the integer part for 2^x that we need to multiply HL by.      ld de,0      ld a,b      or a      ret z      add hl,hl      rl e \ rl d \ jr c,wayoverflow      djnz \$-7      ret wayoverflow:      ld hl,-1      ld d,h      ld e,l      ret ```

log2(H.L)→H.L

Code:
``` Log_2_88_size: ;Inputs: ;     HL is an unsigned 8.8 fixed point number. ;Outputs: ;     HL is the signed 8.8 fixed point value of log base 2 of the input. ;Example: ;     pass HL = 3.0, returns 1.58203125 (actual is ~1.584962501...) ;averages about 39 t-states slower than original ;62 bytes      ex de,hl      ld hl,0      ld a,d      ld c,8      or a      jr z,DE_lessthan_1      srl d      jr z,logloop-1      inc l      rr e      jr \$-7 DE_lessthan_1:      ld a,e      dec hl      or a      ret z      inc l      dec l      add a,a      jr nc,\$-2      ld e,a      inc d logloop:      add hl,hl      push hl      ld h,d      ld l,e      ld a,e      ld b,8      add hl,hl      rla      jr nc,\$+5        add hl,de        adc a,0      djnz \$-7      ld e,h      ld d,a      pop hl      rr a         ;this is NOT supposed to be rra, we need the z flag affected      jr z,\$+7        srl d        rr e        inc l      dec c      jr nz,logloop      ret ```

sin(D.E)→H.L (uses the BC_Times_DE routine above)

Code:
``` ine_88: ;Inputs: de     push de     sra d \ rr e     ld b,d \ ld c,e     call BC_Times_DE     push hl     ;x^2/4     sra h \ rr l     ex de,hl     ld b,d \ ld c,e     call BC_Times_DE     sra h \ rr l     inc h     ex (sp),hl    ;x^4/128+1 is on stack, HL=x^2/4     xor a     ld d,a     ld b,h     ld c,l     add hl,hl \ rla     add hl,hl \ rla     add hl,bc \ adc a,d     ld b,h     ld c,l     add hl,hl \ rla     add hl,hl \ rla     add hl,hl \ rla     add hl,hl \ rla     add hl,bc \ adc a,d     ld e,l     ld l,h     ld h,a     rl e     adc hl,hl     rl e     jr nc,\$+3     inc hl     pop de     ex hl,de     or a     sbc hl,de     ex de,hl     pop bc     jp BC_Times_DE ```

EDIT3: If you are willing to use a 256 byte LUT, this is a much, much faster fixed point lg() routine that I thought of on my walk home:

Code:
``` lg_88: ;Input: HL is a fixed point number ;Output: lg(H.L)->H.L ;Speed: Avg: 340    ld de,lgLUT    ld b,0    ld a,h    or a    ret m    ld a,l    jr z,\$+8    inc b \ srl h \ rra \ jr nz,\$-4    or a \ jr nz,\$+6    ld hl,8000h \ ret    rra \ inc b \ jr nc,\$-2 ;A is the element to look up in the LUT    ld l,a     ld c,h     dec b    add hl,hl    add hl,de    ld e,(hl)    inc hl    ld d,(hl)     ex de,hl    add hl,bc    ret lglut: .dw \$F800 .dw \$F996 .dw \$FA52 .dw \$FACF .dw \$FB2C .dw \$FB76 .dw \$FBB3 .dw \$FBE8 .dw \$FC16 .dw \$FC3F .dw \$FC64 .dw \$FC86 .dw \$FCA5 .dw \$FCC1 .dw \$FCDC .dw \$FCF4 .dw \$FD0B .dw \$FD21 .dw \$FD36 .dw \$FD49 .dw \$FD5C .dw \$FD6D .dw \$FD7E .dw \$FD8E .dw \$FD9D .dw \$FDAC .dw \$FDBA .dw \$FDC8 .dw \$FDD5 .dw \$FDE2 .dw \$FDEE .dw \$FDFA .dw \$FE06 .dw \$FE11 .dw \$FE1C .dw \$FE26 .dw \$FE31 .dw \$FE3B .dw \$FE44 .dw \$FE4E .dw \$FE57 .dw \$FE60 .dw \$FE69 .dw \$FE71 .dw \$FE7A .dw \$FE82 .dw \$FE8A .dw \$FE92 .dw \$FE9A .dw \$FEA1 .dw \$FEA9 .dw \$FEB0 .dw \$FEB7 .dw \$FEBE .dw \$FEC5 .dw \$FECB .dw \$FED2 .dw \$FED8 .dw \$FEDF .dw \$FEE5 .dw \$FEEB .dw \$FEF1 .dw \$FEF7 .dw \$FEFD .dw \$FF03 .dw \$FF09 .dw \$FF0E .dw \$FF14 .dw \$FF19 .dw \$FF1E .dw \$FF24 .dw \$FF29 .dw \$FF2E .dw \$FF33 .dw \$FF38 .dw \$FF3D .dw \$FF42 .dw \$FF47 .dw \$FF4B .dw \$FF50 .dw \$FF55 .dw \$FF59 .dw \$FF5E .dw \$FF62 .dw \$FF67 .dw \$FF6B .dw \$FF6F .dw \$FF74 .dw \$FF78 .dw \$FF7C .dw \$FF80 .dw \$FF84 .dw \$FF88 .dw \$FF8C .dw \$FF90 .dw \$FF94 .dw \$FF98 .dw \$FF9B .dw \$FF9F .dw \$FFA3 .dw \$FFA7 .dw \$FFAA .dw \$FFAE .dw \$FFB2 .dw \$FFB5 .dw \$FFB9 .dw \$FFBC .dw \$FFC0 .dw \$FFC3 .dw \$FFC6 .dw \$FFCA .dw \$FFCD .dw \$FFD0 .dw \$FFD4 .dw \$FFD7 .dw \$FFDA .dw \$FFDD .dw \$FFE0 .dw \$FFE4 .dw \$FFE7 .dw \$FFEA .dw \$FFED .dw \$FFF0 .dw \$FFF3 .dw \$FFF6 .dw \$FFF9 .dw \$FFFC .dw \$FFFF ```

It averages about 340 t-states and should be accurate to every bit. Being this fast, it might be better to combine this with the very fast multiplication routine above to get log base x Or for example, you can incorporate a natural log routine by simply multiplying lg(x) by 1/lg(e) (which is approximately 355/512):

Code:
``` ln_88: ;Input: HL is a fixed point number ;Output: ln(H.L)->H.L ;Speed: Avg: 340+(325 worst case)    call lg_88    ;now signed multiply HL by 355, then divide by 2 (rounding)     ld de,0     bit 7,h     jr z,\$+9     dec e \ xor a \ sub l \ ld l,a     sbc a,a \ sub h \ ld h,a     ld b,h     ld c,l     xor a    add hl,hl       add hl,hl \ rla    add hl,bc \ adc a,d    add hl,hl \ rla    add hl,bc \ adc a,d    add hl,hl \ rla    add hl,hl \ rla    add hl,hl \ rla    add hl,hl \ rla    add hl,bc \ adc a,d    add hl,hl \ rla    add hl,bc \ adc a,d     sra a \ rr h     ld l,h     ld h,a     inc e     ret nz     xor a \ sub l \ ld l,a     sbc a,a \ sub h \ ld h,a     ret ```

The overhead being at most 325 t-states, bringing it to an average of 665 t-states for ln() on (0,128)
To add to the table-driven routines, this is a much faster arctan routine, using a 158-byte table (it is <800 t-states on values >1, and about 350 for values on [0,1]) It is also accurate in every bit for most values, but I imagine there are a few that are off in the last bit:

Code:
``` arctan_88: ;Input: ;   D.E ;Output: atan(D.E)->D.E    push de    ld a,d    or a    jp p,\$+5    neg    ld d,a    dec a    jr nz,checkneedinv    inc e \ dec e \ jr nz,checkneedinv    pop af \ rla \ ld de,201 \ ret nc \ ld de,-201 \ ret checkneedinv:    inc a    call nz,DEgt1_Inv ;0.E is the value to atan    ld hl,adjustatan    push hl    ld a,e    cp 46 \ ret c    dec a \ cp 42h \ ret c    dec a \ cp 4Eh \ ret c    dec a \ cp 57h \ ret c    dec a \ cp 5Eh \ ret c    dec a \ cp 64h \ ret c    dec a \ cp 6Ah \ ret c    dec a \ cp 6Fh \ ret c    sub 6Fh \ ld e,a    ld hl,atanlut    add hl,de    ld a,(hl)    ret adjustatan:    ld e,a    pop bc    ld a,b    or a    jp p,\$+5    neg    jr z,\$+9    ld hl,402    or a    sbc hl,de    ex de,hl    rl b    ret nc    xor a    sub e    ld e,a    sbc a,a    sub d    ld d,a    ret DEgt1_Inv: ;Works if DE>1    ld hl,256    ld b,8 InvLoop:    add hl,hl    sbc hl,de    jr nc,\$+3    add hl,de    adc a,a    djnz InvLoop     cpl    ld e,a     ld d,b     ret atanlut: .db \$6F .db \$6F .db \$70 .db \$71 .db \$72 .db \$73 .db \$73 .db \$74 .db \$75 .db \$76 .db \$77 .db \$77 .db \$78 .db \$79 .db \$7A .db \$7B .db \$7B .db \$7C .db \$7D .db \$7E .db \$7F .db \$7F .db \$80 .db \$81 .db \$82 .db \$82 .db \$83 .db \$84 .db \$85 .db \$85 .db \$86 .db \$87 .db \$88 .db \$88 .db \$89 .db \$8A .db \$8B .db \$8B .db \$8C .db \$8D .db \$8E .db \$8E .db \$8F .db \$90 .db \$90 .db \$91 .db \$92 .db \$93 .db \$93 .db \$94 .db \$95 .db \$95 .db \$96 .db \$97 .db \$97 .db \$98 .db \$99 .db \$9A .db \$9A .db \$9B .db \$9C .db \$9C .db \$9D .db \$9E .db \$9E .db \$9F .db \$A0 .db \$A0 .db \$A1 .db \$A2 .db \$A2 .db \$A3 .db \$A3 .db \$A4 .db \$A5 .db \$A5 .db \$A6 .db \$A7 .db \$A7 .db \$A8 .db \$A9 .db \$A9 .db \$AA .db \$AA .db \$AB .db \$AC .db \$AC .db \$AD .db \$AD .db \$AE .db \$AF .db \$AF .db \$B0 .db \$B0 .db \$B1 .db \$B2 .db \$B2 .db \$B3 .db \$B3 .db \$B4 .db \$B5 .db \$B5 .db \$B6 .db \$B6 .db \$B7 .db \$B7 .db \$B8 .db \$B9 .db \$B9 .db \$BA .db \$BA .db \$BB .db \$BB .db \$BC .db \$BC .db \$BD .db \$BE .db \$BE .db \$BF .db \$BF .db \$C0 .db \$C0 .db \$C1 .db \$C1 .db \$C2 .db \$C2 .db \$C3 .db \$C3 .db \$C4 .db \$C4 .db \$C5 .db \$C6 .db \$C6 .db \$C7 .db \$C7 .db \$C8 .db \$C8 .db \$C9 ```

It is looking like table methods are the best option for 8.8 fixed point, especially if you can reduce the range of needed values to be on [0,1] (see the above post, last edit for very fast ln() and lg() using a table)

Similarly, sine and cosine can be generated from a single 201 byte array.
I wrote a pseudo-random number generator with period 4292870399 today that I was hoping would be comparable to the version used by the OS (and it appears to be so). It takes approximately 1592 t-states making it much faster than using the OS bcalls and generates a 3-byte random integer:

Code:
``` Rand24: ;Inputs: seed1,seed2 ;Outputs: ;   AHL is the pseudo-random number ;   seed1,seed2 incremented accordingly ;Destroys: BC,DE ;Notes: ; seed1*243+83 mod 65519 -> seed1 ; seed2*251+43 mod 65521 -> seed2 ; Output seed1*seed2    ld hl,(seed1)    xor a    ld b,h \ ld c,l    ld de,83    add hl,hl \ rla      ;2    add hl,bc \ adc a,d   ;3    add hl,hl \ rla      ;6    add hl,bc \ adc a,d   ;7    add hl,hl \ rla      ;14    add hl,bc \ adc a,d   ;15    add hl,hl \ rla      ;30    add hl,hl \ rla      ;60    add hl,hl \ rla      ;120    add hl,bc \ adc a,d   ;121    add hl,hl \ rla      ;242    add hl,bc \ adc a,d   ;243    add hl,de \ adc a,d   ;243x+83 ;now AHL mod 65519 ; Essentially, we can at least subtract A*65519=A(65536-17), so add A*17 to HL    ex de,hl    ld l,a    ld b,h    ld c,l    add hl,hl    add hl,hl    add hl,hl    add hl,hl    add hl,bc    add hl,de    ld de,65519    jr nc,\$+5    or a \ sbc hl,de    or a \ sbc hl,de    jr nc,\$+3    add hl,de    ld (seed1),hl ;seed1 done, now we need to do seed2    ld hl,(seed2) ; seed1*243+83 mod 65519 -> seed1 ; seed2*251+43 mod 65521 -> seed2 ; Output seed1*seed2    xor a    ld b,h \ ld c,l    ld de,43    add hl,hl \ rla      ;2    add hl,bc \ adc a,d   ;3    add hl,hl \ rla      ;6    add hl,bc \ adc a,d   ;7    add hl,hl \ rla      ;14    add hl,bc \ adc a,d   ;15    add hl,hl \ rla      ;30    add hl,bc \ adc a,d   ;31    add hl,hl \ rla      ;62    add hl,hl \ rla      ;124    add hl,bc \ adc a,d   ;125    add hl,hl \ rla      ;250    add hl,bc \ adc a,d   ;251    add hl,de \ adc a,d   ;251x+83 ;now AHL mod 65521 ; Essentially, we can at least subtract A*65521=A(65536-15), so add A*15 to HL    ex de,hl    ld l,a    ld b,h    ld c,l    add hl,hl    add hl,hl    add hl,hl    add hl,hl    sbc hl,bc    add hl,de    ld de,65521    jr nc,\$+5    or a \ sbc hl,de    or a \ sbc hl,de    jr nc,\$+3    add hl,de    ld (seed2),hl ;now seed1 and seed 2 are computed    ld bc,(seed1)    ex de,hl ;now do BC_times_DE    call BC_Times_DE       ex de,hl    ld l,b    ld h,0    ld b,h    ld c,l    add hl,hl    add hl,hl    add hl,bc    add hl,hl    add hl,hl    add hl,bc    add hl,hl    add hl,bc    ld c,d    ld d,e    ld e,a    ld a,c    sbc hl,bc    sbc a,b    ret nc    ld c,43    add hl,bc    ret BC_Times_DE: ;BHLA is the result    ld a,b    or a    ld hl,0    ld b,h ;1    add a,a    jr nc,\$+4    ld h,d    ld l,e ;2    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,b ;227+10b-7p    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,b    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,b    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,b    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,b    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,b    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,b ;=== ;AHL is the result of B*DE*256    push hl    ld h,b    ld l,b    ld b,a    ld a,c    ld c,h ;1    add a,a    jr nc,\$+4    ld h,d    ld l,e ;2    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,c ;227+10b-7p    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,c    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,c    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,c    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,c    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,c    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,c    pop de ;Now BDE*256+AHL    ld c,a    ld a,l    ld l,h    ld h,c    add hl,de    ret nc    inc b ;BHLA is the 32-bit result    ret seed1:    .dw 0 seed2:    .dw 0 ```

Naturally, I plan to use this in my math libraries. To quickly describe the algorithm, I use 2 LCGs with relatively prime period lengths, generate the next terms for each of them, and multiply the results for the output (so if you really wanted to, you could get 31 bits out of BHLA, but the upper bit shouldn't be very reliable).

EDIT: Modified to perform a final mod 16777259 to provide better results.
What does your distribution look like? Seems as though you'll only be generating random composites.
Yes, that is what would be happening if you used the full 32-bit output of the multiplication in BHLA. That is one reason for why I suggested using bits from the lower 24 bits. Using bits 8 to 23 (in HL), I have generated some primes, such as 9767, just now.
Can you run it a bunch of times and show us the results? Maybe a nice pretty graph!
Yeah, a histogram would be great

I know nothing is labeled. I made it so the range was 0-499. X axis is the value returned by the random numbers, Y axis is the number of that number returned. 50000 numbers were generated.

As a comparison, here's the same with .net's random:
Yeah, I noticed some problems, too. Instead of just taking the lower 24 bits, it seems that it is better to mod the final result by an odd number (I tried the prime 16777259 and it yields better results). Currently, it is about 3 times more likely to return an even number compared to an odd number, but moding the final result by an odd number balances the output. It helps that large mods are relatively easy to perform.
Here it is with that:
Much better! Although since division is so expensive, what if you just drop the lowest 3 bits of that original output? What happens then?
Wouldn't that make it so the number 1 is never returned?
Actually, there would be no division in the sense you may be thinking about. The routine that I posted does two non-trivial modulos on 24-bit numbers and a 16x16->32 multiplication all in under 1500 t-states. Since it is constants we are using, there are some tricks that can be employed. For example, to do AHL mod 65529, I noted that 65529=65536-17, so if I subtracted A*65536, I would need to add A*17 which is much easier to do and I would only need to add that to the HL register. What is more, if the addition sets the carry flag, I can just subtract 65529 or add 17 to HL (which reminds me that I posted that unoptimised version that subtracts). Then you just add 17 again, if that sets the carry flag, it means HL was >=65529, so the addition was good, otherwise, subtract the 17.

This costs about 162 more cycles to do modulo 16777259 on BHLA and I will soon edit the post to reflect the new change.
Xeda112358 wrote:
Yeah, I noticed some problems, too. Instead of just taking the lower 24 bits, it seems that it is better to mod the final result by an odd number (I tried the prime 16777259 and it yields better results).

It's probably important that it's prime, not just odd, since otherwise you're not working on a field.
If that's of any use to anyone, here is a signed AHL = AHL / DE routine - I think it's pretty optimized (70 bytes) :

Code:
```sDiv2416:  xor d  push af  xor d  ld b, a  jp p, divdPos  xor a  sub l  ld l, a  ld a, 0  sbc a, h  ld h, a  ld a, 0  sbc a, b  ld b, a divdPos:  bit 7, d  jr z, divrPos  xor a  sub e  ld e, a  ld a, 0  sbc a, d  ld d, a  ld a, b divrPos:  push hl  pop ix  ld hl, 0  ld b, 24 divLoop:  add ix, ix  rla  adc hl, hl  sbc hl, de  jr nc, \$ + 4  add hl, de  ; jp nc, xxxx  ; it works because the carry can not be set and "inc ix" is 2 bytes  .db \$D2  inc ix  djnz divLoop  push ix  pop hl  ld b, a  pop af  ld a, b  ret p  xor a  sub l  ld l, a  ld a, 0  sbc a, h  ld h, a  ld a, 0  sbc a, b  ret```
Been writing code recently
https://www.cemetech.net/forum/viewtopic.php?t=10848 - Binary to BCD conversion/display for large values
https://www.cemetech.net/forum/viewtopic.php?t=10678 - Enabling and using half-res mode on the TI-84+ C Silver Edition

Register to Join the Conversation
Have your own thoughts to add to this or any other topic? Want to ask a question, offer a suggestion, share your own programs and projects, upload a file to the file archives, get help with calculator and computer programming, or simply chat with like-minded coders and tech and calculator enthusiasts via the site-wide AJAX SAX widget? Registration for a free Cemetech account only takes a minute.

»
» Goto page Previous  1, 2, 3, 4, 5, 6, 7, 8  Next
» All times are GMT - 5 Hours

You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum