Login [Register]
Don't have an account? Register now to chat, post, use our tools, and much more.
» Goto page Previous  1, 2, 3, 4, 5, 6, 7, 8  Next
Here is another routine that I made yesterday for a pseudo-random number generator:

Code:
``` PseudoRandWord: ;Outputs: ;     BC was the previous pseudorandom number ;     HL is the pseudorandom number ;f(n+1)=(241f(n)+257) mod 65536   ;65536 ;181 cycles, add 17 if called      ld hl,(randSeed)      ld c,l      ld b,h      add hl,hl      add hl,bc      add hl,hl      add hl,bc      add hl,hl      add hl,bc      add hl,hl      add hl,hl      add hl,hl      add hl,hl      add hl,bc      inc h      inc hl      ld (randSeed),hl      ret ```

There are a few other nice features, too. For example, every 16-bit value is hit if you run this 65536 times. Or, if you only read 1 byte (for example, H from the output), it will hit every 8-bit number once if you run this 256 times. Plus, it can be seeded
EDIT: I found a bug and I managed to optimize out an average of roughly 600 cycles (I got that number from 500 simulations of randomized inputs and multiplying the eZ80 timings by 6. It might be faster.) The following code has been replaced.

I was optimising some algorithms and code in Grammer and I came across a pretty big speed optimisation for computing 16-bit GCD:

Code:
``` GCD16: ;;Inputs: HL,DE ;;Output: HL ;;        BC=0 ;;        DE=0 ;;        A=0 ;Binary GCD algorithm ;80 bytes     xor a     ld b,a     ld c,a     sbc hl,bc     ex de,hl     ret z     sbc hl,bc     ex de,hl     ret z ;factor out cofactor powers of two     ld a,l \ or e \ rra \ jr nc,\$+16     srl h \ rr l     srl d \ rr e     inc b     ld a,l \ or e \ rra \ jr nc,\$-12 .loop: ;factor out powers of 2 from hl     ld a,l \ rra \ ld a,h \ jr c,\$+10     rra \ rr l \ bit 0,l \ jr z,\$-5 \ ld h,a ;factor out powers of 2 from de     ld a,e \ rra \ ld a,d \ jr c,\$+10     rra \ rr e \ bit 0,e \ jr z,\$-5 \ ld d,a     xor a     sbc hl,de     jr z,finish     jp nc,loop     add hl,de     or a     ex de,hl     sbc hl,de     jp loop .finish:     ex de,hl     dec b     inc b     ret z     add hl,hl \ djnz \$-1 \ ret ```

I was using a slightly more naive approach before, computing the mod of two 16-bit values over and over (the Euclidean Algorithm). This code is pretty hefty, though, at 80 bytes. You can save two bytes by turning the JP instructions into JR, but there will be a tiny speed loss of a few cycles
Since the weekend, I have been trying to really optimise my square root algorithms for application in a 24-bit and 80-bit float library. During the process, I saw some major failings in my old routines (and those were still pretty fast). The biggest problem was my huge over estimate of the bit-size of certain intermediate values. After taking that into account, I can offer a new routine that is still being rigorously optimised, but is already quite fast:

Code:
``` SqrtHL5: ;input: HL ;Output: A ;63 bytes ;749 t-states worst case    ld bc,600h    ld d,c    ld e,c sqrt16loop:    add hl,hl \ rl e    add hl,hl \ rl e    rl c    ld a,c    rla    sub e    jr nc,\$+5    inc c    cpl    ld e,a    djnz sqrt16loop    sla c \ ld a,c \ add a,a    rl h \ rl e    rl h \ rl e    jr nc,\$+6    sub e \ jp \$+6    sub e    jr nc,\$+5    inc c    cpl    ld e,a        ld l,c       ld a,l    add hl,hl \ rl e \ rl d    add hl,hl \ rl e \ rl d    sbc hl,de    rla    ret ```

The biggest drawback is the size, since the last two iterations of the algorithm are handled outside the loop.
However, if you really want speed, this unrolled+optimised version is pretty darn fast at 555 t-states worst case:

Code:
``` ;------------------------------- ;Square Root ;Inputs: ;HL = number to be square rooted ;Outputs: ;A  = square root ;111 bytes ;555 t-states worst case SqrtHL5: ;zero some registers    xor a    ld c,a    ld d,a ;move the LSB of the input into E for later use, then shift the LSB into L and load H with 0. ;H will be a carry register, where the bits in L are rotated in    ld e,l    ld l,h    ld h,c ;Iteration 1 is optimised ; C is treated as the accumulator    add hl,hl    add hl,hl    sub h    jr nc,\$+5    inc c    cpl    ld h,a ;Iteration 2 ; rotate in 2 more bits from the MSB of the input into H    add hl,hl    add hl,hl ; shift the accumulator    rl c    ld a,c    rla ; A is now double the shifted accumulator    sub h ; doubles as a comparison of the carry register (H) to double the accumulator    jr nc,\$+5 ; If the carry is > 2*accumulator, the bit in the accumulator needs to be 1:    inc c ; We need to perform H-(2C+1), but A=2C-H. ; We could do NEG to get A=H-2C, then DEC A, but NEG = CPL \ INC A ; NEG \ DEC A  =  CPL \ INC A \ DEC A ; So just use CPL, saving 8 t-states, 1 byte    cpl    ld h,a ;Iteration 3    add hl,hl    add hl,hl    rl c    ld a,c    rla    sub h    jr nc,\$+5    inc c    cpl    ld h,a ;Iteration 4    add hl,hl    add hl,hl    rl c    ld a,c    rla    sub h    jr nc,\$+5    inc c    cpl    ld h,a ;L is 0, H is the current carry ;E is the lower 8 bits ; Load the next set of bits (LSB of input) into L so that they can be rotated into H    ld l,e ;Iteration 5    add hl,hl    add hl,hl    rl c    ld a,c    rla    sub h    jr nc,\$+5    inc c    cpl    ld h,a ;Iteration 6    add hl,hl    add hl,hl    rl c    ld a,c    rla    sub h    jr nc,\$+5    inc c    cpl    ld h,a ;Iteration 7 ; Now we need to start worrying about 8 bit overflow. ; In particular, the carry register, H should be ideally 9 bits for this iteration, 10 for the last. ; The accumulator, C, is 8 bits, but we need to compare H to 2*C, and 2*C is up to 9 bits on the last iteration. ;l has 4 more bits to rotate into h    sla c \ ld a,c \ add a,a    add hl,hl    add hl,hl    jr nc,\$+6    sub h \ jp \$+6    sub h    jr nc,\$+5    inc c    cpl    ld h,a ;Iteration 8 ; A lot of fancy stuff here ; D is 0, from way back at the beginning ; now I put H->E so that DE can hold the potentially 10-bit number ; Now C->A, L->H ; H thus has the last two bits of the input that need to be rotated into DE ; L has the value of the accumualtor which needs to be multiplied by 4 for a comparison to DE ; So 2 shifts of HL into DE results in DE holding the carry, HL holding 4*accumulated result!       ld e,h    ld h,l    ld l,c       ld a,l    add hl,hl \ rl e \ rl d    add hl,hl \ rl e \ rl d    sbc hl,de ;the c flag now has the state of the last bit of the result, HL does not need to be restored.    rla    ret ```

For comparison, the first is comparable to the more elegant/famous version I keep running into as I look for 'the best 16-bit square root routine', and the unrolled version is faster than any that I have run into yet.

Feel free to optimise because I am confident that there are still enough parts of code to change to improve the speed and size.
Not sure where it goes, but since the 84+C thread was filled with discussion about hardware, I didn't want to go off-topic.

So, here's a rather complete sprite drawing routine, with custom size (up to 256*256), custom coordinates, custom palette, no clipping, and that works with the regular TI-OS-set LCD windowing settings :
Code:
```; BC:X coordinate ; DE:Y coordinate ; H:width ; L:height ; IX:sprite ; palette:label must refer to your custom up-to-256-bytes palette drawSprite:    push   de    push   bc    ld   a, l    push   af    ld   a, 20h    out   (10h), a    out   (10h), a    ld   a, d    out   (11h), a    ld   a, e    out   (11h), a    ld   a, 21h    out   (10h), a    out   (10h), a    ld   a, b    out   (11h), a    ld   a, c    out   (11h), a    ld   a, 22h    out   (10h), a    out   (10h), a        ld   a, h drawSpriteXloop:    push   af    push   hl    ld   l, (ix + 0)    inc   ix    ld   h, 0    add   hl, hl    push   de    ld   de, palette    add   hl, de    ld   a, (hl)    inc   hl    ld   h, (hl)    ld   l, a    pop   de    ld a, h    out (11h), a    ld a, l    out (11h), a    pop   hl    pop   af    dec   a    jr   nz, drawSpriteXloop        inc   de    pop   af    dec   a    jr   nz, drawSprite + 3        pop   bc    pop   de    ret```

Pretty sure it can be heavily optimized. I'm not that good at ASM.
That's pretty nice, matrefeytontias; good work. If you're using Doors CSE, you can also use our highly-optimized 1-bit, 2-bit, 4-bit, 4-bit with 2x enlarging, or 8-bit routines. Runer112, who helped me frame these routines and figure out how to structure the arguments, has an eventual goal of finishing his clipping versions of these routines.
For my next trick, je vous presentez, logarithme naturel (in fixed point 8.8 format):

Code:
``` lognat: ;Input:  H.L needs to be on [1,2] ;Output: H.L if z flag is set, else if nz, no result ;avg speed: 677 t-states    dec h       dec h    jr nz,\$+8    inc l \ dec l    ret nz    ld l,177    ret    inc h    ret nz    ld b,h    ld c,l    ld e,l    ld d,8    add hl,hl    add hl,hl    add hl,de    ex de,hl    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ adc a,a    ld h,a \ ld l,b    sla h \ jr c,\$+3 \ ld l,c    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    rl l    ld a,h    adc a,b    ld l,a    ld h,b    cp a    ret ```

It only works accurately on [1,2], where I designed the algorithm. It is pretty fast, returning the result in 677 t-states or less, but its range is fairly limited (to 257 values).

I have been working on another routine that actually works on (0,128), but it takes upwards of about 1500 t-states and there are some accuracy issues I want to fix before posting.
EDIT:
I have the routine with a little better speed outlook and much better accuracy outside the [1,2] range than before (worst case is 2/256).

Code:
``` lognat: ;Input:  H.L needs to be on (0,128.0) ;Output: H.L if c flag set ;    returns nc if input is negative (HL not modified) ;Error: ;   The error on the outputs is as follows: ;      20592 inputs are exact ;      12075 inputs are off by 1/256 ;      100 inputs are off by 2/256 ;   So all 32767 inputs are within 2/256, with average error being <1/683 which is smaller than 1/256. ;Size: 177 bytes ;Speed: average speed is less than 1250 t-states    ld a,h \ or l \ jr nz,\$+5    ld h,80h \ ret    dec h    dec h    jr nz,\$+9    inc l \ dec l    jr nz,normalizeln-1    ld l,177    ret    inc h    jr nz,normalizeln    ld b,h    ld c,l    ld e,l    ld d,8    add hl,hl    add hl,hl    add hl,de    ex de,hl    call HL_Div_DE    ld h,a \ ld l,b    sla h \ jr c,\$+3 \ ld l,c    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    rl l    ld a,h    adc a,b    ld h,b    ld l,a    scf    ret HL_Div_DE:    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ adc a,a \ ret    inc h normalizeln:    xor a    inc h \ ret m    ld d,a \ ld e,a    ld a,l    jr z,toosmall    inc e \ srl h \ rra \ jr nz,\$-4    rla \ rl h    dec e stepin:    ld l,a    push de    call lognat    pop de    ;now multiply DE by 355, then divide by 2 (rounding)    ld b,d \ ld c,e \ ld a,d    ex de,hl    add hl,hl    add hl,hl   ;4    add hl,bc   ;5    add hl,hl   ;10    add hl,bc   ;11    add hl,hl   ;22    add hl,hl    add hl,hl    add hl,hl    add hl,bc    add hl,hl    add hl,bc    sra h \ rr l    adc hl,de    scf    ret toosmall:    dec d    dec e \ add a,a \ jr nc,\$-2    inc h    jp stepin ```

EDIT2:
Here are a bunch of fixed point 8.8 routines, mostly optimised for speed.

B.C*D.E→H.L

Code:
``` BC_Times_DE: ;  BHLA is the 32-bit result     ld a,b     or a     ld hl,0     ld b,h ;1     add a,a     jr nc,\$+4     ld h,d     ld l,e ;2     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,b ;227+10b-7p     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,b     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,b     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,b     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,b     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,b     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,b ;=== ;AHL is the result of B*DE*256     push hl     ld h,b     ld l,b     ld b,a     ld a,c     ld c,h ;1     add a,a     jr nc,\$+4     ld h,d     ld l,e ;2     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,c ;227+10b-7p     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,c     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,c     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,c     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,c     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,c     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,c     pop de ;Now BDE*256+AHL     ld c,a     ld a,l     ld l,h     ld h,c     add hl,de     ret nc     inc b ;BHLA is the 32-bit result     ret ```

D.E/B.C→D.E

Code:
``` DE_Div_BC_88: ;Inputs: ;     DE,BC are 8.8 Fixed Point numbers ;Outputs: ;     DE is the 8.8 Fixed Point result (rounded to the least significant bit)      ld a,8      ld hl,0 Loop1:      rl d      adc hl,hl      sbc hl,bc      jr nc,\$+3     add hl,bc      dec a      jr nz,Loop1      ld d,e      ld e,a      ld a,16     jp \$+6 DivLoop:      add hl,bc      dec a      ret z      sla e      rl d      adc hl,hl     jr c,overflow      sbc hl,bc      jr c,DivLoop      inc e      jp DivLoop+1 overflow:     or a     sbc hl,bc     inc e     jp DivLoop ```

sqrt(H.L)→H.L

Code:
``` SqrtHL_prec12: ;input: HL ;Output: HL ;183 bytes     xor a     ld b,a     ld e,l     ld l,h     ld h,a     add hl,hl     add hl,hl     cp h     jr nc,\$+5     dec h     ld a,4     add hl,hl     add hl,hl     ld c,a     sub h     jr nc,\$+6     cpl     ld h,a     inc c     inc c     ld a,c     add hl,hl     add hl,hl     add a,a     ld c,a     sub h     jr nc,\$+6     cpl     ld h,a     inc c     inc c     ld a,c     add hl,hl     add hl,hl     add a,a     ld c,a     sub h     jr nc,\$+6     cpl     ld h,a     inc c     inc c     ld a,c     ld l,e     add hl,hl     add hl,hl     add a,a     ld c,a     sub h     jr nc,\$+6     cpl     ld h,a     inc c     inc c     ld a,c     add hl,hl     add hl,hl     add a,a     ld c,a     sub h     jr nc,\$+6     cpl     ld h,a     inc c     inc c     ld a,c     add a,a \ ld c,a     add hl,hl     add hl,hl     jr nc,\$+6     sub h \ jp \$+6     sub h     jr nc,\$+6     inc c \ inc c     cpl     ld h,a     ld a,l     ld l,h     add a,a     ld h,a     adc hl,hl     adc hl,hl     sll c \ rl b     sbc hl,bc     jr nc,\$+3     add hl,bc     sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a ;iteration 9     add hl,hl \ add hl,hl     sll c \ rl b     sbc hl,bc     jr nc,\$+3     add hl,bc     sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a     add hl,hl \ add hl,hl     sll c \ rl b     sbc hl,bc     jr nc,\$+3     add hl,bc     sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a     add hl,hl \ add hl,hl     sll c \ rl b     sbc hl,bc     jr nc,\$+3     add hl,bc     sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a     add hl,hl \ add hl,hl     sll c \ rl b     sbc hl,bc     jr nc,\$+3     add hl,bc     sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a ;12th iteration completed ; output in BC     srl b \ rr c     ld h,b     ld l,c     ret ```

2H.L→DEH.L

Code:
``` pow2: ;Inputs: ;     HL is the 8.8 fixed point number 'x' for 2^x ;Outputs: ;     DEHL is the 24.8 fixed point result. If there was overflow exceeding 2^24, then this value is set to the max.      ld a,l      or a      push hl     ;save H for later, H is the integer part of the power      ld hl,1      jr z,integerpow      scf      ;set the carry flag so that a bit is rotated into a. This will act as our counter. ;wait until we come across the lowest bit. Also note that we      rra      jr nc,\$-1      ld hl,2*256 powloop:      push af      call FPSqrtHL    ;returns in HL      pop af      srl a      jr z,integerpow      jr nc,powloop      add hl,hl      jp powloop integerpow:      pop bc ;Now b is the integer part for 2^x that we need to multiply HL by.      ld de,0      ld a,b      or a      ret z      add hl,hl      rl e \ rl d \ jr c,wayoverflow      djnz \$-7      ret wayoverflow:      ld hl,-1      ld d,h      ld e,l      ret ```

log2(H.L)→H.L

Code:
``` Log_2_88_size: ;Inputs: ;     HL is an unsigned 8.8 fixed point number. ;Outputs: ;     HL is the signed 8.8 fixed point value of log base 2 of the input. ;Example: ;     pass HL = 3.0, returns 1.58203125 (actual is ~1.584962501...) ;averages about 39 t-states slower than original ;62 bytes      ex de,hl      ld hl,0      ld a,d      ld c,8      or a      jr z,DE_lessthan_1      srl d      jr z,logloop-1      inc l      rr e      jr \$-7 DE_lessthan_1:      ld a,e      dec hl      or a      ret z      inc l      dec l      add a,a      jr nc,\$-2      ld e,a      inc d logloop:      add hl,hl      push hl      ld h,d      ld l,e      ld a,e      ld b,8      add hl,hl      rla      jr nc,\$+5        add hl,de        adc a,0      djnz \$-7      ld e,h      ld d,a      pop hl      rr a         ;this is NOT supposed to be rra, we need the z flag affected      jr z,\$+7        srl d        rr e        inc l      dec c      jr nz,logloop      ret ```

sin(D.E)→H.L (uses the BC_Times_DE routine above)

Code:
``` ine_88: ;Inputs: de     push de     sra d \ rr e     ld b,d \ ld c,e     call BC_Times_DE     push hl     ;x^2/4     sra h \ rr l     ex de,hl     ld b,d \ ld c,e     call BC_Times_DE     sra h \ rr l     inc h     ex (sp),hl    ;x^4/128+1 is on stack, HL=x^2/4     xor a     ld d,a     ld b,h     ld c,l     add hl,hl \ rla     add hl,hl \ rla     add hl,bc \ adc a,d     ld b,h     ld c,l     add hl,hl \ rla     add hl,hl \ rla     add hl,hl \ rla     add hl,hl \ rla     add hl,bc \ adc a,d     ld e,l     ld l,h     ld h,a     rl e     adc hl,hl     rl e     jr nc,\$+3     inc hl     pop de     ex hl,de     or a     sbc hl,de     ex de,hl     pop bc     jp BC_Times_DE ```

EDIT3: If you are willing to use a 256 byte LUT, this is a much, much faster fixed point lg() routine that I thought of on my walk home:

Code:
``` lg_88: ;Input: HL is a fixed point number ;Output: lg(H.L)->H.L ;Speed: Avg: 340    ld de,lgLUT    ld b,0    ld a,h    or a    ret m    ld a,l    jr z,\$+8    inc b \ srl h \ rra \ jr nz,\$-4    or a \ jr nz,\$+6    ld hl,8000h \ ret    rra \ inc b \ jr nc,\$-2 ;A is the element to look up in the LUT    ld l,a     ld c,h     dec b    add hl,hl    add hl,de    ld e,(hl)    inc hl    ld d,(hl)     ex de,hl    add hl,bc    ret lglut: .dw \$F800 .dw \$F996 .dw \$FA52 .dw \$FACF .dw \$FB2C .dw \$FB76 .dw \$FBB3 .dw \$FBE8 .dw \$FC16 .dw \$FC3F .dw \$FC64 .dw \$FC86 .dw \$FCA5 .dw \$FCC1 .dw \$FCDC .dw \$FCF4 .dw \$FD0B .dw \$FD21 .dw \$FD36 .dw \$FD49 .dw \$FD5C .dw \$FD6D .dw \$FD7E .dw \$FD8E .dw \$FD9D .dw \$FDAC .dw \$FDBA .dw \$FDC8 .dw \$FDD5 .dw \$FDE2 .dw \$FDEE .dw \$FDFA .dw \$FE06 .dw \$FE11 .dw \$FE1C .dw \$FE26 .dw \$FE31 .dw \$FE3B .dw \$FE44 .dw \$FE4E .dw \$FE57 .dw \$FE60 .dw \$FE69 .dw \$FE71 .dw \$FE7A .dw \$FE82 .dw \$FE8A .dw \$FE92 .dw \$FE9A .dw \$FEA1 .dw \$FEA9 .dw \$FEB0 .dw \$FEB7 .dw \$FEBE .dw \$FEC5 .dw \$FECB .dw \$FED2 .dw \$FED8 .dw \$FEDF .dw \$FEE5 .dw \$FEEB .dw \$FEF1 .dw \$FEF7 .dw \$FEFD .dw \$FF03 .dw \$FF09 .dw \$FF0E .dw \$FF14 .dw \$FF19 .dw \$FF1E .dw \$FF24 .dw \$FF29 .dw \$FF2E .dw \$FF33 .dw \$FF38 .dw \$FF3D .dw \$FF42 .dw \$FF47 .dw \$FF4B .dw \$FF50 .dw \$FF55 .dw \$FF59 .dw \$FF5E .dw \$FF62 .dw \$FF67 .dw \$FF6B .dw \$FF6F .dw \$FF74 .dw \$FF78 .dw \$FF7C .dw \$FF80 .dw \$FF84 .dw \$FF88 .dw \$FF8C .dw \$FF90 .dw \$FF94 .dw \$FF98 .dw \$FF9B .dw \$FF9F .dw \$FFA3 .dw \$FFA7 .dw \$FFAA .dw \$FFAE .dw \$FFB2 .dw \$FFB5 .dw \$FFB9 .dw \$FFBC .dw \$FFC0 .dw \$FFC3 .dw \$FFC6 .dw \$FFCA .dw \$FFCD .dw \$FFD0 .dw \$FFD4 .dw \$FFD7 .dw \$FFDA .dw \$FFDD .dw \$FFE0 .dw \$FFE4 .dw \$FFE7 .dw \$FFEA .dw \$FFED .dw \$FFF0 .dw \$FFF3 .dw \$FFF6 .dw \$FFF9 .dw \$FFFC .dw \$FFFF ```

It averages about 340 t-states and should be accurate to every bit. Being this fast, it might be better to combine this with the very fast multiplication routine above to get log base x Or for example, you can incorporate a natural log routine by simply multiplying lg(x) by 1/lg(e) (which is approximately 355/512):

Code:
``` ln_88: ;Input: HL is a fixed point number ;Output: ln(H.L)->H.L ;Speed: Avg: 340+(325 worst case)    call lg_88    ;now signed multiply HL by 355, then divide by 2 (rounding)     ld de,0     bit 7,h     jr z,\$+9     dec e \ xor a \ sub l \ ld l,a     sbc a,a \ sub h \ ld h,a     ld b,h     ld c,l     xor a    add hl,hl       add hl,hl \ rla    add hl,bc \ adc a,d    add hl,hl \ rla    add hl,bc \ adc a,d    add hl,hl \ rla    add hl,hl \ rla    add hl,hl \ rla    add hl,hl \ rla    add hl,bc \ adc a,d    add hl,hl \ rla    add hl,bc \ adc a,d     sra a \ rr h     ld l,h     ld h,a     inc e     ret nz     xor a \ sub l \ ld l,a     sbc a,a \ sub h \ ld h,a     ret ```

The overhead being at most 325 t-states, bringing it to an average of 665 t-states for ln() on (0,128)
To add to the table-driven routines, this is a much faster arctan routine, using a 158-byte table (it is <800 t-states on values >1, and about 350 for values on [0,1]) It is also accurate in every bit for most values, but I imagine there are a few that are off in the last bit:

Code:
``` arctan_88: ;Input: ;   D.E ;Output: atan(D.E)->D.E    push de    ld a,d    or a    jp p,\$+5    neg    ld d,a    dec a    jr nz,checkneedinv    inc e \ dec e \ jr nz,checkneedinv    pop af \ rla \ ld de,201 \ ret nc \ ld de,-201 \ ret checkneedinv:    inc a    call nz,DEgt1_Inv ;0.E is the value to atan    ld hl,adjustatan    push hl    ld a,e    cp 46 \ ret c    dec a \ cp 42h \ ret c    dec a \ cp 4Eh \ ret c    dec a \ cp 57h \ ret c    dec a \ cp 5Eh \ ret c    dec a \ cp 64h \ ret c    dec a \ cp 6Ah \ ret c    dec a \ cp 6Fh \ ret c    sub 6Fh \ ld e,a    ld hl,atanlut    add hl,de    ld a,(hl)    ret adjustatan:    ld e,a    pop bc    ld a,b    or a    jp p,\$+5    neg    jr z,\$+9    ld hl,402    or a    sbc hl,de    ex de,hl    rl b    ret nc    xor a    sub e    ld e,a    sbc a,a    sub d    ld d,a    ret DEgt1_Inv: ;Works if DE>1    ld hl,256    ld b,8 InvLoop:    add hl,hl    sbc hl,de    jr nc,\$+3    add hl,de    adc a,a    djnz InvLoop     cpl    ld e,a     ld d,b     ret atanlut: .db \$6F .db \$6F .db \$70 .db \$71 .db \$72 .db \$73 .db \$73 .db \$74 .db \$75 .db \$76 .db \$77 .db \$77 .db \$78 .db \$79 .db \$7A .db \$7B .db \$7B .db \$7C .db \$7D .db \$7E .db \$7F .db \$7F .db \$80 .db \$81 .db \$82 .db \$82 .db \$83 .db \$84 .db \$85 .db \$85 .db \$86 .db \$87 .db \$88 .db \$88 .db \$89 .db \$8A .db \$8B .db \$8B .db \$8C .db \$8D .db \$8E .db \$8E .db \$8F .db \$90 .db \$90 .db \$91 .db \$92 .db \$93 .db \$93 .db \$94 .db \$95 .db \$95 .db \$96 .db \$97 .db \$97 .db \$98 .db \$99 .db \$9A .db \$9A .db \$9B .db \$9C .db \$9C .db \$9D .db \$9E .db \$9E .db \$9F .db \$A0 .db \$A0 .db \$A1 .db \$A2 .db \$A2 .db \$A3 .db \$A3 .db \$A4 .db \$A5 .db \$A5 .db \$A6 .db \$A7 .db \$A7 .db \$A8 .db \$A9 .db \$A9 .db \$AA .db \$AA .db \$AB .db \$AC .db \$AC .db \$AD .db \$AD .db \$AE .db \$AF .db \$AF .db \$B0 .db \$B0 .db \$B1 .db \$B2 .db \$B2 .db \$B3 .db \$B3 .db \$B4 .db \$B5 .db \$B5 .db \$B6 .db \$B6 .db \$B7 .db \$B7 .db \$B8 .db \$B9 .db \$B9 .db \$BA .db \$BA .db \$BB .db \$BB .db \$BC .db \$BC .db \$BD .db \$BE .db \$BE .db \$BF .db \$BF .db \$C0 .db \$C0 .db \$C1 .db \$C1 .db \$C2 .db \$C2 .db \$C3 .db \$C3 .db \$C4 .db \$C4 .db \$C5 .db \$C6 .db \$C6 .db \$C7 .db \$C7 .db \$C8 .db \$C8 .db \$C9 ```

It is looking like table methods are the best option for 8.8 fixed point, especially if you can reduce the range of needed values to be on [0,1] (see the above post, last edit for very fast ln() and lg() using a table)

Similarly, sine and cosine can be generated from a single 201 byte array.
I wrote a pseudo-random number generator with period 4292870399 today that I was hoping would be comparable to the version used by the OS (and it appears to be so). It takes approximately 1592 t-states making it much faster than using the OS bcalls and generates a 3-byte random integer:

Code:
``` Rand24: ;Inputs: seed1,seed2 ;Outputs: ;   AHL is the pseudo-random number ;   seed1,seed2 incremented accordingly ;Destroys: BC,DE ;Notes: ; seed1*243+83 mod 65519 -> seed1 ; seed2*251+43 mod 65521 -> seed2 ; Output seed1*seed2    ld hl,(seed1)    xor a    ld b,h \ ld c,l    ld de,83    add hl,hl \ rla      ;2    add hl,bc \ adc a,d   ;3    add hl,hl \ rla      ;6    add hl,bc \ adc a,d   ;7    add hl,hl \ rla      ;14    add hl,bc \ adc a,d   ;15    add hl,hl \ rla      ;30    add hl,hl \ rla      ;60    add hl,hl \ rla      ;120    add hl,bc \ adc a,d   ;121    add hl,hl \ rla      ;242    add hl,bc \ adc a,d   ;243    add hl,de \ adc a,d   ;243x+83 ;now AHL mod 65519 ; Essentially, we can at least subtract A*65519=A(65536-17), so add A*17 to HL    ex de,hl    ld l,a    ld b,h    ld c,l    add hl,hl    add hl,hl    add hl,hl    add hl,hl    add hl,bc    add hl,de    ld de,65519    jr nc,\$+5    or a \ sbc hl,de    or a \ sbc hl,de    jr nc,\$+3    add hl,de    ld (seed1),hl ;seed1 done, now we need to do seed2    ld hl,(seed2) ; seed1*243+83 mod 65519 -> seed1 ; seed2*251+43 mod 65521 -> seed2 ; Output seed1*seed2    xor a    ld b,h \ ld c,l    ld de,43    add hl,hl \ rla      ;2    add hl,bc \ adc a,d   ;3    add hl,hl \ rla      ;6    add hl,bc \ adc a,d   ;7    add hl,hl \ rla      ;14    add hl,bc \ adc a,d   ;15    add hl,hl \ rla      ;30    add hl,bc \ adc a,d   ;31    add hl,hl \ rla      ;62    add hl,hl \ rla      ;124    add hl,bc \ adc a,d   ;125    add hl,hl \ rla      ;250    add hl,bc \ adc a,d   ;251    add hl,de \ adc a,d   ;251x+83 ;now AHL mod 65521 ; Essentially, we can at least subtract A*65521=A(65536-15), so add A*15 to HL    ex de,hl    ld l,a    ld b,h    ld c,l    add hl,hl    add hl,hl    add hl,hl    add hl,hl    sbc hl,bc    add hl,de    ld de,65521    jr nc,\$+5    or a \ sbc hl,de    or a \ sbc hl,de    jr nc,\$+3    add hl,de    ld (seed2),hl ;now seed1 and seed 2 are computed    ld bc,(seed1)    ex de,hl ;now do BC_times_DE    call BC_Times_DE       ex de,hl    ld l,b    ld h,0    ld b,h    ld c,l    add hl,hl    add hl,hl    add hl,bc    add hl,hl    add hl,hl    add hl,bc    add hl,hl    add hl,bc    ld c,d    ld d,e    ld e,a    ld a,c    sbc hl,bc    sbc a,b    ret nc    ld c,43    add hl,bc    ret BC_Times_DE: ;BHLA is the result    ld a,b    or a    ld hl,0    ld b,h ;1    add a,a    jr nc,\$+4    ld h,d    ld l,e ;2    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,b ;227+10b-7p    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,b    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,b    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,b    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,b    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,b    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,b ;=== ;AHL is the result of B*DE*256    push hl    ld h,b    ld l,b    ld b,a    ld a,c    ld c,h ;1    add a,a    jr nc,\$+4    ld h,d    ld l,e ;2    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,c ;227+10b-7p    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,c    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,c    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,c    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,c    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,c    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,c    pop de ;Now BDE*256+AHL    ld c,a    ld a,l    ld l,h    ld h,c    add hl,de    ret nc    inc b ;BHLA is the 32-bit result    ret seed1:    .dw 0 seed2:    .dw 0 ```

Naturally, I plan to use this in my math libraries. To quickly describe the algorithm, I use 2 LCGs with relatively prime period lengths, generate the next terms for each of them, and multiply the results for the output (so if you really wanted to, you could get 31 bits out of BHLA, but the upper bit shouldn't be very reliable).

EDIT: Modified to perform a final mod 16777259 to provide better results.
What does your distribution look like? Seems as though you'll only be generating random composites.
Yes, that is what would be happening if you used the full 32-bit output of the multiplication in BHLA. That is one reason for why I suggested using bits from the lower 24 bits. Using bits 8 to 23 (in HL), I have generated some primes, such as 9767, just now.
Can you run it a bunch of times and show us the results? Maybe a nice pretty graph!
Yeah, a histogram would be great
I made a quick graph:

I know nothing is labeled. I made it so the range was 0-499. X axis is the value returned by the random numbers, Y axis is the number of that number returned. 50000 numbers were generated.

As a comparison, here's the same with .net's random:
Yeah, I noticed some problems, too. Instead of just taking the lower 24 bits, it seems that it is better to mod the final result by an odd number (I tried the prime 16777259 and it yields better results). Currently, it is about 3 times more likely to return an even number compared to an odd number, but moding the final result by an odd number balances the output. It helps that large mods are relatively easy to perform.
Here it is with that:
Much better! Although since division is so expensive, what if you just drop the lowest 3 bits of that original output? What happens then?
Wouldn't that make it so the number 1 is never returned?
Actually, there would be no division in the sense you may be thinking about. The routine that I posted does two non-trivial modulos on 24-bit numbers and a 16x16->32 multiplication all in under 1500 t-states. Since it is constants we are using, there are some tricks that can be employed. For example, to do AHL mod 65529, I noted that 65529=65536-17, so if I subtracted A*65536, I would need to add A*17 which is much easier to do and I would only need to add that to the HL register. What is more, if the addition sets the carry flag, I can just subtract 65529 or add 17 to HL (which reminds me that I posted that unoptimised version that subtracts). Then you just add 17 again, if that sets the carry flag, it means HL was >=65529, so the addition was good, otherwise, subtract the 17.

This costs about 162 more cycles to do modulo 16777259 on BHLA and I will soon edit the post to reflect the new change.
Xeda112358 wrote:
Yeah, I noticed some problems, too. Instead of just taking the lower 24 bits, it seems that it is better to mod the final result by an odd number (I tried the prime 16777259 and it yields better results).

It's probably important that it's prime, not just odd, since otherwise you're not working on a field.
If that's of any use to anyone, here is a signed AHL = AHL / DE routine - I think it's pretty optimized (70 bytes) :

Code:
```sDiv2416:  xor d  push af  xor d  ld b, a  jp p, divdPos  xor a  sub l  ld l, a  ld a, 0  sbc a, h  ld h, a  ld a, 0  sbc a, b  ld b, a divdPos:  bit 7, d  jr z, divrPos  xor a  sub e  ld e, a  ld a, 0  sbc a, d  ld d, a  ld a, b divrPos:  push hl  pop ix  ld hl, 0  ld b, 24 divLoop:  add ix, ix  rla  adc hl, hl  sbc hl, de  jr nc, \$ + 4  add hl, de  ; jp nc, xxxx  ; it works because the carry can not be set and "inc ix" is 2 bytes  .db \$D2  inc ix  djnz divLoop  push ix  pop hl  ld b, a  pop af  ld a, b  ret p  xor a  sub l  ld l, a  ld a, 0  sbc a, h  ld h, a  ld a, 0  sbc a, b  ret```

Register to Join the Conversation
Have your own thoughts to add to this or any other topic? Want to ask a question, offer a suggestion, share your own programs and projects, upload a file to the file archives, get help with calculator and computer programming, or simply chat with like-minded coders and tech and calculator enthusiasts via the site-wide AJAX SAX widget? Registration for a free Cemetech account only takes a minute.

»
» Goto page Previous  1, 2, 3, 4, 5, 6, 7, 8  Next
» All times are GMT - 5 Hours

You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum

Advertisement