Don't have an account? Register now to chat, post, use our tools, and much more.
Online Users
There are 127 users online: 9 members, 94 guests and 24 bots.
Members: Charty, gbl08ma, LuxenD, matrefeytontias, mrabourn, RajaBranco, ruler501, tifreak8x.
Bots: VoilaBot (8), Spinn3r (1), MSN/Bing (1), Magpie Crawler (1), VoilaBot (8), Googlebot (5).
SAX
 » Goto page Previous  1, 2, 3, 4, 5, 6
Author Message
Xeda112358

Power User

Joined: 09 Nov 2010
Posts: 390

 Posted: 03 Jul 2012 11:10:04 am    Post subject: Here is another routine that I made yesterday for a pseudo-random number generator: Code: ``` PseudoRandWord: ;Outputs: ;     BC was the previous pseudorandom number ;     HL is the pseudorandom number ;f(n+1)=(241f(n)+257) mod 65536   ;65536 ;181 cycles, add 17 if called      ld hl,(randSeed)      ld c,l      ld b,h      add hl,hl      add hl,bc      add hl,hl      add hl,bc      add hl,hl      add hl,bc      add hl,hl      add hl,hl      add hl,hl      add hl,hl      add hl,bc      inc h      inc hl      ld (randSeed),hl      ret ``` There are a few other nice features, too. For example, every 16-bit value is hit if you run this 65536 times. Or, if you only read 1 byte (for example, H from the output), it will hit every 8-bit number once if you run this 256 times. Plus, it can be seeded
Xeda112358

Power User

Joined: 09 Nov 2010
Posts: 390

 Posted: 07 Mar 2013 05:28:50 pm    Post subject: I was optimising some algorithms and code in Grammer and I came across a pretty big speed optimisation for computing 16-bit GCD: Code: ``` GCDDE_HL: ;Inputs: ;     HL,DE are the two values ;Outputs: ;     B is 0 ;     DE is 0 ;     HL is the GCD ;     C is not changed ;     A is not changed      ld b,1      or a CheckMax:               ;      sbc hl,de          ;ED52   15n      jr z,AdjustGCD     ;28**   12n-5      jr nc,ParityCheck  ;30**   12n-5      add hl,de      or a      ex de,hl ParityCheck:            ;      bit 0,e            ;CB**   8a      jr nz,DE_Odd       ;20**   12a-5b      bit 0,l            ;CB**   8b      jr z,BothEven      ;28**   12b-5c      rr d               ;CB**   8(n-a-b-c)      rr e               ;CB**   8(n-a-b-c)      jp CheckMax        ;C3**** 10(n-a-b-c) BothEven:               ;      inc b              ;04     4c      rr d \ rr e        ;       16c HL_Even:      rr h \ rr l        ;       16c      jp CheckMax        ;       10c DE_Odd:                 ;      bit 0,l            ;       8b      jr z,HL_Even       ;       12b-5d      sbc hl,de          ;       15d      rr h \ rr l        ;       16d      jp nz,CheckMax        ;       10d AdjustGCD:              ;      ex de,hl           ;       4      dec b              ;       4      ret z              ;       11+4(k>0)      add hl,hl          ;       11k      djnz \$-1           ;       13k-5      ret                ;       -- ``` I was using a slightly more naive approach before, computing the mod of two 16-bit values over and over (the Euclidean Algorithm). This code is pretty hefty, though, at 61 bytes. You can save three bytes by turning the JP instructions into JR, but there will be a tiny speed loss of a few cycles
Xeda112358

Power User

Joined: 09 Nov 2010
Posts: 390

 Posted: 23 Oct 2013 09:41:03 am    Post subject: Since the weekend, I have been trying to really optimise my square root algorithms for application in a 24-bit and 80-bit float library. During the process, I saw some major failings in my old routines (and those were still pretty fast). The biggest problem was my huge over estimate of the bit-size of certain intermediate values. After taking that into account, I can offer a new routine that is still being rigorously optimised, but is already quite fast: Code: ``` SqrtHL5: ;input: HL ;Output: A ;63 bytes ;749 t-states worst case    ld bc,600h    ld d,c    ld e,c sqrt16loop:    add hl,hl \ rl e    add hl,hl \ rl e    rl c    ld a,c    rla    sub e    jr nc,\$+5    inc c    cpl    ld e,a    djnz sqrt16loop    sla c \ ld a,c \ add a,a    rl h \ rl e    rl h \ rl e    jr nc,\$+6    sub e \ jp \$+6    sub e    jr nc,\$+5    inc c    cpl    ld e,a        ld l,c       ld a,l    add hl,hl \ rl e \ rl d    add hl,hl \ rl e \ rl d    sbc hl,de    rla    ret ``` The biggest drawback is the size, since the last two iterations of the algorithm are handled outside the loop. However, if you really want speed, this unrolled+optimised version is pretty darn fast at 555 t-states worst case: Code: ``` ;------------------------------- ;Square Root ;Inputs: ;HL = number to be square rooted ;Outputs: ;A  = square root ;111 bytes ;555 t-states worst case SqrtHL5: ;zero some registers    xor a    ld c,a    ld d,a ;move the LSB of the input into E for later use, then shift the LSB into L and load H with 0. ;H will be a carry register, where the bits in L are rotated in    ld e,l    ld l,h    ld h,c ;Iteration 1 is optimised ; C is treated as the accumulator    add hl,hl    add hl,hl    sub h    jr nc,\$+5    inc c    cpl    ld h,a ;Iteration 2 ; rotate in 2 more bits from the MSB of the input into H    add hl,hl    add hl,hl ; shift the accumulator    rl c    ld a,c    rla ; A is now double the shifted accumulator    sub h ; doubles as a comparison of the carry register (H) to double the accumulator    jr nc,\$+5 ; If the carry is > 2*accumulator, the bit in the accumulator needs to be 1:    inc c ; We need to perform H-(2C+1), but A=2C-H. ; We could do NEG to get A=H-2C, then DEC A, but NEG = CPL \ INC A ; NEG \ DEC A  =  CPL \ INC A \ DEC A ; So just use CPL, saving 8 t-states, 1 byte    cpl    ld h,a ;Iteration 3    add hl,hl    add hl,hl    rl c    ld a,c    rla    sub h    jr nc,\$+5    inc c    cpl    ld h,a ;Iteration 4    add hl,hl    add hl,hl    rl c    ld a,c    rla    sub h    jr nc,\$+5    inc c    cpl    ld h,a ;L is 0, H is the current carry ;E is the lower 8 bits ; Load the next set of bits (LSB of input) into L so that they can be rotated into H    ld l,e ;Iteration 5    add hl,hl    add hl,hl    rl c    ld a,c    rla    sub h    jr nc,\$+5    inc c    cpl    ld h,a ;Iteration 6    add hl,hl    add hl,hl    rl c    ld a,c    rla    sub h    jr nc,\$+5    inc c    cpl    ld h,a ;Iteration 7 ; Now we need to start worrying about 8 bit overflow. ; In particular, the carry register, H should be ideally 9 bits for this iteration, 10 for the last. ; The accumulator, C, is 8 bits, but we need to compare H to 2*C, and 2*C is up to 9 bits on the last iteration. ;l has 4 more bits to rotate into h    sla c \ ld a,c \ add a,a    add hl,hl    add hl,hl    jr nc,\$+6    sub h \ jp \$+6    sub h    jr nc,\$+5    inc c    cpl    ld h,a ;Iteration 8 ; A lot of fancy stuff here ; D is 0, from way back at the beginning ; now I put H->E so that DE can hold the potentially 10-bit number ; Now C->A, L->H ; H thus has the last two bits of the input that need to be rotated into DE ; L has the value of the accumualtor which needs to be multiplied by 4 for a comparison to DE ; So 2 shifts of HL into DE results in DE holding the carry, HL holding 4*accumulated result!       ld e,h    ld h,l    ld l,c       ld a,l    add hl,hl \ rl e \ rl d    add hl,hl \ rl e \ rl d    sbc hl,de ;the c flag now has the state of the last bit of the result, HL does not need to be restored.    rla    ret ``` For comparison, the first is comparable to the more elegant/famous version I keep running into as I look for 'the best 16-bit square root routine', and the unrolled version is faster than any that I have run into yet. Feel free to optimise because I am confident that there are still enough parts of code to change to improve the speed and size.
matrefeytontias

Joined: 03 Aug 2012
Posts: 192

 Posted: 05 Nov 2013 01:00:54 pm    Post subject: Not sure where it goes, but since the 84+C thread was filled with discussion about hardware, I didn't want to go off-topic. So, here's a rather complete sprite drawing routine, with custom size (up to 256*256), custom coordinates, custom palette, no clipping, and that works with the regular TI-OS-set LCD windowing settings :Code: ```; BC:X coordinate ; DE:Y coordinate ; H:width ; L:height ; IX:sprite ; palette:label must refer to your custom up-to-256-bytes palette drawSprite:    push   de    push   bc    ld   a, l    push   af    ld   a, 20h    out   (10h), a    out   (10h), a    ld   a, d    out   (11h), a    ld   a, e    out   (11h), a    ld   a, 21h    out   (10h), a    out   (10h), a    ld   a, b    out   (11h), a    ld   a, c    out   (11h), a    ld   a, 22h    out   (10h), a    out   (10h), a        ld   a, h drawSpriteXloop:    push   af    push   hl    ld   l, (ix + 0)    inc   ix    ld   h, 0    add   hl, hl    push   de    ld   de, palette    add   hl, de    ld   a, (hl)    inc   hl    ld   h, (hl)    ld   l, a    pop   de    ld a, h    out (11h), a    ld a, l    out (11h), a    pop   hl    pop   af    dec   a    jr   nz, drawSpriteXloop        inc   de    pop   af    dec   a    jr   nz, drawSprite + 3        pop   bc    pop   de    ret``` Pretty sure it can be heavily optimized. I'm not that good at ASM._________________/-\ >< |= \_/ 5 |= |?
KermMartian

Joined: 14 Mar 2005
Posts: 57259

 Posted: 05 Nov 2013 04:49:58 pm    Post subject: That's pretty nice, matrefeytontias; good work. If you're using Doors CSE, you can also use our highly-optimized 1-bit, 2-bit, 4-bit, 4-bit with 2x enlarging, or 8-bit routines. Runer112, who helped me frame these routines and figure out how to structure the arguments, has an eventual goal of finishing his clipping versions of these routines._________________
Xeda112358

Power User

Joined: 09 Nov 2010
Posts: 390

 Posted: 14 Nov 2013 06:09:56 pm    Post subject: For my next trick, je vous presentez, logarithme naturel (in fixed point 8.8 format): Code: ``` lognat: ;Input:  H.L needs to be on [1,2] ;Output: H.L if z flag is set, else if nz, no result ;avg speed: 677 t-states    dec h       dec h    jr nz,\$+8    inc l \ dec l    ret nz    ld l,177    ret    inc h    ret nz    ld b,h    ld c,l    ld e,l    ld d,8    add hl,hl    add hl,hl    add hl,de    ex de,hl    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ adc a,a    ld h,a \ ld l,b    sla h \ jr c,\$+3 \ ld l,c    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    rl l    ld a,h    adc a,b    ld l,a    ld h,b    cp a    ret ``` It only works accurately on [1,2], where I designed the algorithm. It is pretty fast, returning the result in 677 t-states or less, but its range is fairly limited (to 257 values). I have been working on another routine that actually works on (0,128), but it takes upwards of about 1500 t-states and there are some accuracy issues I want to fix before posting. EDIT: I have the routine with a little better speed outlook and much better accuracy outside the [1,2] range than before (worst case is 2/256). Code: ``` lognat: ;Input:  H.L needs to be on (0,128.0) ;Output: H.L if c flag set ;    returns nc if input is negative (HL not modified) ;Error: ;   The error on the outputs is as follows: ;      20592 inputs are exact ;      12075 inputs are off by 1/256 ;      100 inputs are off by 2/256 ;   So all 32767 inputs are within 2/256, with average error being <1/683 which is smaller than 1/256. ;Size: 177 bytes ;Speed: average speed is less than 1250 t-states    ld a,h \ or l \ jr nz,\$+5    ld h,80h \ ret    dec h    dec h    jr nz,\$+9    inc l \ dec l    jr nz,normalizeln-1    ld l,177    ret    inc h    jr nz,normalizeln    ld b,h    ld c,l    ld e,l    ld d,8    add hl,hl    add hl,hl    add hl,de    ex de,hl    call HL_Div_DE    ld h,a \ ld l,b    sla h \ jr c,\$+3 \ ld l,c    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    add hl,hl \ jr c,\$+3 \ add hl,bc    rl l    ld a,h    adc a,b    ld h,b    ld l,a    scf    ret HL_Div_DE:    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a    add hl,hl \ sbc hl,de \ adc a,a \ ret    inc h normalizeln:    xor a    inc h \ ret m    ld d,a \ ld e,a    ld a,l    jr z,toosmall    inc e \ srl h \ rra \ jr nz,\$-4    rla \ rl h    dec e stepin:    ld l,a    push de    call lognat    pop de    ;now multiply DE by 355, then divide by 2 (rounding)    ld b,d \ ld c,e \ ld a,d    ex de,hl    add hl,hl    add hl,hl   ;4    add hl,bc   ;5    add hl,hl   ;10    add hl,bc   ;11    add hl,hl   ;22    add hl,hl    add hl,hl    add hl,hl    add hl,bc    add hl,hl    add hl,bc    sra h \ rr l    adc hl,de    scf    ret toosmall:    dec d    dec e \ add a,a \ jr nc,\$-2    inc h    jp stepin ``` EDIT2: Here are a bunch of fixed point 8.8 routines, mostly optimised for speed. B.C*D.E→H.L Code: ``` BC_Times_DE: ;  BHLA is the 32-bit result     ld a,b     or a     ld hl,0     ld b,h ;1     add a,a     jr nc,\$+4     ld h,d     ld l,e ;2     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,b ;227+10b-7p     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,b     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,b     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,b     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,b     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,b     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,b ;=== ;AHL is the result of B*DE*256     push hl     ld h,b     ld l,b     ld b,a     ld a,c     ld c,h ;1     add a,a     jr nc,\$+4     ld h,d     ld l,e ;2     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,c ;227+10b-7p     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,c     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,c     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,c     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,c     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,c     add hl,hl     rla     jr nc,\$+4     add hl,de     adc a,c     pop de ;Now BDE*256+AHL     ld c,a     ld a,l     ld l,h     ld h,c     add hl,de     ret nc     inc b ;BHLA is the 32-bit result     ret ``` D.E/B.C→D.E Code: ``` DE_Div_BC_88: ;Inputs: ;     DE,BC are 8.8 Fixed Point numbers ;Outputs: ;     DE is the 8.8 Fixed Point result (rounded to the least significant bit)      ld a,8      ld hl,0 Loop1:      rl d      adc hl,hl      sbc hl,bc      jr nc,\$+3     add hl,bc      dec a      jr nz,Loop1      ld d,e      ld e,a      ld a,16     jp \$+6 DivLoop:      add hl,bc      dec a      ret z      sla e      rl d      adc hl,hl     jr c,overflow      sbc hl,bc      jr c,DivLoop      inc e      jp DivLoop+1 overflow:     or a     sbc hl,bc     inc e     jp DivLoop ``` sqrt(H.L)→H.L Code: ``` SqrtHL_prec12: ;input: HL ;Output: HL ;183 bytes     xor a     ld b,a     ld e,l     ld l,h     ld h,a     add hl,hl     add hl,hl     cp h     jr nc,\$+5     dec h     ld a,4     add hl,hl     add hl,hl     ld c,a     sub h     jr nc,\$+6     cpl     ld h,a     inc c     inc c     ld a,c     add hl,hl     add hl,hl     add a,a     ld c,a     sub h     jr nc,\$+6     cpl     ld h,a     inc c     inc c     ld a,c     add hl,hl     add hl,hl     add a,a     ld c,a     sub h     jr nc,\$+6     cpl     ld h,a     inc c     inc c     ld a,c     ld l,e     add hl,hl     add hl,hl     add a,a     ld c,a     sub h     jr nc,\$+6     cpl     ld h,a     inc c     inc c     ld a,c     add hl,hl     add hl,hl     add a,a     ld c,a     sub h     jr nc,\$+6     cpl     ld h,a     inc c     inc c     ld a,c     add a,a \ ld c,a     add hl,hl     add hl,hl     jr nc,\$+6     sub h \ jp \$+6     sub h     jr nc,\$+6     inc c \ inc c     cpl     ld h,a     ld a,l     ld l,h     add a,a     ld h,a     adc hl,hl     adc hl,hl     sll c \ rl b     sbc hl,bc     jr nc,\$+3     add hl,bc     sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a ;iteration 9     add hl,hl \ add hl,hl     sll c \ rl b     sbc hl,bc     jr nc,\$+3     add hl,bc     sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a     add hl,hl \ add hl,hl     sll c \ rl b     sbc hl,bc     jr nc,\$+3     add hl,bc     sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a     add hl,hl \ add hl,hl     sll c \ rl b     sbc hl,bc     jr nc,\$+3     add hl,bc     sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a     add hl,hl \ add hl,hl     sll c \ rl b     sbc hl,bc     jr nc,\$+3     add hl,bc     sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a ;12th iteration completed ; output in BC     srl b \ rr c     ld h,b     ld l,c     ret ``` 2H.L→DEH.L Code: ``` pow2: ;Inputs: ;     HL is the 8.8 fixed point number 'x' for 2^x ;Outputs: ;     DEHL is the 24.8 fixed point result. If there was overflow exceeding 2^24, then this value is set to the max.      ld a,l      or a      push hl     ;save H for later, H is the integer part of the power      ld hl,1      jr z,integerpow      scf      ;set the carry flag so that a bit is rotated into a. This will act as our counter. ;wait until we come across the lowest bit. Also note that we      rra      jr nc,\$-1      ld hl,2*256 powloop:      push af      call FPSqrtHL    ;returns in HL      pop af      srl a      jr z,integerpow      jr nc,powloop      add hl,hl      jp powloop integerpow:      pop bc ;Now b is the integer part for 2^x that we need to multiply HL by.      ld de,0      ld a,b      or a      ret z      add hl,hl      rl e \ rl d \ jr c,wayoverflow      djnz \$-7      ret wayoverflow:      ld hl,-1      ld d,h      ld e,l      ret ``` log2(H.L)→H.L Code: ``` Log_2_88_size: ;Inputs: ;     HL is an unsigned 8.8 fixed point number. ;Outputs: ;     HL is the signed 8.8 fixed point value of log base 2 of the input. ;Example: ;     pass HL = 3.0, returns 1.58203125 (actual is ~1.584962501...) ;averages about 39 t-states slower than original ;62 bytes      ex de,hl      ld hl,0      ld a,d      ld c,8      or a      jr z,DE_lessthan_1      srl d      jr z,logloop-1      inc l      rr e      jr \$-7 DE_lessthan_1:      ld a,e      dec hl      or a      ret z      inc l      dec l      add a,a      jr nc,\$-2      ld e,a      inc d logloop:      add hl,hl      push hl      ld h,d      ld l,e      ld a,e      ld b,8      add hl,hl      rla      jr nc,\$+5        add hl,de        adc a,0      djnz \$-7      ld e,h      ld d,a      pop hl      rr a         ;this is NOT supposed to be rra, we need the z flag affected      jr z,\$+7        srl d        rr e        inc l      dec c      jr nz,logloop      ret ``` sin(D.E)→H.L (uses the BC_Times_DE routine above) Code: ``` ine_88: ;Inputs: de     push de     sra d \ rr e     ld b,d \ ld c,e     call BC_Times_DE     push hl     ;x^2/4     sra h \ rr l     ex de,hl     ld b,d \ ld c,e     call BC_Times_DE     sra h \ rr l     inc h     ex (sp),hl    ;x^4/128+1 is on stack, HL=x^2/4     xor a     ld d,a     ld b,h     ld c,l     add hl,hl \ rla     add hl,hl \ rla     add hl,bc \ adc a,d     ld b,h     ld c,l     add hl,hl \ rla     add hl,hl \ rla     add hl,hl \ rla     add hl,hl \ rla     add hl,bc \ adc a,d     ld e,l     ld l,h     ld h,a     rl e     adc hl,hl     rl e     jr nc,\$+3     inc hl     pop de     ex hl,de     or a     sbc hl,de     ex de,hl     pop bc     jp BC_Times_DE ``` EDIT3: If you are willing to use a 256 byte LUT, this is a much, much faster fixed point lg() routine that I thought of on my walk home: Code: ``` lg_88: ;Input: HL is a fixed point number ;Output: lg(H.L)->H.L ;Speed: Avg: 340    ld de,lgLUT    ld b,0    ld a,h    or a    ret m    ld a,l    jr z,\$+8    inc b \ srl h \ rra \ jr nz,\$-4    or a \ jr nz,\$+6    ld hl,8000h \ ret    rra \ inc b \ jr nc,\$-2 ;A is the element to look up in the LUT    ld l,a     ld c,h     dec b    add hl,hl    add hl,de    ld e,(hl)    inc hl    ld d,(hl)     ex de,hl    add hl,bc    ret lglut: .dw \$F800 .dw \$F996 .dw \$FA52 .dw \$FACF .dw \$FB2C .dw \$FB76 .dw \$FBB3 .dw \$FBE8 .dw \$FC16 .dw \$FC3F .dw \$FC64 .dw \$FC86 .dw \$FCA5 .dw \$FCC1 .dw \$FCDC .dw \$FCF4 .dw \$FD0B .dw \$FD21 .dw \$FD36 .dw \$FD49 .dw \$FD5C .dw \$FD6D .dw \$FD7E .dw \$FD8E .dw \$FD9D .dw \$FDAC .dw \$FDBA .dw \$FDC8 .dw \$FDD5 .dw \$FDE2 .dw \$FDEE .dw \$FDFA .dw \$FE06 .dw \$FE11 .dw \$FE1C .dw \$FE26 .dw \$FE31 .dw \$FE3B .dw \$FE44 .dw \$FE4E .dw \$FE57 .dw \$FE60 .dw \$FE69 .dw \$FE71 .dw \$FE7A .dw \$FE82 .dw \$FE8A .dw \$FE92 .dw \$FE9A .dw \$FEA1 .dw \$FEA9 .dw \$FEB0 .dw \$FEB7 .dw \$FEBE .dw \$FEC5 .dw \$FECB .dw \$FED2 .dw \$FED8 .dw \$FEDF .dw \$FEE5 .dw \$FEEB .dw \$FEF1 .dw \$FEF7 .dw \$FEFD .dw \$FF03 .dw \$FF09 .dw \$FF0E .dw \$FF14 .dw \$FF19 .dw \$FF1E .dw \$FF24 .dw \$FF29 .dw \$FF2E .dw \$FF33 .dw \$FF38 .dw \$FF3D .dw \$FF42 .dw \$FF47 .dw \$FF4B .dw \$FF50 .dw \$FF55 .dw \$FF59 .dw \$FF5E .dw \$FF62 .dw \$FF67 .dw \$FF6B .dw \$FF6F .dw \$FF74 .dw \$FF78 .dw \$FF7C .dw \$FF80 .dw \$FF84 .dw \$FF88 .dw \$FF8C .dw \$FF90 .dw \$FF94 .dw \$FF98 .dw \$FF9B .dw \$FF9F .dw \$FFA3 .dw \$FFA7 .dw \$FFAA .dw \$FFAE .dw \$FFB2 .dw \$FFB5 .dw \$FFB9 .dw \$FFBC .dw \$FFC0 .dw \$FFC3 .dw \$FFC6 .dw \$FFCA .dw \$FFCD .dw \$FFD0 .dw \$FFD4 .dw \$FFD7 .dw \$FFDA .dw \$FFDD .dw \$FFE0 .dw \$FFE4 .dw \$FFE7 .dw \$FFEA .dw \$FFED .dw \$FFF0 .dw \$FFF3 .dw \$FFF6 .dw \$FFF9 .dw \$FFFC .dw \$FFFF ``` It averages about 340 t-states and should be accurate to every bit. Being this fast, it might be better to combine this with the very fast multiplication routine above to get log base x Or for example, you can incorporate a natural log routine by simply multiplying lg(x) by 1/lg(e) (which is approximately 355/512): Code: ``` ln_88: ;Input: HL is a fixed point number ;Output: ln(H.L)->H.L ;Speed: Avg: 340+(325 worst case)    call lg_88    ;now signed multiply HL by 355, then divide by 2 (rounding)     ld de,0     bit 7,h     jr z,\$+9     dec e \ xor a \ sub l \ ld l,a     sbc a,a \ sub h \ ld h,a     ld b,h     ld c,l     xor a    add hl,hl       add hl,hl \ rla    add hl,bc \ adc a,d    add hl,hl \ rla    add hl,bc \ adc a,d    add hl,hl \ rla    add hl,hl \ rla    add hl,hl \ rla    add hl,hl \ rla    add hl,bc \ adc a,d    add hl,hl \ rla    add hl,bc \ adc a,d     sra a \ rr h     ld l,h     ld h,a     inc e     ret nz     xor a \ sub l \ ld l,a     sbc a,a \ sub h \ ld h,a     ret ``` The overhead being at most 325 t-states, bringing it to an average of 665 t-states for ln() on (0,128)
Xeda112358

Power User

Joined: 09 Nov 2010
Posts: 390

 Posted: 15 Nov 2013 10:38:59 am    Post subject: To add to the table-driven routines, this is a much faster arctan routine, using a 158-byte table (it is <800 t-states on values >1, and about 350 for values on [0,1]) It is also accurate in every bit for most values, but I imagine there are a few that are off in the last bit: Code: ``` arctan_88: ;Input: ;   D.E ;Output: atan(D.E)->D.E    push de    ld a,d    or a    jp p,\$+5    neg    ld d,a    dec a    jr nz,checkneedinv    inc e \ dec e \ jr nz,checkneedinv    pop af \ rla \ ld de,201 \ ret nc \ ld de,-201 \ ret checkneedinv:    inc a    call nz,DEgt1_Inv ;0.E is the value to atan    ld hl,adjustatan    push hl    ld a,e    cp 46 \ ret c    dec a \ cp 42h \ ret c    dec a \ cp 4Eh \ ret c    dec a \ cp 57h \ ret c    dec a \ cp 5Eh \ ret c    dec a \ cp 64h \ ret c    dec a \ cp 6Ah \ ret c    dec a \ cp 6Fh \ ret c    sub 6Fh \ ld e,a    ld hl,atanlut    add hl,de    ld a,(hl)    ret adjustatan:    ld e,a    pop bc    ld a,b    or a    jp p,\$+5    neg    jr z,\$+9    ld hl,402    or a    sbc hl,de    ex de,hl    rl b    ret nc    xor a    sub e    ld e,a    sbc a,a    sub d    ld d,a    ret DEgt1_Inv: ;Works if DE>1    ld hl,256    ld b,8 InvLoop:    add hl,hl    sbc hl,de    jr nc,\$+3    add hl,de    adc a,a    djnz InvLoop     cpl    ld e,a     ld d,b     ret atanlut: .db \$6F .db \$6F .db \$70 .db \$71 .db \$72 .db \$73 .db \$73 .db \$74 .db \$75 .db \$76 .db \$77 .db \$77 .db \$78 .db \$79 .db \$7A .db \$7B .db \$7B .db \$7C .db \$7D .db \$7E .db \$7F .db \$7F .db \$80 .db \$81 .db \$82 .db \$82 .db \$83 .db \$84 .db \$85 .db \$85 .db \$86 .db \$87 .db \$88 .db \$88 .db \$89 .db \$8A .db \$8B .db \$8B .db \$8C .db \$8D .db \$8E .db \$8E .db \$8F .db \$90 .db \$90 .db \$91 .db \$92 .db \$93 .db \$93 .db \$94 .db \$95 .db \$95 .db \$96 .db \$97 .db \$97 .db \$98 .db \$99 .db \$9A .db \$9A .db \$9B .db \$9C .db \$9C .db \$9D .db \$9E .db \$9E .db \$9F .db \$A0 .db \$A0 .db \$A1 .db \$A2 .db \$A2 .db \$A3 .db \$A3 .db \$A4 .db \$A5 .db \$A5 .db \$A6 .db \$A7 .db \$A7 .db \$A8 .db \$A9 .db \$A9 .db \$AA .db \$AA .db \$AB .db \$AC .db \$AC .db \$AD .db \$AD .db \$AE .db \$AF .db \$AF .db \$B0 .db \$B0 .db \$B1 .db \$B2 .db \$B2 .db \$B3 .db \$B3 .db \$B4 .db \$B5 .db \$B5 .db \$B6 .db \$B6 .db \$B7 .db \$B7 .db \$B8 .db \$B9 .db \$B9 .db \$BA .db \$BA .db \$BB .db \$BB .db \$BC .db \$BC .db \$BD .db \$BE .db \$BE .db \$BF .db \$BF .db \$C0 .db \$C0 .db \$C1 .db \$C1 .db \$C2 .db \$C2 .db \$C3 .db \$C3 .db \$C4 .db \$C4 .db \$C5 .db \$C6 .db \$C6 .db \$C7 .db \$C7 .db \$C8 .db \$C8 .db \$C9 ``` It is looking like table methods are the best option for 8.8 fixed point, especially if you can reduce the range of needed values to be on [0,1] (see the above post, last edit for very fast ln() and lg() using a table) Similarly, sine and cosine can be generated from a single 201 byte array.
Xeda112358

Power User

Joined: 09 Nov 2010
Posts: 390

 Posted: 19 Nov 2013 12:21:54 pm    Post subject: I wrote a pseudo-random number generator with period 4292870399 today that I was hoping would be comparable to the version used by the OS (and it appears to be so). It takes approximately 1592 t-states making it much faster than using the OS bcalls and generates a 3-byte random integer: Code: ``` Rand24: ;Inputs: seed1,seed2 ;Outputs: ;   AHL is the pseudo-random number ;   seed1,seed2 incremented accordingly ;Destroys: BC,DE ;Notes: ; seed1*243+83 mod 65519 -> seed1 ; seed2*251+43 mod 65521 -> seed2 ; Output seed1*seed2    ld hl,(seed1)    xor a    ld b,h \ ld c,l    ld de,83    add hl,hl \ rla      ;2    add hl,bc \ adc a,d   ;3    add hl,hl \ rla      ;6    add hl,bc \ adc a,d   ;7    add hl,hl \ rla      ;14    add hl,bc \ adc a,d   ;15    add hl,hl \ rla      ;30    add hl,hl \ rla      ;60    add hl,hl \ rla      ;120    add hl,bc \ adc a,d   ;121    add hl,hl \ rla      ;242    add hl,bc \ adc a,d   ;243    add hl,de \ adc a,d   ;243x+83 ;now AHL mod 65519 ; Essentially, we can at least subtract A*65519=A(65536-17), so add A*17 to HL    ex de,hl    ld l,a    ld b,h    ld c,l    add hl,hl    add hl,hl    add hl,hl    add hl,hl    add hl,bc    add hl,de    ld de,65519    jr nc,\$+5    or a \ sbc hl,de    or a \ sbc hl,de    jr nc,\$+3    add hl,de    ld (seed1),hl ;seed1 done, now we need to do seed2    ld hl,(seed2) ; seed1*243+83 mod 65519 -> seed1 ; seed2*251+43 mod 65521 -> seed2 ; Output seed1*seed2    xor a    ld b,h \ ld c,l    ld de,43    add hl,hl \ rla      ;2    add hl,bc \ adc a,d   ;3    add hl,hl \ rla      ;6    add hl,bc \ adc a,d   ;7    add hl,hl \ rla      ;14    add hl,bc \ adc a,d   ;15    add hl,hl \ rla      ;30    add hl,bc \ adc a,d   ;31    add hl,hl \ rla      ;62    add hl,hl \ rla      ;124    add hl,bc \ adc a,d   ;125    add hl,hl \ rla      ;250    add hl,bc \ adc a,d   ;251    add hl,de \ adc a,d   ;251x+83 ;now AHL mod 65521 ; Essentially, we can at least subtract A*65521=A(65536-15), so add A*15 to HL    ex de,hl    ld l,a    ld b,h    ld c,l    add hl,hl    add hl,hl    add hl,hl    add hl,hl    sbc hl,bc    add hl,de    ld de,65521    jr nc,\$+5    or a \ sbc hl,de    or a \ sbc hl,de    jr nc,\$+3    add hl,de    ld (seed2),hl ;now seed1 and seed 2 are computed    ld bc,(seed1)    ex de,hl ;now do BC_times_DE    call BC_Times_DE       ex de,hl    ld l,b    ld h,0    ld b,h    ld c,l    add hl,hl    add hl,hl    add hl,bc    add hl,hl    add hl,hl    add hl,bc    add hl,hl    add hl,bc    ld c,d    ld d,e    ld e,a    ld a,c    sbc hl,bc    sbc a,b    ret nc    ld c,43    add hl,bc    ret BC_Times_DE: ;BHLA is the result    ld a,b    or a    ld hl,0    ld b,h ;1    add a,a    jr nc,\$+4    ld h,d    ld l,e ;2    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,b ;227+10b-7p    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,b    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,b    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,b    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,b    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,b    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,b ;=== ;AHL is the result of B*DE*256    push hl    ld h,b    ld l,b    ld b,a    ld a,c    ld c,h ;1    add a,a    jr nc,\$+4    ld h,d    ld l,e ;2    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,c ;227+10b-7p    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,c    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,c    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,c    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,c    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,c    add hl,hl    rla    jr nc,\$+4    add hl,de    adc a,c    pop de ;Now BDE*256+AHL    ld c,a    ld a,l    ld l,h    ld h,c    add hl,de    ret nc    inc b ;BHLA is the 32-bit result    ret seed1:    .dw 0 seed2:    .dw 0 ``` Naturally, I plan to use this in my math libraries. To quickly describe the algorithm, I use 2 LCGs with relatively prime period lengths, generate the next terms for each of them, and multiply the results for the output (so if you really wanted to, you could get 31 bits out of BHLA, but the upper bit shouldn't be very reliable). EDIT: Modified to perform a final mod 16777259 to provide better results.Last edited by Xeda112358 on 19 Nov 2013 10:36:01 pm; edited 2 times in total
elfprince13

OVER NINE THOUSAND!

Joined: 23 May 2005
Posts: 10552
Location: A galaxy far far away......

 Posted: 19 Nov 2013 12:34:21 pm    Post subject: What does your distribution look like? Seems as though you'll only be generating random composites._________________StickFigure Graphic Productions || VSHI: Vermont Sustainable Heating Initiative
Xeda112358

Power User

Joined: 09 Nov 2010
Posts: 390

 Posted: 19 Nov 2013 12:39:17 pm    Post subject: Yes, that is what would be happening if you used the full 32-bit output of the multiplication in BHLA. That is one reason for why I suggested using bits from the lower 24 bits. Using bits 8 to 23 (in HL), I have generated some primes, such as 9767, just now.
merthsoft

File Archiver

Joined: 09 May 2010
Posts: 3031

 Posted: 19 Nov 2013 12:45:45 pm    Post subject: Can you run it a bunch of times and show us the results? Maybe a nice pretty graph!_________________Shaun
elfprince13

OVER NINE THOUSAND!

Joined: 23 May 2005
Posts: 10552
Location: A galaxy far far away......

 Posted: 19 Nov 2013 02:28:49 pm    Post subject: Yeah, a histogram would be great _________________StickFigure Graphic Productions || VSHI: Vermont Sustainable Heating Initiative
merthsoft

File Archiver

Joined: 09 May 2010
Posts: 3031

 Posted: 19 Nov 2013 04:29:17 pm    Post subject: I made a quick graph: I know nothing is labeled. I made it so the range was 0-499. X axis is the value returned by the random numbers, Y axis is the number of that number returned. 50000 numbers were generated. As a comparison, here's the same with .net's random: _________________ShaunLast edited by merthsoft on 19 Nov 2013 04:41:47 pm; edited 1 time in total
Xeda112358

Power User

Joined: 09 Nov 2010
Posts: 390

 Posted: 19 Nov 2013 04:34:42 pm    Post subject: Yeah, I noticed some problems, too. Instead of just taking the lower 24 bits, it seems that it is better to mod the final result by an odd number (I tried the prime 16777259 and it yields better results). Currently, it is about 3 times more likely to return an even number compared to an odd number, but moding the final result by an odd number balances the output. It helps that large mods are relatively easy to perform.
merthsoft

File Archiver

Joined: 09 May 2010
Posts: 3031

 Posted: 19 Nov 2013 04:36:33 pm    Post subject: Here it is with that: _________________Shaun
KermMartian

Joined: 14 Mar 2005
Posts: 57259

 Posted: 19 Nov 2013 04:43:00 pm    Post subject: Much better! Although since division is so expensive, what if you just drop the lowest 3 bits of that original output? What happens then?_________________
merthsoft

File Archiver

Joined: 09 May 2010
Posts: 3031

 Posted: 19 Nov 2013 04:49:46 pm    Post subject: Wouldn't that make it so the number 1 is never returned?_________________Shaun
Xeda112358

Power User

Joined: 09 Nov 2010
Posts: 390

 Posted: 19 Nov 2013 10:32:39 pm    Post subject: Actually, there would be no division in the sense you may be thinking about. The routine that I posted does two non-trivial modulos on 24-bit numbers and a 16x16->32 multiplication all in under 1500 t-states. Since it is constants we are using, there are some tricks that can be employed. For example, to do AHL mod 65529, I noted that 65529=65536-17, so if I subtracted A*65536, I would need to add A*17 which is much easier to do and I would only need to add that to the HL register. What is more, if the addition sets the carry flag, I can just subtract 65529 or add 17 to HL (which reminds me that I posted that unoptimised version that subtracts). Then you just add 17 again, if that sets the carry flag, it means HL was >=65529, so the addition was good, otherwise, subtract the 17. This costs about 162 more cycles to do modulo 16777259 on BHLA and I will soon edit the post to reflect the new change.
elfprince13

OVER NINE THOUSAND!

Joined: 23 May 2005
Posts: 10552
Location: A galaxy far far away......

Posted: 19 Nov 2013 10:41:15 pm    Post subject:

 Xeda112358 wrote: Yeah, I noticed some problems, too. Instead of just taking the lower 24 bits, it seems that it is better to mod the final result by an odd number (I tried the prime 16777259 and it yields better results).

It's probably important that it's prime, not just odd, since otherwise you're not working on a field.
_________________
StickFigure Graphic Productions || VSHI: Vermont Sustainable Heating Initiative

 Display posts from previous: All Posts1 Day7 Days2 Weeks1 Month3 Months6 Months1 Year Oldest FirstNewest First
Register to Join the Conversation
Have your own thoughts to add to this or any other topic? Want to ask a question, offer a suggestion, share your own programs and projects, upload a file to the file archives, get help with calculator and computer programming, or simply chat with like-minded coders and tech and calculator enthusiasts via the site-wide AJAX SAX widget? Registration for a free Cemetech account only takes a minute.

»
 » Goto page Previous  1, 2, 3, 4, 5, 6 Page 6 of 6 » All times are GMT - 5 Hours

 Jump to: Select a forum Announcements----------------Cemetech AnnouncementsCemetech LabsContests General Discussion----------------Technology & Calculator Open Topic-- Electronics/Hardware Development-- Calculator HardwareYour ProjectsGeneral Open Topic-- Politics & Rants-- Reviews-- Humour/Jokes Cemetech's Projects----------------Doors CS and Doors CSESourceCoderMember Features-- TI-Freak8x-- Merthsoft-- GrammerOther & Upcoming Projects-- The BASIC Elite-- mobileTunes-- CALCnet 2.2-- SuggestionsProgramming the TI-83 Plus/TI-84 PlusUsing the TI-83 Plus/TI-84 PlusWebsite Programming & Design----------------General Programmingz80 AssemblyTI-BASICCasio Prizm/FX Development & ProgrammingWebsite Markup & ScriptingGraphics1337 Programming Tips Building with Blocks----------------FreeBuild, LEGO, and Minecraft-- FreeBuild General-- Suggestions & Troubleshooting-- Content & Gallery-- Servers & Activities
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum

© Copyright 2000-2013 Cemetech & Kerm Martian :: Page Execution Time: 0.063314 seconds.