Don't have an account? Register now to chat, post, use our tools, and much more.
» Goto page Previous  1, 2, 3, 4, 5, 6, 7, 8  Next
Here is another routine that I made yesterday for a pseudo-random number generator:

Code:

PseudoRandWord:
;Outputs:
;     BC was the previous pseudorandom number
;     HL is the pseudorandom number
;f(n+1)=(241f(n)+257) mod 65536   ;65536
;181 cycles, add 17 if called
ld hl,(randSeed)
ld c,l
ld b,h
inc h
inc hl
ld (randSeed),hl
ret

There are a few other nice features, too. For example, every 16-bit value is hit if you run this 65536 times. Or, if you only read 1 byte (for example, H from the output), it will hit every 8-bit number once if you run this 256 times. Plus, it can be seeded
EDIT: I found a bug and I managed to optimize out an average of roughly 600 cycles (I got that number from 500 simulations of randomized inputs and multiplying the eZ80 timings by 6. It might be faster.) The following code has been replaced.

I was optimising some algorithms and code in Grammer and I came across a pretty big speed optimisation for computing 16-bit GCD:

Code:

GCD16:
;;Inputs: HL,DE
;;Output: HL
;;        BC=0
;;        DE=0
;;        A=0
;Binary GCD algorithm
;80 bytes
xor a
ld b,a
ld c,a
sbc hl,bc
ex de,hl
ret z
sbc hl,bc
ex de,hl
ret z

;factor out cofactor powers of two
ld a,l \ or e \ rra \ jr nc,\$+16
srl h \ rr l
srl d \ rr e
inc b
ld a,l \ or e \ rra \ jr nc,\$-12
.loop:
;factor out powers of 2 from hl
ld a,l \ rra \ ld a,h \ jr c,\$+10
rra \ rr l \ bit 0,l \ jr z,\$-5 \ ld h,a
;factor out powers of 2 from de
ld a,e \ rra \ ld a,d \ jr c,\$+10
rra \ rr e \ bit 0,e \ jr z,\$-5 \ ld d,a

xor a
sbc hl,de
jr z,finish
jp nc,loop
or a
ex de,hl
sbc hl,de
jp loop
.finish:
ex de,hl
dec b
inc b
ret z
add hl,hl \ djnz \$-1 \ ret

I was using a slightly more naive approach before, computing the mod of two 16-bit values over and over (the Euclidean Algorithm). This code is pretty hefty, though, at 80 bytes. You can save two bytes by turning the JP instructions into JR, but there will be a tiny speed loss of a few cycles
Since the weekend, I have been trying to really optimise my square root algorithms for application in a 24-bit and 80-bit float library. During the process, I saw some major failings in my old routines (and those were still pretty fast). The biggest problem was my huge over estimate of the bit-size of certain intermediate values. After taking that into account, I can offer a new routine that is still being rigorously optimised, but is already quite fast:

Code:

SqrtHL5:
;input: HL
;Output: A
;63 bytes
;749 t-states worst case
ld bc,600h
ld d,c
ld e,c
sqrt16loop:
rl c
ld a,c
rla
sub e
jr nc,\$+5
inc c
cpl
ld e,a
djnz sqrt16loop

sla c \ ld a,c \ add a,a
rl h \ rl e
rl h \ rl e
jr nc,\$+6
sub e \ jp \$+6
sub e
jr nc,\$+5
inc c
cpl
ld e,a

ld l,c
ld a,l
add hl,hl \ rl e \ rl d
add hl,hl \ rl e \ rl d
sbc hl,de
rla
ret

The biggest drawback is the size, since the last two iterations of the algorithm are handled outside the loop.
However, if you really want speed, this unrolled+optimised version is pretty darn fast at 555 t-states worst case:

Code:

;-------------------------------
;Square Root
;Inputs:
;HL = number to be square rooted
;Outputs:
;A  = square root
;111 bytes
;555 t-states worst case
SqrtHL5:
;zero some registers
xor a
ld c,a
ld d,a

;move the LSB of the input into E for later use, then shift the LSB into L and load H with 0.
;H will be a carry register, where the bits in L are rotated in
ld e,l
ld l,h
ld h,c

;Iteration 1 is optimised
; C is treated as the accumulator
sub h
jr nc,\$+5
inc c
cpl
ld h,a

;Iteration 2
; rotate in 2 more bits from the MSB of the input into H
; shift the accumulator
rl c
ld a,c
rla
; A is now double the shifted accumulator
sub h
; doubles as a comparison of the carry register (H) to double the accumulator
jr nc,\$+5
; If the carry is > 2*accumulator, the bit in the accumulator needs to be 1:
inc c
; We need to perform H-(2C+1), but A=2C-H.
; We could do NEG to get A=H-2C, then DEC A, but NEG = CPL \ INC A
; NEG \ DEC A  =  CPL \ INC A \ DEC A
; So just use CPL, saving 8 t-states, 1 byte
cpl
ld h,a

;Iteration 3
rl c
ld a,c
rla
sub h
jr nc,\$+5
inc c
cpl
ld h,a

;Iteration 4
rl c
ld a,c
rla
sub h
jr nc,\$+5
inc c
cpl
ld h,a

;L is 0, H is the current carry
;E is the lower 8 bits
; Load the next set of bits (LSB of input) into L so that they can be rotated into H
ld l,e

;Iteration 5
rl c
ld a,c
rla
sub h
jr nc,\$+5
inc c
cpl
ld h,a

;Iteration 6
rl c
ld a,c
rla
sub h
jr nc,\$+5
inc c
cpl
ld h,a

;Iteration 7
; Now we need to start worrying about 8 bit overflow.
; In particular, the carry register, H should be ideally 9 bits for this iteration, 10 for the last.
; The accumulator, C, is 8 bits, but we need to compare H to 2*C, and 2*C is up to 9 bits on the last iteration.
;l has 4 more bits to rotate into h

sla c \ ld a,c \ add a,a
jr nc,\$+6
sub h \ jp \$+6
sub h
jr nc,\$+5
inc c
cpl
ld h,a

;Iteration 8
; A lot of fancy stuff here
; D is 0, from way back at the beginning
; now I put H->E so that DE can hold the potentially 10-bit number
; Now C->A, L->H
; H thus has the last two bits of the input that need to be rotated into DE
; L has the value of the accumualtor which needs to be multiplied by 4 for a comparison to DE
; So 2 shifts of HL into DE results in DE holding the carry, HL holding 4*accumulated result!
ld e,h
ld h,l
ld l,c
ld a,l
add hl,hl \ rl e \ rl d
add hl,hl \ rl e \ rl d
sbc hl,de
;the c flag now has the state of the last bit of the result, HL does not need to be restored.
rla
ret

For comparison, the first is comparable to the more elegant/famous version I keep running into as I look for 'the best 16-bit square root routine', and the unrolled version is faster than any that I have run into yet.

Feel free to optimise because I am confident that there are still enough parts of code to change to improve the speed and size.
Not sure where it goes, but since the 84+C thread was filled with discussion about hardware, I didn't want to go off-topic.

So, here's a rather complete sprite drawing routine, with custom size (up to 256*256), custom coordinates, custom palette, no clipping, and that works with the regular TI-OS-set LCD windowing settings :
Code:
; BC:X coordinate
; DE:Y coordinate
; H:width
; L:height
; IX:sprite
; palette:label must refer to your custom up-to-256-bytes palette
drawSprite:
push   de
push   bc
ld   a, l
push   af

ld   a, 20h
out   (10h), a
out   (10h), a
ld   a, d
out   (11h), a
ld   a, e
out   (11h), a
ld   a, 21h
out   (10h), a
out   (10h), a
ld   a, b
out   (11h), a
ld   a, c
out   (11h), a
ld   a, 22h
out   (10h), a
out   (10h), a

ld   a, h
drawSpriteXloop:
push   af
push   hl
ld   l, (ix + 0)
inc   ix
ld   h, 0
push   de
ld   de, palette
ld   a, (hl)
inc   hl
ld   h, (hl)
ld   l, a
pop   de
ld a, h
out (11h), a
ld a, l
out (11h), a
pop   hl
pop   af
dec   a
jr   nz, drawSpriteXloop

inc   de
pop   af
dec   a
jr   nz, drawSprite + 3

pop   bc
pop   de
ret

Pretty sure it can be heavily optimized. I'm not that good at ASM.
That's pretty nice, matrefeytontias; good work. If you're using Doors CSE, you can also use our highly-optimized 1-bit, 2-bit, 4-bit, 4-bit with 2x enlarging, or 8-bit routines. Runer112, who helped me frame these routines and figure out how to structure the arguments, has an eventual goal of finishing his clipping versions of these routines.
For my next trick, je vous presentez, logarithme naturel (in fixed point 8.8 format):

Code:

lognat:
;Input:  H.L needs to be on [1,2]
;Output: H.L if z flag is set, else if nz, no result
;avg speed: 677 t-states
dec h
dec h
jr nz,\$+8
inc l \ dec l
ret nz
ld l,177
ret
inc h
ret nz
ld b,h
ld c,l
ld e,l
ld d,8
ex de,hl

ld h,a \ ld l,b
sla h \ jr c,\$+3 \ ld l,c
rl l
ld a,h
ld l,a
ld h,b
cp a
ret

It only works accurately on [1,2], where I designed the algorithm. It is pretty fast, returning the result in 677 t-states or less, but its range is fairly limited (to 257 values).

I have been working on another routine that actually works on (0,128), but it takes upwards of about 1500 t-states and there are some accuracy issues I want to fix before posting.
EDIT:
I have the routine with a little better speed outlook and much better accuracy outside the [1,2] range than before (worst case is 2/256).

Code:

lognat:
;Input:  H.L needs to be on (0,128.0)
;Output: H.L if c flag set
;    returns nc if input is negative (HL not modified)
;Error:
;   The error on the outputs is as follows:
;      20592 inputs are exact
;      12075 inputs are off by 1/256
;      100 inputs are off by 2/256
;   So all 32767 inputs are within 2/256, with average error being <1/683 which is smaller than 1/256.
;Size: 177 bytes
;Speed: average speed is less than 1250 t-states

ld a,h \ or l \ jr nz,\$+5
ld h,80h \ ret
dec h
dec h
jr nz,\$+9
inc l \ dec l
jr nz,normalizeln-1
ld l,177
ret
inc h
jr nz,normalizeln
ld b,h
ld c,l
ld e,l
ld d,8
ex de,hl
call HL_Div_DE
ld h,a \ ld l,b
sla h \ jr c,\$+3 \ ld l,c
rl l
ld a,h
ld h,b
ld l,a
scf
ret
HL_Div_DE:

inc h
normalizeln:
xor a
inc h \ ret m
ld d,a \ ld e,a
ld a,l
jr z,toosmall
inc e \ srl h \ rra \ jr nz,\$-4
rla \ rl h
dec e
stepin:
ld l,a
push de
call lognat
pop de
;now multiply DE by 355, then divide by 2 (rounding)
ld b,d \ ld c,e \ ld a,d
ex de,hl
sra h \ rr l
scf
ret
toosmall:
dec d
dec e \ add a,a \ jr nc,\$-2
inc h
jp stepin

EDIT2:
Here are a bunch of fixed point 8.8 routines, mostly optimised for speed.

B.C*D.E→H.L

Code:

BC_Times_DE:
;  BHLA is the 32-bit result
ld a,b
or a
ld hl,0
ld b,h
;1
jr nc,\$+4
ld h,d
ld l,e
;2
rla
jr nc,\$+4
;227+10b-7p
rla
jr nc,\$+4

rla
jr nc,\$+4

rla
jr nc,\$+4

rla
jr nc,\$+4

rla
jr nc,\$+4

rla
jr nc,\$+4

;===
;AHL is the result of B*DE*256
push hl
ld h,b
ld l,b
ld b,a
ld a,c
ld c,h
;1
jr nc,\$+4
ld h,d
ld l,e
;2
rla
jr nc,\$+4
;227+10b-7p
rla
jr nc,\$+4

rla
jr nc,\$+4

rla
jr nc,\$+4

rla
jr nc,\$+4

rla
jr nc,\$+4

rla
jr nc,\$+4

pop de
;Now BDE*256+AHL
ld c,a
ld a,l
ld l,h
ld h,c
ret nc
inc b
;BHLA is the 32-bit result
ret

D.E/B.C→D.E

Code:

DE_Div_BC_88:
;Inputs:
;     DE,BC are 8.8 Fixed Point numbers
;Outputs:
;     DE is the 8.8 Fixed Point result (rounded to the least significant bit)
ld a,8
ld hl,0
Loop1:
rl d
sbc hl,bc
jr nc,\$+3
dec a
jr nz,Loop1

ld d,e
ld e,a
ld a,16
jp \$+6
DivLoop:
dec a
ret z

sla e
rl d
jr c,overflow
sbc hl,bc
jr c,DivLoop
inc e
jp DivLoop+1
overflow:
or a
sbc hl,bc
inc e
jp DivLoop

sqrt(H.L)→H.L

Code:

SqrtHL_prec12:
;input: HL
;Output: HL
;183 bytes
xor a
ld b,a

ld e,l
ld l,h
ld h,a

cp h
jr nc,\$+5
dec h
ld a,4

ld c,a
sub h
jr nc,\$+6
cpl
ld h,a
inc c
inc c

ld a,c
ld c,a
sub h
jr nc,\$+6
cpl
ld h,a
inc c
inc c

ld a,c
ld c,a
sub h
jr nc,\$+6
cpl
ld h,a
inc c
inc c

ld a,c
ld l,e

ld c,a
sub h
jr nc,\$+6
cpl
ld h,a
inc c
inc c

ld a,c
ld c,a
sub h
jr nc,\$+6
cpl
ld h,a
inc c
inc c

ld a,c
jr nc,\$+6
sub h \ jp \$+6
sub h
jr nc,\$+6
inc c \ inc c
cpl
ld h,a

ld a,l
ld l,h
ld h,a
sll c \ rl b
sbc hl,bc
jr nc,\$+3
sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a

;iteration 9
sll c \ rl b
sbc hl,bc
jr nc,\$+3
sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a

sll c \ rl b
sbc hl,bc
jr nc,\$+3
sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a

sll c \ rl b
sbc hl,bc
jr nc,\$+3
sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a

sll c \ rl b
sbc hl,bc
jr nc,\$+3
sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a
;12th iteration completed
; output in BC
srl b \ rr c
ld h,b
ld l,c
ret

2H.L→DEH.L

Code:

pow2:
;Inputs:
;     HL is the 8.8 fixed point number 'x' for 2^x
;Outputs:
;     DEHL is the 24.8 fixed point result. If there was overflow exceeding 2^24, then this value is set to the max.
ld a,l
or a
push hl     ;save H for later, H is the integer part of the power
ld hl,1
jr z,integerpow
scf      ;set the carry flag so that a bit is rotated into a. This will act as our counter.
;wait until we come across the lowest bit. Also note that we
rra
jr nc,\$-1
ld hl,2*256
powloop:
push af
call FPSqrtHL    ;returns in HL
pop af
srl a
jr z,integerpow
jr nc,powloop
jp powloop
integerpow:
pop bc
;Now b is the integer part for 2^x that we need to multiply HL by.
ld de,0
ld a,b
or a
ret z

rl e \ rl d \ jr c,wayoverflow
djnz \$-7
ret
wayoverflow:
ld hl,-1
ld d,h
ld e,l
ret

log2(H.L)→H.L

Code:

Log_2_88_size:
;Inputs:
;     HL is an unsigned 8.8 fixed point number.
;Outputs:
;     HL is the signed 8.8 fixed point value of log base 2 of the input.
;Example:
;     pass HL = 3.0, returns 1.58203125 (actual is ~1.584962501...)
;averages about 39 t-states slower than original
;62 bytes
ex de,hl
ld hl,0
ld a,d
ld c,8
or a
jr z,DE_lessthan_1
srl d
jr z,logloop-1
inc l
rr e
jr \$-7
DE_lessthan_1:
ld a,e
dec hl
or a
ret z
inc l
dec l
jr nc,\$-2
ld e,a

inc d
logloop:
push hl
ld h,d
ld l,e
ld a,e
ld b,8

rla
jr nc,\$+5
djnz \$-7

ld e,h
ld d,a
pop hl
rr a         ;this is NOT supposed to be rra, we need the z flag affected
jr z,\$+7
srl d
rr e
inc l
dec c
jr nz,logloop
ret

sin(D.E)→H.L (uses the BC_Times_DE routine above)

Code:

ine_88:
;Inputs: de
push de
sra d \ rr e
ld b,d \ ld c,e
call BC_Times_DE
push hl     ;x^2/4
sra h \ rr l
ex de,hl
ld b,d \ ld c,e
call BC_Times_DE
sra h \ rr l
inc h
ex (sp),hl    ;x^4/128+1 is on stack, HL=x^2/4
xor a
ld d,a
ld b,h
ld c,l
ld b,h
ld c,l
ld e,l
ld l,h
ld h,a
rl e
rl e
jr nc,\$+3
inc hl

pop de
ex hl,de
or a
sbc hl,de
ex de,hl
pop bc
jp BC_Times_DE

EDIT3: If you are willing to use a 256 byte LUT, this is a much, much faster fixed point lg() routine that I thought of on my walk home:

Code:

lg_88:
;Input: HL is a fixed point number
;Output: lg(H.L)->H.L
;Speed: Avg: 340
ld de,lgLUT
ld b,0
ld a,h
or a
ret m
ld a,l
jr z,\$+8
inc b \ srl h \ rra \ jr nz,\$-4
or a \ jr nz,\$+6
ld hl,8000h \ ret
rra \ inc b \ jr nc,\$-2
;A is the element to look up in the LUT
ld l,a
ld c,h
dec b
ld e,(hl)
inc hl
ld d,(hl)
ex de,hl
ret
lglut:
.dw \$F800
.dw \$F996
.dw \$FA52
.dw \$FACF
.dw \$FB2C
.dw \$FB76
.dw \$FBB3
.dw \$FBE8
.dw \$FC16
.dw \$FC3F
.dw \$FC64
.dw \$FC86
.dw \$FCA5
.dw \$FCC1
.dw \$FCDC
.dw \$FCF4
.dw \$FD0B
.dw \$FD21
.dw \$FD36
.dw \$FD49
.dw \$FD5C
.dw \$FD6D
.dw \$FD7E
.dw \$FD8E
.dw \$FD9D
.dw \$FDAC
.dw \$FDBA
.dw \$FDC8
.dw \$FDD5
.dw \$FDE2
.dw \$FDEE
.dw \$FDFA
.dw \$FE06
.dw \$FE11
.dw \$FE1C
.dw \$FE26
.dw \$FE31
.dw \$FE3B
.dw \$FE44
.dw \$FE4E
.dw \$FE57
.dw \$FE60
.dw \$FE69
.dw \$FE71
.dw \$FE7A
.dw \$FE82
.dw \$FE8A
.dw \$FE92
.dw \$FE9A
.dw \$FEA1
.dw \$FEA9
.dw \$FEB0
.dw \$FEB7
.dw \$FEBE
.dw \$FEC5
.dw \$FECB
.dw \$FED2
.dw \$FED8
.dw \$FEDF
.dw \$FEE5
.dw \$FEEB
.dw \$FEF1
.dw \$FEF7
.dw \$FEFD
.dw \$FF03
.dw \$FF09
.dw \$FF0E
.dw \$FF14
.dw \$FF19
.dw \$FF1E
.dw \$FF24
.dw \$FF29
.dw \$FF2E
.dw \$FF33
.dw \$FF38
.dw \$FF3D
.dw \$FF42
.dw \$FF47
.dw \$FF4B
.dw \$FF50
.dw \$FF55
.dw \$FF59
.dw \$FF5E
.dw \$FF62
.dw \$FF67
.dw \$FF6B
.dw \$FF6F
.dw \$FF74
.dw \$FF78
.dw \$FF7C
.dw \$FF80
.dw \$FF84
.dw \$FF88
.dw \$FF8C
.dw \$FF90
.dw \$FF94
.dw \$FF98
.dw \$FF9B
.dw \$FF9F
.dw \$FFA3
.dw \$FFA7
.dw \$FFAA
.dw \$FFAE
.dw \$FFB2
.dw \$FFB5
.dw \$FFB9
.dw \$FFBC
.dw \$FFC0
.dw \$FFC3
.dw \$FFC6
.dw \$FFCA
.dw \$FFCD
.dw \$FFD0
.dw \$FFD4
.dw \$FFD7
.dw \$FFDA
.dw \$FFDD
.dw \$FFE0
.dw \$FFE4
.dw \$FFE7
.dw \$FFEA
.dw \$FFED
.dw \$FFF0
.dw \$FFF3
.dw \$FFF6
.dw \$FFF9
.dw \$FFFC
.dw \$FFFF

It averages about 340 t-states and should be accurate to every bit. Being this fast, it might be better to combine this with the very fast multiplication routine above to get log base x Or for example, you can incorporate a natural log routine by simply multiplying lg(x) by 1/lg(e) (which is approximately 355/512):

Code:

ln_88:
;Input: HL is a fixed point number
;Output: ln(H.L)->H.L
;Speed: Avg: 340+(325 worst case)
call lg_88
;now signed multiply HL by 355, then divide by 2 (rounding)
ld de,0
bit 7,h
jr z,\$+9
dec e \ xor a \ sub l \ ld l,a
sbc a,a \ sub h \ ld h,a
ld b,h
ld c,l
xor a
sra a \ rr h
ld l,h
ld h,a
inc e
ret nz
xor a \ sub l \ ld l,a
sbc a,a \ sub h \ ld h,a
ret

The overhead being at most 325 t-states, bringing it to an average of 665 t-states for ln() on (0,128)
To add to the table-driven routines, this is a much faster arctan routine, using a 158-byte table (it is <800 t-states on values >1, and about 350 for values on [0,1]) It is also accurate in every bit for most values, but I imagine there are a few that are off in the last bit:

Code:

arctan_88:
;Input:
;   D.E
;Output: atan(D.E)->D.E
push de
ld a,d
or a
jp p,\$+5
neg
ld d,a
dec a
jr nz,checkneedinv
inc e \ dec e \ jr nz,checkneedinv
pop af \ rla \ ld de,201 \ ret nc \ ld de,-201 \ ret
checkneedinv:
inc a
call nz,DEgt1_Inv
;0.E is the value to atan
push hl
ld a,e
cp 46 \ ret c
dec a \ cp 42h \ ret c
dec a \ cp 4Eh \ ret c
dec a \ cp 57h \ ret c
dec a \ cp 5Eh \ ret c
dec a \ cp 64h \ ret c
dec a \ cp 6Ah \ ret c
dec a \ cp 6Fh \ ret c
sub 6Fh \ ld e,a
ld hl,atanlut
ld a,(hl)
ret
ld e,a
pop bc
ld a,b
or a
jp p,\$+5
neg
jr z,\$+9
ld hl,402
or a
sbc hl,de
ex de,hl
rl b
ret nc
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a
ret

DEgt1_Inv:
;Works if DE>1
ld hl,256
ld b,8
InvLoop:
sbc hl,de
jr nc,\$+3
djnz InvLoop
cpl
ld e,a
ld d,b
ret
atanlut:
.db \$6F
.db \$6F
.db \$70
.db \$71
.db \$72
.db \$73
.db \$73
.db \$74
.db \$75
.db \$76
.db \$77
.db \$77
.db \$78
.db \$79
.db \$7A
.db \$7B
.db \$7B
.db \$7C
.db \$7D
.db \$7E
.db \$7F
.db \$7F
.db \$80
.db \$81
.db \$82
.db \$82
.db \$83
.db \$84
.db \$85
.db \$85
.db \$86
.db \$87
.db \$88
.db \$88
.db \$89
.db \$8A
.db \$8B
.db \$8B
.db \$8C
.db \$8D
.db \$8E
.db \$8E
.db \$8F
.db \$90
.db \$90
.db \$91
.db \$92
.db \$93
.db \$93
.db \$94
.db \$95
.db \$95
.db \$96
.db \$97
.db \$97
.db \$98
.db \$99
.db \$9A
.db \$9A
.db \$9B
.db \$9C
.db \$9C
.db \$9D
.db \$9E
.db \$9E
.db \$9F
.db \$A0
.db \$A0
.db \$A1
.db \$A2
.db \$A2
.db \$A3
.db \$A3
.db \$A4
.db \$A5
.db \$A5
.db \$A6
.db \$A7
.db \$A7
.db \$A8
.db \$A9
.db \$A9
.db \$AA
.db \$AA
.db \$AB
.db \$AC
.db \$AC
.db \$AE
.db \$AF
.db \$AF
.db \$B0
.db \$B0
.db \$B1
.db \$B2
.db \$B2
.db \$B3
.db \$B3
.db \$B4
.db \$B5
.db \$B5
.db \$B6
.db \$B6
.db \$B7
.db \$B7
.db \$B8
.db \$B9
.db \$B9
.db \$BA
.db \$BA
.db \$BB
.db \$BB
.db \$BC
.db \$BC
.db \$BD
.db \$BE
.db \$BE
.db \$BF
.db \$BF
.db \$C0
.db \$C0
.db \$C1
.db \$C1
.db \$C2
.db \$C2
.db \$C3
.db \$C3
.db \$C4
.db \$C4
.db \$C5
.db \$C6
.db \$C6
.db \$C7
.db \$C7
.db \$C8
.db \$C8
.db \$C9

It is looking like table methods are the best option for 8.8 fixed point, especially if you can reduce the range of needed values to be on [0,1] (see the above post, last edit for very fast ln() and lg() using a table)

Similarly, sine and cosine can be generated from a single 201 byte array.
I wrote a pseudo-random number generator with period 4292870399 today that I was hoping would be comparable to the version used by the OS (and it appears to be so). It takes approximately 1592 t-states making it much faster than using the OS bcalls and generates a 3-byte random integer:

Code:

Rand24:
;Inputs: seed1,seed2
;Outputs:
;   AHL is the pseudo-random number
;   seed1,seed2 incremented accordingly
;Destroys: BC,DE
;Notes:
; seed1*243+83 mod 65519 -> seed1
; seed2*251+43 mod 65521 -> seed2
; Output seed1*seed2
ld hl,(seed1)
xor a
ld b,h \ ld c,l
ld de,83
;now AHL mod 65519
; Essentially, we can at least subtract A*65519=A(65536-17), so add A*17 to HL
ex de,hl
ld l,a
ld b,h
ld c,l
ld de,65519
jr nc,\$+5
or a \ sbc hl,de
or a \ sbc hl,de
jr nc,\$+3
ld (seed1),hl
;seed1 done, now we need to do seed2
ld hl,(seed2)
; seed1*243+83 mod 65519 -> seed1
; seed2*251+43 mod 65521 -> seed2
; Output seed1*seed2
xor a
ld b,h \ ld c,l
ld de,43
;now AHL mod 65521
; Essentially, we can at least subtract A*65521=A(65536-15), so add A*15 to HL
ex de,hl
ld l,a
ld b,h
ld c,l
sbc hl,bc
ld de,65521
jr nc,\$+5
or a \ sbc hl,de
or a \ sbc hl,de
jr nc,\$+3
ld (seed2),hl
;now seed1 and seed 2 are computed
ld bc,(seed1)
ex de,hl
;now do BC_times_DE
call BC_Times_DE
ex de,hl
ld l,b
ld h,0
ld b,h
ld c,l
ld c,d
ld d,e
ld e,a
ld a,c
sbc hl,bc
sbc a,b
ret nc
ld c,43
ret
BC_Times_DE:
;BHLA is the result
ld a,b
or a
ld hl,0
ld b,h
;1
jr nc,\$+4
ld h,d
ld l,e
;2
rla
jr nc,\$+4
;227+10b-7p
rla
jr nc,\$+4

rla
jr nc,\$+4

rla
jr nc,\$+4

rla
jr nc,\$+4

rla
jr nc,\$+4

rla
jr nc,\$+4

;===
;AHL is the result of B*DE*256
push hl
ld h,b
ld l,b
ld b,a
ld a,c
ld c,h
;1
jr nc,\$+4
ld h,d
ld l,e
;2
rla
jr nc,\$+4
;227+10b-7p
rla
jr nc,\$+4

rla
jr nc,\$+4

rla
jr nc,\$+4

rla
jr nc,\$+4

rla
jr nc,\$+4

rla
jr nc,\$+4

pop de
;Now BDE*256+AHL
ld c,a
ld a,l
ld l,h
ld h,c
ret nc
inc b
;BHLA is the 32-bit result
ret
seed1:
.dw 0
seed2:
.dw 0

Naturally, I plan to use this in my math libraries. To quickly describe the algorithm, I use 2 LCGs with relatively prime period lengths, generate the next terms for each of them, and multiply the results for the output (so if you really wanted to, you could get 31 bits out of BHLA, but the upper bit shouldn't be very reliable).

EDIT: Modified to perform a final mod 16777259 to provide better results.
What does your distribution look like? Seems as though you'll only be generating random composites.
Yes, that is what would be happening if you used the full 32-bit output of the multiplication in BHLA. That is one reason for why I suggested using bits from the lower 24 bits. Using bits 8 to 23 (in HL), I have generated some primes, such as 9767, just now.
Can you run it a bunch of times and show us the results? Maybe a nice pretty graph!
Yeah, a histogram would be great

I know nothing is labeled. I made it so the range was 0-499. X axis is the value returned by the random numbers, Y axis is the number of that number returned. 50000 numbers were generated.

As a comparison, here's the same with .net's random:
Yeah, I noticed some problems, too. Instead of just taking the lower 24 bits, it seems that it is better to mod the final result by an odd number (I tried the prime 16777259 and it yields better results). Currently, it is about 3 times more likely to return an even number compared to an odd number, but moding the final result by an odd number balances the output. It helps that large mods are relatively easy to perform.
Here it is with that:
Much better! Although since division is so expensive, what if you just drop the lowest 3 bits of that original output? What happens then?
Wouldn't that make it so the number 1 is never returned?
Actually, there would be no division in the sense you may be thinking about. The routine that I posted does two non-trivial modulos on 24-bit numbers and a 16x16->32 multiplication all in under 1500 t-states. Since it is constants we are using, there are some tricks that can be employed. For example, to do AHL mod 65529, I noted that 65529=65536-17, so if I subtracted A*65536, I would need to add A*17 which is much easier to do and I would only need to add that to the HL register. What is more, if the addition sets the carry flag, I can just subtract 65529 or add 17 to HL (which reminds me that I posted that unoptimised version that subtracts). Then you just add 17 again, if that sets the carry flag, it means HL was >=65529, so the addition was good, otherwise, subtract the 17.

This costs about 162 more cycles to do modulo 16777259 on BHLA and I will soon edit the post to reflect the new change.
Xeda112358 wrote:
Yeah, I noticed some problems, too. Instead of just taking the lower 24 bits, it seems that it is better to mod the final result by an odd number (I tried the prime 16777259 and it yields better results).

It's probably important that it's prime, not just odd, since otherwise you're not working on a field.
If that's of any use to anyone, here is a signed AHL = AHL / DE routine - I think it's pretty optimized (70 bytes) :

Code:
sDiv2416:
xor d
push af
xor d
ld b, a
jp p, divdPos
xor a
sub l
ld l, a
ld a, 0
sbc a, h
ld h, a
ld a, 0
sbc a, b
ld b, a
divdPos:
bit 7, d
jr z, divrPos
xor a
sub e
ld e, a
ld a, 0
sbc a, d
ld d, a
ld a, b
divrPos:

push hl
pop ix
ld hl, 0
ld b, 24
divLoop:
rla
sbc hl, de
jr nc, \$ + 4
; jp nc, xxxx
; it works because the carry can not be set and "inc ix" is 2 bytes
.db \$D2
inc ix
djnz divLoop

push ix
pop hl
ld b, a
pop af
ld a, b
ret p
xor a
sub l
ld l, a
ld a, 0
sbc a, h
ld h, a
ld a, 0
sbc a, b
ret

Register to Join the Conversation
Have your own thoughts to add to this or any other topic? Want to ask a question, offer a suggestion, share your own programs and projects, upload a file to the file archives, get help with calculator and computer programming, or simply chat with like-minded coders and tech and calculator enthusiasts via the site-wide AJAX SAX widget? Registration for a free Cemetech account only takes a minute.

»
» Goto page Previous  1, 2, 3, 4, 5, 6, 7, 8  Next
» All times are GMT - 5 Hours

You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum