So I made my own rolled version that is actually faster than the one they offered. It is also two bytes smaller than what I was using for Grammer (and 20% faster), while using fewer registers, so that's a huge boon:
Code:
sqrtDE:
;input: DE
;returns HL as the sqrt, DE as the remainder
;34 bytes
;min: 931cc
;max: 1123cc
;avg: 1027cc
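;931+8{24,0} (931cc base, plus 24cc or 0cc on each of the 8 iterations)
ld bc,$8000 ;BC = bit mask, the first shift pair in the loop makes it $4000
ld h,c ;HL = 0, the root accumulator
ld l,c
ld a,c ;A shadows C so the low-byte shift can use the 1-byte RRA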
sqrt_loop:
srl b
rra
ld c,a
add hl,bc
ex de,hl
sbc hl,de
jr nc,+_
add hl,de
ex de,hl
or a
sbc hl,bc
.db $DA ;opcode of jp c,**. Carry is clear here, so the jump is never taken: it costs 10cc and swallows the next two bytes as its address.
_:
ex de,hl
add hl,bc
srl h
rr l
srl b
rra
jr nc,sqrt_loop
ret
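For anyone who wants to sanity-check the logic without an emulator, here is a rough Python model of what the loop above is doing. It is only a sketch: `sqrt_de` is a made-up name, and it follows the algorithm (bit mask in BC, root accumulator in HL, remainder in DE) rather than the exact register and flag behavior, so it says nothing about timing.

```python
import math

def sqrt_de(num):
    """Model of sqrtDE for a 16-bit input: returns (root, remainder)."""
    res = 0        # plays the role of HL
    bit = 0x4000   # plays the role of BC after the first shift pair
    while bit:
        trial = res + bit            # add hl,bc
        if num >= trial:             # sbc hl,de did not borrow
            num -= trial
            res = (res >> 1) + bit   # add hl,bc then srl h / rr l
        else:
            res >>= 1                # restore, then srl h / rr l
        bit >>= 2                    # two srl b / rra pairs per pass
    return res, num

print(sqrt_de(1000))  # (31, 39), since 31*31 + 39 = 1000
```

Checking it against `math.isqrt` over all 65536 inputs is a cheap way to confirm the model, though that only validates the model and not the assembly itself.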
I wanted to see if my version was better than Axe's so that I could offer it as an optimization, so I looked it up and holy hell is Axe's a lot faster. I tracked back through the optimizations thread to find that it was none other than Runer112 who came up with such an amazing routine.
Here is the post and here is the slightly optimized code that later made it into Axe (along with my notes on speed):
Quote:
Code:
p_Sqrt:
;766cc+8{0,6} (766cc base, plus 0cc or 6cc on each of the 8 iterations)
;min: 766cc
;max: 814cc
;avg: 790cc
ld a,l
ld l,h
ld de,$0040
ld h,d
ld b,8
or a
__SqrtLoop:
sbc hl,de
jr nc,__SqrtSkip
add hl,de
__SqrtSkip:
ccf
rl d
rla
adc hl,hl
rla
adc hl,hl
djnz __SqrtLoop
ld h,b
ld l,d
ret
__SqrtEnd:
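As far as I can tell, the trick in the routine above is the packing: D holds the root so far and E stays at $40, so DE is always 64*(4*root+1), and HL holds the running remainder scaled by 64 with the next input bits sitting underneath. Here is a hedged Python model of that reading. `p_sqrt` and the variable names are mine, and it assumes the carries that the real code rotates through A and HL never affect the result, which is my own analysis and not something stated in the thread.

```python
import math

def p_sqrt(n):
    """Model of p_Sqrt for a 16-bit input: returns floor(sqrt(n))."""
    hl = n >> 8      # ld l,h / ld h,d : HL = high byte of the input
    a = n & 0xFF     # ld a,l          : A = low byte, fed in 2 bits at a time
    d = 0            # root so far (E is fixed at $40)
    for _ in range(8):
        de = (d << 8) | 0x40           # 64 * (4*root + 1)
        bit = 1 if hl >= de else 0     # sbc hl,de / jr nc / add hl,de
        if bit:
            hl -= de
        d = ((d << 1) | bit) & 0xFF    # ccf / rl d
        # rla / adc hl,hl twice: shift the next two input bits into HL
        hl = ((hl << 2) | (a >> 6)) & 0xFFFF
        a = (a << 2) & 0xFF
    return d                           # ld h,b / ld l,d

print(p_sqrt(1000))  # 31
```

Because the compare and the subtract both work on the scaled values, the loop never has to shift DE separately, which looks like where a lot of the speed comes from.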
Runer's routine is 237cc faster in the average case (about 23%)! And now I know who to ask.
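The numbers come straight from the timing notes in the two listings. A small sketch of the arithmetic, where `timing` is a hypothetical helper and the average is assumed to be the midpoint of min and max (each branch taken about half the time):

```python
def timing(base, iterations, extras):
    """Expand a 'base+N{a,b}' note into (min, max, avg) clock cycles."""
    lo = base + iterations * min(extras)
    hi = base + iterations * max(extras)
    return lo, hi, (lo + hi) / 2   # avg assumes a 50/50 branch split

mine = timing(931, 8, (24, 0))     # sqrtDE:  931+8{24,0}
runer = timing(766, 8, (0, 6))     # p_Sqrt:  766+8{0,6}
saved = mine[2] - runer[2]

print(mine)                               # (931, 1123, 1027.0)
print(runer)                              # (766, 814, 790.0)
print(saved, round(100 * saved / mine[2], 1))  # 237.0 23.1
```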
EDIT: Hahaha, calc84maniac posted code later that is almost exactly the same as what I came up with. We even both used the JP C,** trick, but mine still ends up a bit faster and the same size just because I copied C to A outside the loop and just used A for the shifting.