So I made my own rolled version that is actually faster than the one they offered. It is also two bytes smaller than what I was using for Grammer (and 20% faster), while using fewer registers, so that's a huge boon:
;returns HL as the sqrt, DE as the remainder
.db $DA ;start of jp c,** which is 10cc to skip the next two bytes.
I wanted to see if my version was better than Axe's so that I could offer it as an optimization, so I looked it up and holy hell is Axe's a lot faster. I tracked back through the optimizations thread to find that it was none other than Runer112 who came up with such an amazing routine.
Here is the post and here is the slightly optimized code that later made it into Axe (along with my notes on speed):
Runer's routine is 237cc faster in the average case (about 23%)! And now I know who to ask.
EDIT: Hahaha, calc84maniac posted code later that is almost exactly the same as what I came up with. We even both used the JP C,** trick, but mine still ends up a bit faster and the same size just because I copied C to A outside the loop and just used A for the shifting.
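For anyone who hasn't seen the JP C,** trick: here is a minimal sketch of the idea. This is not the routine from this post. The labels and values are made up purely for illustration, and I'm assuming a TASM/spasm-style assembler that accepts `.db`. It only works if carry is guaranteed to be reset when execution reaches the `$DA` byte.

```
; Hypothetical fragment illustrating the trick; labels and values are invented.
first_path:
    ld a,1          ; value for the first path
    or a            ; OR always resets carry, so the jp c below is never taken
    .db $DA         ; opcode of jp c,**: the next two bytes are read as its address
second_path:
    ld a,2          ; 2 bytes ($3E,$02): only executed when entered at second_path
shared_tail:
    ret             ; A=1 when entered at first_path, A=2 when entered at second_path
```

Falling through from first_path, the CPU decodes `$DA $3E $02` as a `jp c` that it does not take, so the `ld a,2` is skipped in 10cc at a cost of one byte. A `jr` over the same two bytes would be 2 bytes and 12cc.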