In which I rant angrily to the world about what appear to be very poor design decisions because TI never responds to anything we tell them, which makes it seem like they aren't listening, so why bother trying to talk?

Also, if any TI engineer actually reads this, I apologize for the exasperated tone. If I actually had faith someone at TI would read this, I would use a much more constructive tone.

Texas Instruments is a world leader in chip design. They have some of the best engineers in the world. So why is it that, over the past decade, every update to the TI-83 Plus series has had horrible performance problems?

Before I start with recent stuff, though, I want to mention a less-recent mistake. Earlier iterations of the Z80 calculator line had only a few interrupt sources, and in the 90s, glue logic circuitry was surely more expensive. So TI designed their glue logic to use the Z80's IM1 interrupt mode, which requires the CPU to query each possible interrupt source to figure out where the interrupt came from. With only four interrupts, this was not a major problem for the CPU to handle.

The TI-83 Plus SE started adding more interrupts for both the D-bus link assist and the crystal timers. This wasn't too bad. But then came the TI-84 Plus (/SE). It added a ton of interrupts for the USB port. At this point, TI should have realized two things: first, it was no longer the 90s and minimizing logic was no longer necessary; and second, they had so many interrupt sources that checking all of them was getting to be excessive.

Fortunately, the Z80 CPU has a great way to address that exact problem: IM2! In IM2, the hardware sends the CPU a number telling it which interrupt source triggered the interrupt, and the CPU magically selects the right interrupt handler using that number. It costs some extra circuitry, but it makes handling interrupts much easier for the CPU. Naturally, with a growing problem in one hand and a ready-made solution for that exact problem in the other, TI decided to continue using the old, slower system. (Admittedly, this has never been a serious performance problem, but it is still a missed opportunity.)
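To make the difference concrete, here is a rough Python model of the two dispatch schemes. The source names, the table location, and the handler address are all made up for illustration; this models the idea only, not TI's actual ASIC registers.

```python
# Hypothetical interrupt sources; names and order are illustrative only.
SOURCES = ["on_key", "timer1", "timer2", "link", "usb"]

def dispatch_im1(pending):
    """IM 1 style: one shared handler polls each source in turn.

    Returns (source_found, number_of_status_checks).
    """
    checks = 0
    for name in SOURCES:
        checks += 1
        if name in pending:
            return name, checks
    return None, checks

def dispatch_im2(i_register, vector_byte, memory):
    """IM 2 style: the interrupting device supplies a byte, the CPU
    combines it with the I register to form a table address, and reads
    a 16-bit little-endian handler address from that location."""
    table_addr = (i_register << 8) | vector_byte
    return memory[table_addr] | (memory[table_addr + 1] << 8)

# Made-up vector table at 0x9A00; the entry at offset 8 points to 0x1234.
memory = {0x9A08: 0x34, 0x9A09: 0x12}

print(dispatch_im1({"usb"}))                  # the last source costs five checks
print(hex(dispatch_im2(0x9A, 0x08, memory)))  # a single table lookup
```

The polling cost in the first scheme grows with the number of sources, while the table lookup in the second stays constant.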

Now, the first recent problem was the TI-84 Plus C SE. There was nothing wrong with the CPU per se, but it should have been obvious that asking the same old 15 MHz Z80 to drive two hundred times more data to the screen without any graphics acceleration technology was never going to be viable. Even the best-case scenario of filling the screen with a solid color would take around 150 ms; if you actually wanted to display anything interesting, performance would necessarily be far worse.
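A quick sanity check of those figures, as a sketch. The display geometries below are assumptions not stated above (96x64 at 1 bit per pixel for the monochrome models, 320x240 at 16 bits per pixel for the color model); the 15 MHz clock and 150 ms fill time are the numbers from the paragraph.

```python
# Assumed display geometries (not stated in the post itself).
MONO = (96, 64, 1)      # width, height, bits per pixel
COLOR = (320, 240, 16)

def frame_bytes(width, height, bits_per_pixel):
    """Bytes needed to describe one full frame."""
    return width * height * bits_per_pixel // 8

mono_bytes = frame_bytes(*MONO)
color_bytes = frame_bytes(*COLOR)
ratio = color_bytes // mono_bytes

CPU_HZ = 15_000_000     # clock speed quoted above
FILL_SECONDS = 0.150    # best-case solid fill time quoted above

# CPU cycles available per byte if a solid fill takes 150 ms.
cycles_per_byte = CPU_HZ * FILL_SECONDS / color_bytes

print(mono_bytes, color_bytes, ratio)
print(round(cycles_per_byte, 1))
```

Under these assumptions the color frame is 200 times larger, and a 150 ms fill leaves the CPU fewer than 15 cycles per byte, which is not much room for doing anything beyond a plain store loop.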

The software team made a valiant effort to solve the problem with algorithms, but no algorithm could defeat the fact that they needed to push too much data. Sprite-based games could run pretty well if you didn't need anything interesting to happen in the background, but any kind of full screen graphics is simply impossible. Switching screens? Painfully slow. Scrolling text? That's a horrible idea, so they did it anyway in the program editor and various menus.

I have to imagine that the engineering team warned there would be performance problems. And I'm sure nearly every tester's feedback included a complaint about how sluggish the new model was. Why did management ignore these concerns and press on?

I pointed out on Cemetech that there were binary-compatible CPUs with much better IPC performance. I don't know if anyone at TI saw those comments, but apparently continued complaints about performance finally convinced TI to actually design some new hardware.

And so came the TI-84 Plus CE. TI ditched the old Z80 and moved to the eZ80, a binary-compatible evolution of the Z80 featuring a tight pipeline with the same classic 8-bit data bus and a 24-bit address space. The eZ80 is, in fact, so well-pipelined that its performance is strictly limited by the 8-bit data bus width. Retaining the old 8-bit data bus makes it easy to adapt old Z80-based designs to the new CPU, but it is also the bottleneck of the CPU. The eZ80 also has no cache, although it's unclear how much a cache could help when the net bandwidth through the 8-bit data bus remains small. Thus, like the Z80 before it, the eZ80 is designed to work on a low-bandwidth, low-latency bus.

So, knowing that the eZ80 is specifically designed for use on a low-bandwidth, low-latency bus, what does TI do with it? They put the eZ80 on a high-bandwidth, high-latency bus intended for use with ARM CPUs. Because the CPU has only an 8-bit interface to the outside world, it can't take advantage of the high bandwidth provided. And, because the CPU has no cache, it experiences the full, unmitigated brunt of the high-latency bus. Thus, TI took their shiny-new high-performance CPU and found the worst possible way to use it.

Let me just make this as clear as I can: On the eZ80, by far the most important factor affecting performance is not bus bandwidth, but latency. For the eZ80, bus latency must be MINIMIZED AT ALL COSTS!

Why would they do this? That, at least, seems fairly obvious: the TI-84 Plus CE uses a lot of peripheral logic blocks (such as for the keyboard and USB) designed for use with ARM CPUs, which reduces engineering costs. Unto itself, it's not a bad design decision. The CPU doesn't usually spend a lot of time talking with peripherals. It's a perfectly reasonable thing to do---so long as the CPU's access to memory doesn't have to go through that high-latency interface. Unfortunately, that is not what TI did.

How exactly does this impact the CPU's performance? Ideally, the eZ80 is used with zero-latency memory, meaning that when the CPU requests a read, for example, the data comes back on the same clock cycle. If every read requires one extra wait state, then every read effectively takes twice as long. If every read requires two wait states, then every read takes three times as long. Every instruction requires the CPU to read, at a minimum, each byte of the opcode; some instructions also fetch additional data bytes from elsewhere. Because of the small bus and tight pipelining of the eZ80, if every read and every write requires one extra wait state, the CPU's effective throughput is halved, as though the CPU's clock speed is only half of what it really is, because the CPU is forced to spend half its time waiting, doing nothing at all. If every read and every write requires two extra wait states, then throughput is reduced to just a third of what the CPU is capable of.
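The arithmetic can be written as a one-line model. This is a simplification that assumes every bus access costs one cycle plus a fixed number of wait states and that the CPU does nothing else useful while waiting.

```python
def relative_throughput(wait_states):
    """Fraction of zero-wait-state performance that remains when every
    bus access takes (1 + wait_states) cycles instead of 1."""
    return 1 / (1 + wait_states)

for ws in range(4):
    print(ws, "wait states ->", f"{relative_throughput(ws):.0%}")
```

One wait state leaves half the throughput and two leave a third, matching the figures above.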

How many wait states does the CPU have to contend with on the TI-84 Plus CE? Well, for RAM, every read incurs a three wait state penalty and every write one wait state, so when running code from RAM that only touches RAM, the CPU's best possible performance is only around a quarter to a third of what it could be. But it gets worse.
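Where the "quarter to a third" range comes from can be sketched by mixing reads and writes. The read fraction is a free parameter here, and the model assumes every cycle is a bus access, which is only roughly true.

```python
READ_WAITS = 3   # RAM read penalty quoted above
WRITE_WAITS = 1  # RAM write penalty quoted above

def ram_throughput(read_fraction):
    """Relative throughput for a given mix of RAM reads and writes."""
    avg_cycles = (read_fraction * (1 + READ_WAITS)
                  + (1 - read_fraction) * (1 + WRITE_WAITS))
    return 1 / avg_cycles

print(ram_throughput(1.0))   # all reads: a quarter
print(ram_throughput(0.5))   # half reads, half writes: a third
```

Since every opcode byte is itself a read, real code sits much closer to the all-reads end of that range.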

The original models in the Z80 family all used execute-in-place ROM or flash memory, meaning that every time the CPU needed a byte from ROM, it had to fetch that byte anew. On the Z80 models, this would ultimately prove to be the limiting factor, as the flash chips are specified for a maximum of 20 MHz operation, while the CPU itself can actually handle up to around 25 MHz. (TI seems to have decided to play it safe by not exceeding 16 MHz, a common engineering practice known as derating.)

The original design of the TI-84 Plus CE uses the same old style of parallel flash memory. Even though the eZ80 is specified for a maximum speed of 50 MHz, the flash memory interface meant that the CPU could never possibly exceed performance equivalent to running at about 20 MHz. But remember that three-wait-state penalty for RAM access? Flash memory accesses are also forced to go through two or three wait states before the request even gets to the flash chip, plus four more states before the flash can reply, and another two before the result gets returned to the CPU. The net result is a massive nine wait states for every byte read from flash. Nine! The result is the CPU spending somewhere around 80 to 90 % of its time just sitting idle waiting for some logic block to finally get it the data it needs.
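Adding up that latency budget as a sketch, using only the per-stage wait states listed above and assuming the CPU is doing nothing but fetching from flash:

```python
REQUEST_WAITS = (2, 3)  # wait states before the request reaches the flash
FLASH_WAITS = 4         # wait states before the flash can reply
RETURN_WAITS = 2        # wait states to get the result back to the CPU

def idle_fraction(wait_states):
    """Share of cycles spent waiting when each access costs
    one useful cycle plus wait_states stalled cycles."""
    return wait_states / (1 + wait_states)

for request in REQUEST_WAITS:
    total = request + FLASH_WAITS + RETURN_WAITS
    print(total, "wait states ->", f"{idle_fraction(total):.0%} idle")
```

Pure flash fetching lands at roughly 89 to 90 % idle; code that also touches RAM, with its smaller penalties, pulls the average down toward the lower end of the range quoted above.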

At some point, I complained to an engineer at TI about this. They seem to have actually paid attention (without ever actually telling us they would fix the issue), because in 2019, they released a new hardware revision that substantially improves performance by adding a cache. My own benchmarks show this new design is about two to three times faster depending on the specific task the OS is doing. That's great! Except, once again, TI managed to hamper the new, improved design with avoidable bottlenecks.

The first problem with the new design is what they didn't fix: the RAM latency. My guess is that they wanted the RAM to still be accessible through an ARM bus because some of their logic blocks (namely the LCD and USB) need DMA capability. However, CPUs need frequent access to RAM; there's no point in a CPU without RAM! So the RAM access bottleneck is worth solving. Assuming that DMA capability really is the concern here, TI should have split RAM into two banks, a zero-wait-state bank exclusively for the CPU, and a DMA-capable RAM bank for the 150 K of VRAM and maybe an extra 10 K for USB DMA.

The second problem with the new design is that they decided it was time to move on from using old parallel-style flash memories and switch to modern high-bandwidth high-latency serial flash. Once again, that unto itself is not a bad decision; and indeed, this time TI actually put a cache between the high-latency flash and the CPU to mitigate the latency penalty. And TI even bothered to operate the flash in its QSPI mode, which allows the flash to return data four times faster than standard SPI. Unfortunately, their implementation is still very sub-optimal.

The main problem with the new serial flash is its speed. The flash is spec'd for operation at up to 133 MHz. So what speed does TI operate it at? 24 MHz. That's right, TI took a high-speed flash and decided to operate it at half the CPU's clock speed and a fifth of its maximum speed. Now in some devices, that might be a reasonable thing to do, since SPI speed is dependent on how close the device is to the bus master. But in the TI-84 Plus CE, that limitation does not apply; TI can place the flash as close to the CPU as they like.

The worst problem with the new cache design is the penalty for a cache miss. All caches have a performance penalty for a cache miss, but TI's design makes that penalty literally ten times worse than it should be. First, like I said above, the flash is operating much slower than is necessary. But second, and more importantly, the flash memory TI chose actually has a special mode for making servicing cache misses faster. Naturally, TI does not use that mode.

If you ask the flash to start reading from a random address, it can return the first byte within about 20 clock cycles, and each subsequent byte takes two additional clock cycles to fetch. Optimally, therefore, the cache should be able to return the first byte from a cache miss after just 20 cycles, with a one-cycle penalty for each subsequent byte until the cache line is filled. This is pretty decent. However, you can't simply leave a cache line half-filled (that's not how caches work), so unless the first byte you requested was the first byte in a cache line, the cache would need to go back and fetch the first half of the cache line.

Cache servicing mode automates that last part: you tell the flash what size a cache line is---say 32 bytes, the size used on the calculator---and when you've finished reading the second part of a cache line in, the chip automatically wraps around and gives you the first half. For example, let's say the faulting byte was at offset 20 into a 32 byte cache line. The flash will first return the byte at offset 20, and then continue returning the next eleven bytes of that cache line. After those twelve bytes have been fetched, instead of moving on to the first byte of the next cache line, the flash chip returns the data at offsets 0-19 of the current cache line. Thus, the flash returns the specific data that caused the cache miss as soon as possible, while also allowing the cache to fill the rest of the cache line. It's a great feature that TI brilliantly ignored.
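The wrap-around order is easy to express in code. This is only a model of the byte order described above, not of any particular flash chip's command set; the function name is made up.

```python
def wrapped_burst_order(miss_offset, line_size=32):
    """Offsets within one cache line in the order a wrap-around burst
    would deliver them, starting at the byte that caused the miss."""
    return [(miss_offset + i) % line_size for i in range(line_size)]

order = wrapped_burst_order(20)
print(order[:12])   # offsets 20 through 31 arrive first
print(order[12:])   # then the burst wraps to offsets 0 through 19
```

Every offset in the line is delivered exactly once, so the cache line ends up complete either way; only the arrival order changes.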

Instead, the cache uses the slow method of handling a cache miss: it just starts with offset 0 of the missing cache line, reads the whole cache line, and doesn't give the CPU the byte it wanted until after the cache line is filled. So if you wanted byte 0, you still have to wait for all 32 bytes of the entire cache line to be retrieved before you can get byte 0. And remember, the QSPI interface to the flash inexplicably runs at a much slower speed than the CPU itself.
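A rough estimate of how long the CPU waits under each policy. This assumes the "20 clocks to the first byte, 2 clocks per byte after that" figures are flash clocks, that the flash runs at half the CPU clock (24 MHz against 48 MHz), and it ignores any overhead inside the cache controller itself.

```python
FIRST_BYTE_CLOCKS = 20    # flash clocks until the first byte arrives
CLOCKS_PER_BYTE = 2       # flash clocks for each further byte
LINE_SIZE = 32            # bytes per cache line
CPU_PER_FLASH_CLOCK = 2   # 48 MHz CPU / 24 MHz flash

def linear_fill_wait(line_size=LINE_SIZE):
    """CPU cycles stalled when the whole line is read from offset 0
    before the requested byte is handed over."""
    flash_clocks = FIRST_BYTE_CLOCKS + CLOCKS_PER_BYTE * (line_size - 1)
    return flash_clocks * CPU_PER_FLASH_CLOCK

def wrapped_fill_wait():
    """CPU cycles stalled when the requested byte is fetched first and
    forwarded to the CPU as soon as it arrives."""
    return FIRST_BYTE_CLOCKS * CPU_PER_FLASH_CLOCK

print(linear_fill_wait())   # 164
print(wrapped_fill_wait())  # 40
```

Even at the slow flash clock, forwarding the requested byte first would cut the stall to about a quarter in this model; the rest of the line can keep filling in the background.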

The bizarre thing about this is that, given the observed 200 cycle penalty for a cache miss, the old, slow parallel flash could actually be faster than the new high-performance serial flash. The old parallel flash can run at 16 MHz, a third of the CPU's 48 MHz, so it could return one byte every three CPU clock cycles. Thus, filling a 32-byte cache line would only take 96 cycles. Add in 10 cycles of overhead, and you get around 110 cycles, which is substantially faster than the 200 the serial flash takes to handle a cache miss. But it gets worse: parallel flashes actually also have a 16-bit mode, allowing data to be transferred twice as fast, so you could fill a 32 byte cache line in much less than half the time the new flash requires.

Now the cache does have one other problem: it's still suboptimal at servicing cache hits. Each cache hit on the same cache line costs one wait state, and each time you switch to a new cache line, there's an additional one cycle penalty. This means that even under ideal circumstances, the CPU still always spends at least half of all clock cycles stalled, since RAM writes always have a one state penalty and cache hits always have at least a one state penalty. In reality, the CPU is probably spending around 60 to 70 % of all cycles stalled---which is still double to triple the performance of the old design.

Now, while TI is not able to alter the design of the eZ80 logic block to run faster than 50 MHz, the cache is a different story. Caches are a common thing in digital chips. If a $5 STM32 can hit 100 MHz, then surely a company with world-class engineers like TI can design a zero-latency cache that can run at 48 MHz. Heck, I'd be happy with dropping the CPU to 24 MHz if it meant the cache and main RAM would all be zero-latency. Go ahead, do it backwards and run the bus at 2× the CPU's clock speed. And while you're at it, bump the SPI speed up to 48 or even 96 MHz. (Don't forget, at 96 MHz, the flash needs an extra wait state after the address before it starts returning valid data. The faster speed for filling a cache line is worth the slight increase in latency.)
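To see why the extra wait state at 96 MHz is a good trade, here is a sketch of the full line fill time at each SPI clock. It reuses the 20-clock setup and 2-clocks-per-byte figures from earlier; the cost of the extra wait state is not given above, so the value of 2 flash clocks used below is a placeholder assumption.

```python
CPU_MHZ = 48
FIRST_BYTE_CLOCKS = 20
CLOCKS_PER_BYTE = 2

def line_fill_cpu_cycles(flash_mhz, extra_clocks=0, line_size=32):
    """CPU cycles (at 48 MHz) needed to read one full cache line.

    extra_clocks models additional dummy clocks needed at high SPI
    speeds; the exact count is an assumption, not a datasheet value.
    """
    flash_clocks = (FIRST_BYTE_CLOCKS + extra_clocks
                    + CLOCKS_PER_BYTE * (line_size - 1))
    return flash_clocks * CPU_MHZ / flash_mhz

print(line_fill_cpu_cycles(24))                  # 164.0
print(line_fill_cpu_cycles(48))                  # 82.0
print(line_fill_cpu_cycles(96, extra_clocks=2))  # 42.0
```

In this model a couple of extra dummy clocks are lost in the noise compared with the fourfold gain from the faster clock.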

TL;DR: TI took a fast CPU and found the perfect way to make it perform as poorly as possible. For version 2.0, they approximately doubled or tripled performance, but nevertheless took a fast flash memory and once again found a way to make it perform poorly. Consequently, the TI-84 Plus CE still has room for performance to be improved by a factor of two to four without requiring switching to a whole new CPU ISA.

Oh yeah, and when they come out with the next hardware version, are they going to let us look at it before going to full-scale production? Of course not. They'll then act surprised when we again start pointing out problems. TI Education Technology: We're Adults, We Don't Need to Learn From Our Mistakes; Learning Is For Students!

With thanks to jacobly and Hooloovoo for investigating the properties of the cache and flash memory interface.

IRC wrote:
<DrDnar> You know what I want even more than native code? A TI Education that isn't so damn secretive. Why can't we just have honest conversations like adults instead this demeaning can't-tell-you-anything attitude?
<DrDnar> It's like even the color of the sky in Fort Worth is considered a TI trade secret subject to NDA.
DrDnar wrote:
Texas Instruments is a world leader in chip design. They have some of the best engineers in the world. So why is it that, over the past decade, every update to the TI-83 Plus series has had horrible performance problems?

Money, dear boy. What else?

When management wants something on the cheap for yesterday, don't expect miracles. Especially when you keep piling up technical debt.

While I'm here, I'll mention for comparison NumWorks as they've made a couple of debatable trade-offs based on cost grounds:
  • The N0100 model had 1 MiB of Flash integrated on the STM32F4 and no external Flash chip (yet the motherboard had pads for soldering one). Fixed last year on the N0110 model with 64 KiB internal Flash and an 8 MiB external Flash chip, but the N0100 firmware is now nearing 90% of Flash usage and thus it's only a matter of time before it simply won't fit anymore.
  • The frame buffer of both models is accessible behind an i8080 bus. It's quite slow when compared to a memory-mapped one, especially if you push lots of rectangles (or, Heaven forbid, read back from it), but with care it is possible to push full-screen updates fast enough for emulators in real-time.
  • No RTC quartz, USB-C or USB OTG. The N0110 also dropped the pads for SPI and the SD card, as well as the ability to mux a UART over the USB pins (but it's ST's fault as it is a limitation of the STM32F7).

But note that only the first point is anywhere near the level of TI's fails over the years. The second one is mostly irrelevant for calculation purposes and everything in the third one is not useful for everyday users. Plus the hardware and firmware are as modern as it gets.

I don't know enough about Casio calculators to have a technical opinion on them, but out of all the competitors in the calculator market for that segment, it appears only TI is guilty of egregious complacency in their hardware design.
boricj wrote:
[I]t appears only TI is guilty of egregious complacency in their hardware design.
I think you're forgetting about the Nspire family, which when it was new had decent hardware as far as I know.

The TI-83 Plus had pretty cutting-edge features back in 1999. For the time, TI designed a product that struck a nice balance between extensibility and price. TI even had an online app store long before Apple. Sadly, the time wasn't right (not enough Internet penetration) and the educational market proved to be a tough sell.

Unfortunately, it seems like TI is treating the TI-84 Plus CE as a dead-on-arrival legacy platform. All the innovation seems to be going into the Nspire. Nevertheless, the TI-83 Plus family remains popular. Perhaps a lot of teachers find the Nspire too complicated. Whatever the reason, the continued popularity of the TI-83 Plus family should be prompting TI to invest more into innovation on that platform.

boricj wrote:
When management wants something on the cheap for yesterday, don't expect miracles. Especially when you keep piling up technical debt.
Technical debt is another post entirely, but it's hardly unique to TI. I have radical ideas about how to bring the beloved TI-83 Plus series into the 21st century. But again, since we can't have any dialog with TI (you know, that involves them saying meaningful stuff back), why bother saying anything?
Quote:
I think you're forgetting about the Nspire family, which when it was new had decent hardware as far as I know.

In relative terms, the '2007 Nspire Clickpad & '2010 Nspire Touchpad, with their 90 MHz ARM9 CPUs, have much faster hardware than even the TI-68k series has, indeed. Such hardware characteristics already weren't fantastic at the time, though: the '2003 HP-49g+ and the '2006 HP-50g already had 75 MHz ARM9 cores.
The 49g+ & 50g have much better overclocking potential than the Nspire Clickpad & Touchpad: they can reach 200 MHz, while the Clickpad & Touchpad choke below 150 MHz, IIRC.

In absolute terms, a 75 MHz ARM9 processor was already poor in 2003, so a 90 MHz ARM9 processor was even poorer in 2007. But that's true of all graphing calculators anyway :)
DrDnar wrote:
Unfortunately, it seems like TI is treating the TI-84 Plus CE as a dead-on-arrival legacy platform. All the innovation seems to be going into the Nspire.

Given the amount of work that went into this family lately (initial TI-83 Premium CE model with eZ80, TI-Innovator integration, TI-Adapter+Python app, TI-83 Premium CE Edition Python model, all the stuff in the firmware changelog), I'd say it is anything but dead-on-arrival. The execution in several of these areas, as you've described, is another story. Also, most of the features added are of limited interest to power users.

Heck, I recall TI-Planet complaining about the lack of real updates on the TI-Nspire CX family for the longest time, which let the HP Prime completely take over the first place in the QCC round-up over the years. It is only recently that they've released the TI-Nspire CX II (which is barely an upgrade) and started working on Python support nearly three years after the NumWorks came out.

DrDnar wrote:
boricj wrote:
When management wants something on the cheap for yesterday, don't expect miracles. Especially when you keep piling up technical debt.
Technical debt is another post entirely, but it's hardly unique to TI. I have radical ideas about how to bring the beloved TI-83 Plus series into the 21st century. But again, since we can't have any dialog with TI (you know, that involves them saying meaningful stuff back), why bother saying anything?

As the old saying goes: you can write a letter to TI, but you can't force them to acknowledge it.

Thankfully, this is not the case with pretty much every other competitor. Again, I don't know about Casio, but both HP and NumWorks developers will actually respond when you write to them.
  