Since this is a somewhat recurring question, I thought I'd share some of our experience.
Even setting aside exam mode support, installing new OSes in ROM has limited applicability because it's so easy to brick a calc while doing that. Plus, it would require tricks that might allow bypassing exam modes, so communities won't necessarily like the initiative. On the other hand, using a different kernel in an add-in is possible. I think it's a much more reasonable approach; in fact, we've been doing it for years in the Planète Casio community, so I can attest that it's both possible and very stable!
Now, to address the potential XY problem. If you "just" want to have concurrency in an add-in, you don't necessarily need OS/kernel-level support.
The most straightforward option is to stay at the PrizmSDK level and use the system timers. The system timers are locked to 40 Hz, which I believe is fine for threading (too much context switching will kill performance after all). The timer callbacks are asynchronous but there is no parallelism, i.e. the normal flow of the program is interrupted and stopped while the callback runs. The actual problem with most timer/display related functions is thus not concurrency safety but reentrancy.
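To illustrate the reentrancy point, here is a minimal sketch of the usual workaround: the callback only raises a flag, and the main loop does the real work later. All names (`on_tick`, `redraw`, `main_loop_step`) are hypothetical and not part of the PrizmSDK; on a calc, `on_tick` would be whatever function you register as the timer callback, and `redraw` stands in for any non-reentrant display function.

```c
#include <assert.h>
#include <stdbool.h>

/* Set by the timer callback, consumed by the main loop. It is volatile
   because it changes outside the normal flow of the program. */
static volatile bool tick_pending = false;
static int frames_drawn = 0;

/* Hypothetical timer callback: it only raises a flag, so it never calls a
   non-reentrant function while the main program might be inside one. */
static void on_tick(void)
{
    tick_pending = true;
}

/* Stand-in for a non-reentrant display routine (assumption for the demo). */
static void redraw(void)
{
    static bool busy = false;
    assert(!busy);      /* would fire if redraw() were entered twice */
    busy = true;
    frames_drawn++;
    busy = false;
}

/* One iteration of the main loop: do the deferred work, if any. */
static void main_loop_step(void)
{
    if (tick_pending) {
        tick_pending = false;
        redraw();
    }
}
```

Since the callback never touches the display itself, it doesn't matter where in the main program it interrupts; the cost is that the redraw is delayed until the main loop gets back to `main_loop_step()`.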
If you want better precision than 40 Hz, you can use the hardware timers, which run at either 32768 Hz or ~7.5 MHz depending on which one you use (and you can use them without a custom kernel!). However, you won't have interrupts, so you will need some yield-style action in the threads. This is because hardware timers use hardware interrupts, and to catch them you would basically need a custom kernel.
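For what "yield-style" can look like without interrupts, here is a rough sketch of a cooperative round-robin scheduler. Everything in it (`task_spawn`, `scheduler_run`, `countdown_step`) is made up for illustration. Each task is a function that does a small slice of work and returns, and returning is the yield. On a calc, a task could poll a hardware timer counter to decide when its slice is over; that part is left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_TASKS 4

/* A task does one slice of work per call and returns true while it still
   has work left. Returning from the function is the "yield". */
typedef bool (*task_step_fn)(void *state);

struct task {
    task_step_fn step;
    void *state;
    bool alive;
};

static struct task tasks[MAX_TASKS];
static size_t task_count = 0;

/* Register a task with its private state. */
static void task_spawn(task_step_fn step, void *state)
{
    assert(task_count < MAX_TASKS);
    tasks[task_count].step = step;
    tasks[task_count].state = state;
    tasks[task_count].alive = true;
    task_count++;
}

/* Run all tasks round-robin until every one has finished.
   Returns the total number of slices executed. */
static int scheduler_run(void)
{
    int slices = 0;
    bool any_alive = true;

    while (any_alive) {
        any_alive = false;
        for (size_t i = 0; i < task_count; i++) {
            if (!tasks[i].alive)
                continue;
            tasks[i].alive = tasks[i].step(tasks[i].state);
            slices++;
            any_alive = any_alive || tasks[i].alive;
        }
    }
    return slices;
}

/* Example task: counts down by one unit per slice. */
static bool countdown_step(void *state)
{
    int *remaining = state;
    if (*remaining > 0)
        (*remaining)--;
    return *remaining > 0;
}
```

Spawning two countdown tasks with different counters and calling `scheduler_run()` interleaves them one slice at a time. The obvious drawback is that a task that never returns starves everyone else, which is exactly what interrupts would solve.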
Yatis worked on a threaded version of gint at some point, and his version had both software and gint's hardware context saves, allowing it to switch tasks but also kernels for debugging and tracing syscalls. It's pretty involved, but I guess the bottom line is that pretty much anything is possible.
Also, if anyone is reading this without knowing about the specific hardware of the fx-CG, let's clear up any doubt now: actual parallelism is impossible since there is only one CPU (unless someone manages to reverse-engineer the DSP cores of the SPU2 module), and thus any concurrency would be on the scheduling side. ^^
Now, if you want to actually port an existing real-time OS, that's a whole different matter and it sounds much more difficult.