Unofficial Cemetech Programming Contest
x25519 Key Exchange
Calling all efficient ez80 assembly programmers...

Overview
x25519 KEX is the last remaining algorithm needed for a functional, bare-minimum TLS 1.3 implementation within lwIP-CE. Rather than do it in slow and unoptimized in ez80 or C myself, or just post and ask for someone to do it, I decided to try to turn it into a contest to boost engagement, challenge the community, and offer something in return.

Objective
We're looking for the fastest, most efficient, and most timing-consistent implementation of x25519 key exchange (and the required field arithmetic) for the TI-84+ CE calculator.

Timeline: Contest closes February 21st.

Prizes:
First Place: $150 Gift Card to Retailer of Your Choice
Second place: $75 Gift Card to Retailer of Your Choice
Amazon, Steam, Best Buy, Target, or any major retailer
* Qualifying submissions are welcome to have their implementations live in this repository for archival and comparison purposes with published benchmarks and author credit preserved. This is entirely optional - useful if you don't have another home for the code. Just state so in the submission.

-----

What is X25519?
X25519 is a Diffie-Hellman key exchange algorithm using Curve25519, one of the most widely-used elliptic curves for modern TLS connections. It allows two hosts to negotiate a shared key for encryption with forward secrecy.

-----

What You Need to Implement

Implement these two functions according to RFC 7748:


Code:

/**
 * @brief X25519 scalar multiplication (compute shared secret)
 *
 * @param shared_secret Output shared secret (32 bytes, little-endian)
 * @param my_private    Our private scalar (32 bytes, will be clamped internally)
 * @param their_public  Peer's public key point (32 bytes, u-coordinate)
 * @param yield_fn      Optional: callback for cooperative multitasking (may be NULL)
 * @param yield_data    Optional: context passed to yield_fn (may be NULL)
 * @return true on success, false on error (e.g., low-order point)
 */
bool tls_x25519_secret(
    uint8_t shared_secret[32],
    const uint8_t my_private[32],
    const uint8_t their_public[32],
    void (*yield_fn)(void *),
    void *yield_data
);

/**
 * @brief Generate X25519 public key from private key
 *
 * @param public_key  Output public key (32 bytes, u-coordinate)
 * @param private_key Input private scalar (32 bytes, will be clamped internally)
 * @param yield_fn    Optional: callback for cooperative multitasking (may be NULL)
 * @param yield_data  Optional: context passed to yield_fn (may be NULL)
 */
bool tls_x25519_publickey(
    uint8_t public_key[32],
    const uint8_t private_key[32],
    void (*yield_fn)(void *),
    void *yield_data
);

Note: The yield_fn callback is for network keepalives during long computations. It will be NULL during testing, but should be called every 5-10 seconds during production use. Contestant should implement the logic to call yield_fn, or at least leave a note to indicate where it should be added.

Key Requirements:

  • Clock in at 30s or less.
  • Pass ALL RFC 7748 test vectors (100% correctness required) => Failing any test vector disqualifies your submission.
  • Handle edge cases (low-order points, clamping, etc.)
  • No Hardcoded Secrets: No precomputed values specific to test vectors (algorithm constants like curve parameters and lookup tables global to the curve and NOT specific to test vectors are allowed).
  • Optional: Constant-time implementation (timing attack resistance).
  • Documentation: Include brief implementation notes explaining your approach and optimizations
  • Open Source: Code must be contributed under a permissive license (MIT/BSD compatible)


Scoring Rubric

Submissions are ranked by total points. Two highest scoring submissions win.
Submissions will be graded within the following categories:

Performance (max 85 pts):
< 30s = 10pts,
< 25s = 20pts,
< 20s = 30pts,
< 15s = 40pts
+3 points for every second under 15s
Timing Consistency (max 20 points)
* Based on your own run-time deviation:
≤50% = 5pts,
≤25% = 10pts,
≤12.5% = 20pts
Code Size (max 15 points)
≤8KB = 5pts,
≤4KB = 10pts,
≤2KB = 15pts
Memory Usage (max 15 points)
≤2KB = 5pts,
≤1KB = 10pts,
≤500bytes = 15pts
Code Quality (max 10 points)
* Readability, documentation, clever optimizations
Bonus Points
+ 10 points: Tightest timing consistency (lowest std dev) among all contestants
+ 10 points: No criticals or errors from semgrep code quality assessment.

Code:
semgrep --config auto /path/to/code

Benchmarking and testing will be done using the CEmu autotester, but submission must be verified working on hardware as well.
Code size is computed as your compiled size minus skeleton test size (9760 bytes).

Total Possible: 165 points
Tie-breaker: If two submissions have identical scores, the faster implementation wins

Optimization Hints

  • Mersenne prime (2^255 − 19), reduction optimization
  • use fastMem as scratch, just zero used region after


Submission Format
Submit your entry as a GitHub repository or archive containing the following into this thread, or email it to me at info@cagscalclabs.net. Please include your Cemetech handle in the submission somewhere.:

Code:

src/x25519.c        - \
src/x25519.asm      - | Source files
includes/x25519.h   - Header file matching API
README.md      - Brief writeup (1-2 pages)


Test Suite
The test suite is provided in the contest repository, here:


Code:

contest/tests/x25519/
├── src/main.c        - Test harness with RFC 7748 vectors
├── Makefile          - Build configuration
└── autotest.json     - Automated testing config


Test Vectors:

  • 6 RFC 7748 test vectors (various edge cases)
  • 2 performance benchmarks: 5x pubkey generations, 6x secret computation.


Running Tests:

Code:

cd contest/tests/x25519
make & make test
# Transfer to calculator or run in CEmu with autotest.json


Resources


FAQ

Q: What if I find a bug in the test suite?
A: Report it immediately via GitHub issues. If confirmed, the deadline may be extended for all participants.

Q: Can I use lookup tables?
A: Yes. Precomputed constants that are part of the algorithm (curve parameters, reduction constants) are allowed. What's NOT allowed is hardcoding values specific to test vectors.

Q: Do I need to implement constant-time operations?
A: It is not strictly required, you only need to pass RFC 7748 test vectors. But it's a fairly large component of the score.

Q: How do I clamp the scalar?
A: RFC 7748 Section 5 requires: scalar &= 0xF8 (clear bits 0,1,2), scalar[31] &= 0x7F (clear bit 255), scalar[31] |= 0x40 (set bit 254). This ensures the scalar is a multiple of 8 and in range 2^254 - 2^255.

Want to Compete?

Test Vectors: link
Questions? Contact info@cagscalclabs.net

Good luck, and may the best implementation win!
I decided to give it a try, and it is a fun challenge! My first setup was almost exactly the same as in the second paper you linked, an implementation in C using long long's, but it took around 3 minutes - I didn't time it exactly, but somewhere close to that at least.

Obviously, this is not a great choice, as it is very slow and inefficient - for example, adding a long long to a long long takes quite some time, while adding 2 BigIntegers is a very fast operation. That's why I decided to move on to the assembly language, and calculating (mod p) after every operation (add, sub, mul), such that the operations only need to apply to integers < p. After optimizing a little bit, mainly focusing on the register usage and small optimizations, I got it clocking in at ~10.058 seconds (482792828 cc). Of course, it works properly as well, as can be seen in these images (I'm not sure why it shows a much higher number than 10.058 * 32768, anybody got an idea?):



I'm still busy with optimizing the code, and I've got 2 bigger ideas in mind for the fmul routine, since that code eats up a big chunk of the total time. Maybe I can get it down to 10 seconds!

The code is less than 1KB in size, and the data + read only data combined is < 400 bytes. It's also a constant-time implementation! Given your score system, I'm getting the max amount of points - as long as there are no other participants hehe. @ACagliano, can you please share what the Sangrep code quality assessment is? The code is in assembly obviously, so how does that work out?

The code lives in a private repository, I will invite you as a member such that you can see it yourself!
Resources for learning about x25519

These are just some videos I found helpful to watch as they increased my understanding on x25519.

I wish everybody good luck with the contest. 😁
PT_ wrote:
@ACagliano, can you please share what the Sangrep code quality assessment is? The code is in assembly obviously, so how does that work out?


*Semgrep. I should probably change that in the original post.


Code:

┌──────────────┐
│ Scan Summary │
└──────────────┘
✅ Scan completed successfully.
 • Findings: 0 (0 blocking)
 • Rules run: 104
 • Targets scanned: 7
 • Parsed lines: ~100.0%
 • Scan was limited to files tracked by git
 • For a detailed list of skipped files and lines, run semgrep with the --verbose flag


I scan the project as `auto`. It isn't perfect with asm, admittedly.

Code:
semgrep --config auto /path/to/code

I also am logged in so i have the extended ruleset.
I optimized my code a bunch more, and managed to shave off 100M clock cycles! Currently the tls_x25519_secret() function takes 378,325,482 cc, equivalent to ~7.88 seconds, much better than 10.1 seconds 🙂 The code size increased a bit to ~1100 bytes, which is a bit more than 1 KB (but still far less than 2 KB). Part of the size increase is due to loop unrolling. My new code also calls the yield_fn callback every ~3 seconds, but can be configured accordingly. The sad part is that I can't think of major optimizations anymore (maybe a few small tweaks, but not something worth more than 1M cycles I guess). The new timings look like this:

I think I've hit the rock bottom for my entry, and I can't seem to optimize much anymore. Compared to my previous post I still have a huge performance save. My code is currently 1700 bytes, with 350 bytes of data. The code runs in 221M clock cycles, equivalent to ~4.6 seconds without DMA. The yield_fn callback function is called only once, after 128 loops from the 256. This logic can be changed if needed. The timing screenshots now look like this:

  
Register to Join the Conversation
Have your own thoughts to add to this or any other topic? Want to ask a question, offer a suggestion, share your own programs and projects, upload a file to the file archives, get help with calculator and computer programming, or simply chat with like-minded coders and tech and calculator enthusiasts via the site-wide AJAX SAX widget? Registration for a free Cemetech account only takes a minute.

» Go to Registration page
Page 1 of 1
» All times are UTC - 5 Hours
 
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum

 

Advertisement