Link to Github repo
Link to progress update 4/29/19

This is going to be a very niche use case, but ah well, I'm gonna post it anyway. As the title states, I am well underway on a translator from Ciceronian Latin into standard English on a Voyage 200. As for my motivation for this project, I have to admit it was mostly spite (a terribly good motivator), as both my classmates and my Latin teacher said it was impossible.

"ImPoSsIbLe"

The lateng() Function

The program is heavily abstracted to allow grammar rules to be easily added when necessary. At its highest level, it is divided into a grammar organizer function and a word identifier function. The top-level function, lateng(), takes a single string as input.

Code:
lateng("In pictura est puella.")

The following documentation has been quoted for posterity, but it is not representative of the current state of this program.
Sam wrote:
Immediately, the program splits the input into a list containing one word per element, converting everything to lowercase and stripping punctuation.

Code:
"In pictura est puella."
{"in","pictura","est","puella"}

The program then replaces each element with the translated version of the word, using the identify() command, which I'll cover below.

Code:
{"in","pictura","est","puella"}
{"in","picture","is","girl"}

Finally, the program applies the necessary articles, capitalization, and punctuation, then concatenates the list.

Note: This only works for crude Latin that is already in English word order. The remaining work I need to do entails rewriting lateng() to translate Latin in any word order. I've been focusing on identify() more than anything else so far.


Code:
{"in","picture","is","girl"}
"In the picture is a girl."


The identify() Function

The identify() function is a versatile function which translates any Latin word contained in its libraries into English, along with the grammatical data needed to use it. It takes a Latin word as input and returns a list whose contents depend on the part of speech.

I should note that I am aware that textbook Latin is far removed from real Latin, but this program will translate either just fine. Real Latin might be much more difficult for a person to translate, but the program does not care. Real Latin’s excess of grammatical exceptions and obscure rules is, counterintuitively, quite trivial to code in. (When() statements ftw!) This program will translate real Latin exactly as easily as it translates garbage textbook Latin.

Code:
Verb: {part of speech, English translation, conjugation, tense, person, singular/plural}
Noun: {part of speech, English translation, declension, case, gender, singular/plural}
Adj:  {part of speech, English translation, declension, case, gender, singular/plural}
Pro:  {part of speech, English translation}
Conj: {part of speech, English translation}
Prep: {part of speech, English translation}
Adv:  {part of speech, English translation}
Int:  {part of speech, English translation}

As well as taking a string input, identify() uses two massive library lists (EDIT: it now uses three) to determine word translations: wrda and wrdb. wrda contains strings holding Latin word bases, along with various part-of-speech-specific data.

Code:
//Various entries contained in wrda:

//The first digit always denotes the part of speech, while the following digits are specific to that part of speech. The digit data always takes up four characters, even if the word does not use them all.

"1320aestat aestas "
//The 1 means this is a noun, the 3 means the noun is third declension, the 2 means it is feminine, and the 0 means it is not living. The noun has two bases: the second base is the nominative singular of the word, and the first base is the base for all remaining forms.

"5xxxad "
//The 5 means this is a preposition, but since prepositions have no extra data, the unused characters are filled with x. The Latin word is "ad".

"0100ambul ambulare ambulav ambulat "
//The 0 means this word is a verb, the 1 means the verb is first conjugation, the second 0 means the verb is neither irregular nor takes dative nouns, and the third 0 means the verb is not deponent. Because it is not irregular, the verb has four principal parts: the first being the present base, then the infinitive base, then the perfect base, then simply the fourth principal part, used for perfect, pluperfect, and future perfect passive verbs.
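
For reference, decoding an entry's metadata is just string slicing; something like this (illustrative fragment, variable names made up):

Code:
© illustrative: pull the digits and bases back out of a wrda entry
"0100ambul ambulare ambulav ambulat "→e
expr(mid(e,1,1))→p  © part of speech: 0 = verb
expr(mid(e,2,1))→cj  © conjugation: 1 = first
mid(e,5)→bases  © "ambul ambulare ambulav ambulat "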

wrdb is much simpler, giving plain English translations with no appended data. Below are the wrdb entries corresponding to the wrda entries above.

Code:
//Various entries contained in wrdb:
 
"summer "
 
"to "
 
"walk walking walked "
//Verbs are unique in that they contain three English translations corresponding to various tense/person/number combinations.

The following documentation has been changed slightly since its conception. View the changes here.

The identify() command uses inString() to look for Latin bases within your input string. For instance, identify() decides that the input string "aestatem" matches "1320aestat aestas " because "aestat" is a substring of "aestatem". For wrda entries with certain parts of speech, the input string must be identical to the base for identify() to recognize it. For instance, identify() knows that "ad" means "to" but that "adeiufosdfv" does not.
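
In other words, the base test is a plain inString() call, which is also why exact-match parts of speech need the stricter check:

Code:
inString("aestatem","aestat")  © returns 1: the base matches, so "aestatem" is recognized
inString("adeiufosdfv","ad")  © also returns 1, which is why prepositions require the whole input to equal the base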

Once identify() matches the input string to a word (or doesn't, in which case it returns null), it checks whether the part of speech requires an ending beyond the base (with exceptions, such as irregular verbs and irregular nominative singular nouns). If it does, it chops the base off the input string and runs the remainder through every possible ending for that word, adding the results to a list of possible matches. For endings that can be translated in multiple ways, it simply adds every reading to the list. Below are some examples of what identify() would return.

Code:
identify("aestatem")
{"Nou","summer","Acc","Fem","Sin"}
//This noun means summer, it is in the accusative case, it is feminine, and it is singular.

identify("aestatibus")
{"Nou","summer","DatAbl","Fem","PluPlu"}
//This is the same noun with a different ending that has multiple meanings. It can be dative or ablative plural.

identify("ambulati eramus")
{"Ver","had been walked","1st","Plu","Pas","1st","Plu"}
//This verb translates as "had been walked", is 1st conjugation, pluperfect tense, passive voice, 1st person, and plural.

identify("ad")
{"Pre","to"}
//This preposition simply means "to".
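
For the curious, the ending check amounts to slicing the base off and comparing the leftover against an ending table. A hypothetical fragment for a third declension noun (the real tables aren't shown here, and these variable names are made up):

Code:
© hypothetical: match the leftover ending of a 3rd declension noun
"aestatem"→s
mid(s,dim("aestat")+1)→e  © leftover ending: "em"
{"em","is","i","e","es","um","ibus"}→tbl
{"Acc","Gen","Dat","Abl","NomAcc","Gen","DatAbl"}→cas
For i,1,dim(tbl)
  If e=tbl[i]
    Disp cas[i]  © prints "Acc" for "em"
EndFor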

Thus this program will translate any Latin thrown at it, granted that the words are contained within the word library. This is the biggest limitation, because I have to log all the words by hand into my Voyage 200. I've made a shell program to assist me, but it's still tedious. Another annoyance is the fact that lists are not random access, meaning it takes significantly longer to identify "villa" than it does to identify "ager" (wrda is sorted alphabetically). With these in mind, the calculator ends up taking around three minutes to translate a full sentence. "In pictura est puella" took a minute and a half. I'll put the source code up whenever I get around to it.
This looks like a cool program! A fun test would be to see if it could translate an entire book of the Aeneid or something, and how long it would take. How are you logging the words into your V200? Are you taking them out of a dictionary? I love what you're doing here. Keep it up! Very Happy

Also, will this support the TI-89?
Sam wrote:
Another annoyance is the fact that lists are not random access, meaning it takes significantly longer to identify "villa" than it does to identify "ager" (wrda is sorted alphabetically). With these in mind, the calculator ends up taking around three minutes to translate a full sentence. "In pictura est puella" took a minute and a half. I'll put the source code up whenever I get around to it.


Have you considered using a binary search algorithm for finding words? It may help a bit. (Though only if you're using ASCII, as AMS's string comparison operators are buggy with character codes >127, such as accented letters.) Even so, I suspect the list lookup routines on 68K calcs are not particularly efficient; they probably do a linear scan through the entire list in order to find the nth item.

Another consideration you may want to be aware of before you get too far is that in my experience, TI-BASIC has problems as soon as a variable grows beyond a certain size. Technically there is a hard limit of about 64K per variable, but you may encounter memory errors working with variable sizes of only half of this or less. When I wrote programs that stored large datasets in lists and matrices, I ended up having to break them up into several variables. If you separated your database into separate variables in some way (for instance, one for each starting letter of the word), this may improve performance and defer any memory issues you might encounter if the dictionary grows large. The # (indirection) operator or the expr() function come in handy when using schemes like this.
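
For example, a hypothetical per-letter scheme (the dct* names are made up):

Code:
© hypothetical: one library list per starting letter (dcta, dctb, ...)
mid(word,1,1)→c  © first letter of the word, e.g. "a"
#("dct"&c)→lib  © indirection: fetches the list stored in dcta
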
RogerWilco wrote:
This looks like a cool program! A fun test would be to see if it could translate an entire book of the Aeneid or something, and how long it would take.
That would indeed take a while! If I were to approximate it off the top of my head, I'd say with the vocabulary and grammar knowledge required, combined with the Aeneid's massive length of 9896 lines of dactylic hexameter, it's probably a 9 month job or so.

Quote:
How are you logging the words into your V200? Are you taking them out of a dictionary?
Good question! The translator currently supports a measly 43 words, which is the whole vocabulary list of Chapter 1 of the Ecce Romani 1 textbook. Since this is primarily for Latin I'll be using in those classes, the most words I'll enter en masse would be the entire Ecce Romani 2 vocabulary index. As the list gets larger, I'll implement some sort of hash feature which would allow the program to search for words quicker. That's hard though.

Quote:
I love what you're doing here. Keep it up! Very Happy

Also, will this support the TI-89?
Thanks, man. It sure will!
Sam wrote:
That would indeed take a while! If I were to approximate it off the top of my head, I'd say with the vocabulary and grammar knowledge required, combined with the Aeneid's massive length of 9896 lines of dactylic hexameter, it's probably a 9 month job or so.
[...]
Thanks, man. It sure will!


Awesome! Thanks! Once you complete this program and release it to the public, I might try to translate the first book of the Aeneid using an emulated TI-89. I thought I recognized the test line you were using. Very Happy
RogerWilco wrote:
Awesome! Thanks! Once you complete this program and release it to the public, I might try to translate the first book of the Aeneid using an emulated TI-89. I thought I recognized the test line you were using. Very Happy
This does remind me that I have quite insufficient error handling as of now, so I need to fix that. If you tried to translate it now, it would probably give you a data type error in the first minute.

Also, curse you TI for not giving us any good debugging tools.
Sam wrote:
This does remind me that I have quite insufficient error handling as of now, so I need to fix that. If you tried to translate it now, it would probably give you a data type error in the first minute.

Also, curse you TI for not giving us any good debugging tools.


Cool! Very Happy

I know, right? I'm working on something for the 85 right now and all I have to work with is *shudders* Graph Link. I'm using SourceCoder to make my code more readable.
Quote:
Another annoyance is the fact that lists are not random access, meaning it takes significantly longer to identify "villa" than it does to identify "ager" (wrda is sorted alphabetically).

Could you try getting around this by having 26 lists, one for each letter, so that each list is much shorter than the original long one? Not sure if this would make a difference or not. Keep up the good work!
Legoman314 wrote:
Could you try getting around this by having 26 lists, one for each letter, so that each list is much shorter than the original long one? Not sure if this would make a difference or not. Keep up the good work!
I was thinking the same thing, but it just seems really hacky, and I'm convinced there's got to be a better answer. The issue with 26 lists is that they aren't 26 lists of identical length. The "a" list will probably be hundreds of times longer than the "z" list, which is why I'm thinking about some sort of hashing system where maybe there's 64 lists of similar length. We'll see. If you have any revelations about how I can fix this problem prettily, please tell me.
Travis wrote:
Have you considered using a binary search algorithm for finding words? It may help a bit. (Though only if you're using ASCII, as AMS's string comparison operators are buggy with character codes >127, such as accented letters.) Even so, I suspect the list lookup routines on 68K calcs are not particularly efficient; they probably do a linear scan through the entire list in order to find the nth item.
I'm not sure what you mean by a binary search algorithm, but it seems like it could be very useful. How would I implement this?
Quote:
Another consideration you may want to be aware of before you get too far is that in my experience, TI-BASIC has problems as soon as a variable grows beyond a certain size. Technically there is a hard limit of about 64K per variable, but you may encounter memory errors working with variable sizes of only half of this or less. When I wrote programs that stored large datasets in lists and matrices, I ended up having to break them up into several variables. If you separated your database into separate variables in some way (for instance, one for each starting letter of the word), this may improve performance and defer any memory issues you might encounter if the dictionary grows large. The # (indirection) operator or the expr() function come in handy when using schemes like this.
You aren't the first to suggest the alphabetical lists, lol. I could indeed do that, but it just seems so hacky. I'd rather have a variable number of lists with a finite number of elements per list.
Sam wrote:
I was thinking the same thing, but it just seems really hacky, and I'm convinced there's got to be a better answer. The issue with 26 lists is that they aren't 26 lists of identical length. The "a" list will probably be hundreds of times longer than the "z" list, which is why I'm thinking about some sort of hashing system where maybe there's 64 lists of similar length. We'll see. If you have any revelations about how I can fix this problem prettily, please tell me.


I would probably implement the word storage using linked lists. You can find some routines for linked lists in Michael Abrash's Graphics Programming Black Book, in the section entitled "Linked Lists" (written in C, of course, but they're pretty easy to understand).

Here are my own thoughts.
First, you need a format for the links between records. Since you're doing it with many separate lists that you want to use as general memory, I'd recommend something like this: XX###, where XX is a two-character list name and ### is the list element. (The indirection operator makes this a lot easier...) Second, to make sure it can actually find words alphabetically quickly, I would have a list that simply stores links to the first word alphabetically for each letter (i.e. the word that starts with C that would alphabetically come before all of the other words that start with C). To make sure memory is reused if you ever have to remove entries, I'd recommend a couple more lists: one that contains the set of orphaned links (i.e. ones that have been deleted but are in the middle of a list), and another that stores the next free link for each list (the next location you can store to in each of the lists).

I'd recommend using linked lists for this because it's so much easier to add records in the middle of a list and such. If you need to look at the routines, take a look at this: http://www.jagregory.com/abrash-black-book/#linked-lists.
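
A hypothetical example of following one of those XX### links:

Code:
© hypothetical: follow the link "wa042" to element 42 of list wa
"wa042"→lnk
mid(lnk,1,2)→nm  © list name: "wa"
expr(mid(lnk,3))→i  © element number: 42
#(nm)→tmp
tmp[i]→rec  © the linked record
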
Thank you for all the suggestions! I think I've arrived at my solution to the word searching issue. As I said earlier, the current unoptimized version uses two lists, wrda and wrdb, to hold the data. It uses inString() to look for the bases of wrda as substrings of the input string.

What I plan on doing is introducing a third variable, but this time an unbroken string, which I can search with inString() without needing to run it through my homemade For loop. The string will be made of thousands of entries of different lengths, but that's OK because multiple entries can point to the same wrda element. The first four digits of an entry are a numerical value denoting the position in wrda that the entry points to, followed by the base of the word. Then I'll use inString() backwards, looking for the input string within the massive index string. If it doesn't find it, it chops a character off the input string until it does. This is massively faster than the current solution, as it basically offloads all the work to the OS's built-in functions. More importantly, I'll be accessing an element of wrda once instead of hundreds of agonizingly slow times. Here's an example of what the first few elements of wrda and wrdindex would be:
Code:
//wrda
{"5xxxad " "1320aestat aestas " "1210agr ager "}

//wrdindex
"0001ad0002aestat0002aestas0003agr0003ager"
This also effectively solves the issue of scaling while keeping memory safe, because it's trivial to split wrda into multiple lists of 500 or so elements and just use # to point the program to different ones, or even to split wrdindex into multiple strings.

I'll have the calc generate that string now, and I'll get back with results.
Sam wrote:
I'm not sure what you mean by a binary search algorithm, but it seems like it could be very useful. How would I implement this?


https://en.wikipedia.org/wiki/Binary_search_algorithm

The basic idea is to compare the item to find with the middle value in the list. Assuming the list is sorted, if the items don't match, then the item we're looking for must be in either the first or second half of the list, depending on whether the middle item is less than or greater than what we were searching for. So we take that half of the list and repeat the process, checking the middle item of the half, breaking that down into another half-list, and so on, until either the item is found or we've checked two adjacent items and determined that the item isn't there at all, since it would have been between those two items.
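
A minimal 68k version might look like this (untested sketch; it assumes the list is sorted, and per the caveat above, plain-ASCII strings). It returns the element number, or 0 if the key isn't there:

Code:
binsrch(key,lst)
Func
Local lo,hi,m
1→lo
dim(lst)→hi
While lo≤hi
  intDiv(lo+hi,2)→m
  If lst[m]=key Then
    Return m  © found: return the element number
  ElseIf lst[m]<key Then
    m+1→lo  © key sorts after the middle element
  Else
    m-1→hi  © key sorts before the middle element
  EndIf
EndWhile
Return 0  © not found
EndFunc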

Quote:
You aren't the first to suggest the alphabetical lists, 0x5. I could indeed do that, but it just seems so hacky. I'd rather have a variable number of lists with a finite number of elements per list.


Yeah, ideally the hashing algorithm would yield an even distribution. This just happens to be one of the easiest. Anything more sophisticated would likely make the program even slower than it already is.

Sam wrote:
Thank you for all the suggestions! I think I've arrived at my solution to the word searching issue. As I said earlier, the current unoptimized version uses two lists, wrda and wrdb, to hold the data. It uses inString() to look for the bases of wrda as substrings of the input string.


Ah, that's an interesting solution. I suspect searching strings will be much faster and have less overhead than lists or matrices. I would suggest separating each entry in the string with a character that will never appear in a word and include that in the search string. That way it won't be confused if one word contains part of another (for example, if you have a word “abcd” and a word “abcdef”, you don't want it to find “abcd” when searching for “abcdef”).
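
For instance (using "/" as a hypothetical separator):

Code:
© every entry ends with "/": "0001ad/0002aestat/0002aestas/..."
inString(wrdindex,s&"/")  © a chopped input can now only match a complete base
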
Here is a Github repo with all the source code in plaintext and in TI format.

EDIT 4/25/19 0816: The word index works incredibly well. The word search time has dropped from minutes to a few dozen milliseconds. With this in mind, I am now working on preparing the program for use with extremely large libraries. There will end up being dozens of library vars in the end.
Progress Update:

Over the past few days I have been working to improve the operation of lateng(), eventually to the point of fully autonomous word reordering and translation. Latin is a language of very minimal word order, whereas English has a very strict sentence structure, so the exercise here was to help the calculator recognize which words go where, and which words go with other words. This was made slightly easier by the fact that some types of words can always be linked to another word, and can therefore be treated as one word. Adjectives can be linked to their respective nouns and treated as a single noun, for instance. Anyway, here's a rudimentary ruleset that lateng() has been taught.

Word Order: Nominative -> Verb -> Accusative (+dative if special verb) -> Prepositional phrase/ablative exceptions/dative exceptions/accusative exceptions

Adverbs can go anywhere in the sentence, and thus are placed at the end
Participles are treated as adjectives that follow the noun
If there is more than one verb, there is more than one clause: look for conjunctions and punctuation

If a word has more than one possible meaning:

Nominative or accusative?
1) Already have nom? Acc. Already have acc? Nom.
2) Check exceptions

Dative or ablative?
1) Prepositional phrase? Abl.
2) Evil verb direct object? Dat.
3) Comparative? Abl.
4) Can I date it? Yes=dat. No=abl.

As far as articles go and when to use which (when to say a girl or the girl), there isn't really a Latin way to differentiate these things, though we know that English draws a great deal of distinction between "a" and "the". The system I have devised simply uses "a" if the noun has not yet been mentioned in a previous sentence, and uses "the" every other time. These article rules have proven to have many exceptions, however, so I have some work to do.
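
In sketch form (untested; seen is a list of nouns already used, nn the current noun, both names made up):

Code:
© sketch: "a" on first mention, "the" afterwards
0→f
For i,1,dim(seen)
  If seen[i]=nn
    1→f  © noun was already mentioned
EndFor
If f=1 Then
  "the"→art
Else
  "a"→art
  augment(seen,{nn})→seen  © remember the noun for later sentences
EndIf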

Finally, the remainder of my work has been directed towards logging words and making the program able to handle a dummy thicc word library in the future. The program now automatically parses the library vars in groups of 500, so I won't be getting memory problems. Also, an update on speed: The program's speed has increased by an order of magnitude, shortening the hypothetical translation of the Aeneid from 9 months to 24 days by my estimation. I hope to bring this down to within a day in the future.
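
For the curious, finding an element across the split vars is just arithmetic (sketch only; wrda1, wrda2, ... are hypothetical var names):

Code:
© sketch: element n of the big library lives in wrda1, wrda2, ...
intDiv(n-1,500)+1→v  © which 500-element list
mod(n-1,500)+1→i  © position within that list
#("wrda"&string(v))→tmp
tmp[i]→e
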
Awesome work! Some of this stuff like the mechanics of the program could go in the README. Very Happy

One question I have is how will you differentiate between second and third declension verbs? Their infinitive forms are the same save the fact that the second declension has a long e as opposed to a short e. I forget if this is carried when declining or not. I guess a better question would be how you are planning to deal with long marks and long vowels in general. Smile
RogerWilco wrote:
One question I have is how will you differentiate between second and third declension verbs? Their infinitive forms are the same save the fact that the second declension has a long e as opposed to a short e. I forget if this is carried when declining or not. I guess a better question would be how you are planning to deal with long marks and long vowels in general. Smile
The verb conjugations are in the wrda metadata, so the program will always know what a given conjugation is.
Sam wrote:
The verb conjugations are in the wrda metadata, so the program will always know what a given conjugation is.


Ok, thanks for the info!

Sorry about calling them declensions. As a Latin III student I should know that the correct term is conjugation.
  