Kay and Cee

2022-01-03

I was reading the Wikipedia article on the tap code, a simple method of communicating with taps or knocks only, perhaps best known in the Anglophone world for being used by American prisoners of war in Vietnam to communicate while in separate cells and forbidden from speaking. It turns each letter into a pair of numbers from 1 to 5, using an easy-to-memorize table of letters that is just organized alphabetically.
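To make the letter-to-pair idea concrete, here is a minimal Python sketch. It assumes the usual 5×5 layout in which K shares a cell with C; the names ROWS, tap_encode and tap_render are made up for illustration.

```python
# Tap code table: 5 rows of 5 letters in alphabetical order, K folded into C.
ROWS = ["ABCDE", "FGHIJ", "LMNOP", "QRSTU", "VWXYZ"]

def tap_encode(text):
    """Return a list of (row, column) pairs, each number from 1 to 5."""
    pairs = []
    for ch in text.upper():
        if ch == "K":
            ch = "C"  # K has no cell of its own
        for r, row in enumerate(ROWS, start=1):
            c = row.find(ch)
            if c != -1:
                pairs.append((r, c + 1))
                break
        # Characters outside the table (spaces, digits, ä, ö...) are skipped.
    return pairs

def tap_render(pairs):
    """Render each pair as dots: row taps, a space, column taps."""
    return "  ".join("." * r + " " + "." * c for r, c in pairs)

print(tap_encode("water"))  # [(5, 2), (1, 1), (4, 4), (1, 5), (4, 2)]
print(tap_render(tap_encode("hi")))
```

Skipping unknown characters is just one possible choice here; a real scheme would need some convention for spaces and digits.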

However, ever since the status of J and the distinction between U, V and W was finalized, the Latin alphabet has had the unfortunate property of having 26 letters, and 26 doesn't divide very nicely at all: it's just 2×13. The knock code, and similar letters-to-numbers systems like the Polybius square, need to do away with a letter or two to get down to an easier-to-divide 25 or 24 letters.

The early Cooke and Wheatstone needle telegraph had to get down to 20 letters, and so did away with C, J, Q, U, X and Z; with the ekseption of U, I can get behind eliminating these, as the letters left do fit English kwite well, with svbstitvtions rather easily made, and the only sownds hard to write wovld be word-initial [z] (like in zoo, zinc, zip, zygote, zigzag, Zoe, zucchini, xeno-, xenon (although I tend to pronounce initial X as [ks])), [dʒ] (as in job, juice, major, object, reject, although these could just as well be written iob, iuice, maior, obiect, reiect, as was done before J became a letter in its own right), and the very kommon ch-sound [tʃ], perhaps the greatest loss kawsed by the elimination of C: does one then write whitsh for which, tshoise for choice, sutsh and mutsh for such and much, eatsh or eetsh for each? Point being, eliminating C isn't a problem for those words where it's used for [s] or [k] (or both, as in accept), but rather those where CH appears, because TSH is just a little unwieldy.

The Polybius square shown in the Wikipedia article merges J with I, which works well for English; I can't think of many words where one couldn't guess that an I is read as a J, or where a JI or IJ sequence exists, except for names like Jim and Dijkstra.

But the tap code is perhaps the most interesting: it merges K into C, using C wherever K would be. I tried this out myself, writing a small program that replaces kays with cees, and I was surprised by how little text changed: K really isn't that common a letter, although it's used in some very common words, and in many cases the substitution produces a word that, although strange, still reads fine, with a [k] sound: parc, porc, ducc, thicc, picc, bacc, darc, thinc, Cansas, New Yorc, cale, catacana, coala, sicco. Many of these words feel French now!
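The substitution itself takes only a few lines. The following is a sketch in Python and not necessarily the program I used; kay_to_cee and changed_share are illustrative names.

```python
# Replace every k with c, preserving case; everything else passes through.
K_TO_C = str.maketrans("kK", "cC")

def kay_to_cee(text):
    return text.translate(K_TO_C)

def changed_share(text):
    """Fraction of characters that the substitution alters."""
    if not text:
        return 0.0
    return sum(ch in "kK" for ch in text) / len(text)

print(kay_to_cee("Thick pork in a dark park in Kansas"))
# Thicc porc in a darc parc in Cansas
```

Running changed_share over a longer text gives a rough number for how little actually changes.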

It's the words where the c in ce, ci and cy needs to stand for [k] that readings get difficult. This breaks the commonly quite correct "rule" (altho there are few truly good rules when it comes to English spelling) that c followed by e, i, y stands for [s] – but it's not entirely without precedent: soccer and sceptic come to mind, as does Celtic, as in the language family or culture, not the football club. (Interestingly, tangentially, c does always stand for [k] in the Celtic languages, because they adopted the Latin alphabet before the sound changes turned c before front vowels into an [s]-like affricate or sibilant.)

If the parties communicating are aware of the substitution, I don't think it would cause too much confusion. It'd look very weird, for sure, but one could get used to it, recognizing patterns where c + [eiy] stands for k, such as word-final -cing or -ced (networcing, thincing, freshly-baced coocies), or especially -ccing and -cced (a locced door, thiccening a sauce, briccing up windows) (yes, I'm aware of the secondary meanings of thicc).
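The difficult cases can be picked out mechanically: they are the words where a k sits directly before e, i or y in the original spelling. A rough Python sketch, with SOFT_CONTEXT and hard_to_read as made-up names and a hand-picked sample list:

```python
import re

# A k directly before e, i or y turns into ce, ci or cy after the
# substitution, which is where the usual "soft c" reading gets in the way.
SOFT_CONTEXT = re.compile(r"k(?=[eiy])", re.IGNORECASE)

def hard_to_read(words):
    """Keep only the words whose substituted form breaks the soft-c rule."""
    return [w for w in words if SOFT_CONTEXT.search(w)]

sample = ["park", "networking", "baked", "cookies", "locked", "koala", "spiky"]
print(hard_to_read(sample))
# ['networking', 'baked', 'cookies', 'locked', 'spiky']
```

This only looks at spelling, so it says nothing about how the word is pronounced; it just flags where a reader has to override the habit.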

Collisions or ambiguous pairs of course exist, but a lot of them involve a less-common word, and the correct reading should be deducible from context. To find some, I used this rather extensive word list, and the commands tr kK cC < words.txt | sort | uniq -d.
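The pipeline only prints the substituted spelling that occurs more than once. To also see which original words collide, something like the following Python sketch would do; the function name collisions and the sample list are illustrative and have nothing to do with the actual word list.

```python
from collections import defaultdict

K_TO_C = str.maketrans("kK", "cC")

def collisions(words):
    """Map each substituted spelling to the distinct originals behind it."""
    groups = defaultdict(set)
    for word in words:
        groups[word.translate(K_TO_C)].add(word)
    # Keep only spellings that two or more different words collapse into.
    return {key: sorted(originals)
            for key, originals in groups.items()
            if len(originals) > 1}

sample = ["lace", "lake", "talc", "talk", "park", "Luce", "Luke"]
print(collisions(sample))
```

Unlike uniq -d, this ignores a word that simply appears twice in the list, since the originals are collected into a set. Like the tr version, it is case-sensitive.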

Many, perhaps most, of the collisions were of proper nouns, technical terms, and transliterations (especially from Greek or Biblical Hebrew) where a source-language κ or כ got turned into a c via Latin, but into a k via a different language. The list I used also seemed to contain a bunch of loans from other European languages: Athabasca, cabbalah, Calmar, Corea, dance/danke, leucemia, samech.

Some genuine ambiguous pairs: braces/brakes, talc/talk (interesting because the c stands for [k] in both, but the vowel changes), cite/kite, CO/KO, C-shaped/K-shaped (dependent on the grapheme itself), CSU/KSU, dice/dike, face/fake, flocc/flock (homophones, but with a semantic difference), forced/forked and forcing/forking, lace/lake, lice/like, Luce/Luke, mace/make, race/rake, spicy/spiky, UC/UK.

Some pairs where the [k] reading is more sensible, but the [s] reading also kinda makes sense: cissing (to kiss, as opposed to make cis), peaced and peacing (climb a mountain versus solve a conflict), puce (vomit, rather than "a brownish-purple color"), trice (a three-wheeled bike, rather than "to drag or haul"). The opposite case: spake is quite archaic, so space is probably the stuff above the sky.

To illustrate how little many texts change, I collected a bunch on this supplementary page.

The above, of course, only holds for English, where c is a particularly hard-working letter. In my other language, Finnish, c is rarely used, only in loanwords and proper nouns. The basic de facto rule as to its pronunciation is as in English: as k before a, o, u, and as s before e, i; the Finnish alphabet additionally contains the vowels y, å, ä, ö (of which å is only used in names and loanwords from Swedish), and nobody really knows how to pronounce c before them. This changes from speaker to speaker, of course, and there are exceptions, for example a learnèd person would pronounce the element name cesium as /ˈke̞ːsium/, since the source language is (Neo-)Latin. I've encountered people who pronounce all cees (and zees, for that matter) as the affricate /tʃ/, itself a foreign sound to Finnish but easily produced by most, or the native consonant sequence /ts/.

A tangent: /ʃ/ and /tʃ/ are easy foreign consonants for the Finnish-speaking mouth, compared with other English consonants like /dʒ/, /ɡ/ or /b/, with j-containing words often just pronounced with a /j/, the English "y-sound" (e.g. the singer Juice Leskinen's first name is pronounced /ˈjuise̞/); most speakers (especially younger, less than, say, 50) do know the difference between /b/ and /p/ and /k/ and /ɡ/, and when speaking carefully can make that distinction, but since it rarely matters it's just easier to pronounce them with /k/ and /p/.

Anyway, so c isn't a commonly-used letter at all, but given a text where all ks have been replaced with cs, a Finnish-speaker would easily recognize and adapt to this change, even with the front vowels; there's basically no chance of confusion. Interestingly, the vibe one gets is one of firstly a foreign accent (I, for one, automatically read a k→c-substituted text by pronouncing c as an aspirated k, [kʰ], like in English); and secondly, of very old text, from an earlier stage of the language, like how an Englishe-speakere feeleth abowt readyng a texte with extra es at the endes of wordes and variouse spelyng mistayckes and thou and verbs ending in -th and -st; basically the ye olde vibe. The ye olde vibe is enhanced or also provoked by substituting every v (also a common letter) with w. The reason for this is that the first dude to write down substantial texts in Finnish, Mikael Agricola, who translated the Bible etcetera in the 16th century, used a mix of Latin, German and Swedish orthography for Finnish, plus in a dialect that's not the modern book language, plus during a time when a few sound changes were going on (for example the loss of /θ/ and the diphthongization of /o̞ː/ into /uo̞/); it took a few centuries for spelling to standardize and settle down into the modern, regular form, and even then w was regularly used instead of v into the 20th century, possibly partly because Finnish was typeset in blackletter type where u and v look almost identical.

I was curious as to how often c occurs in my own Finnish writing, and so I took all the blog posts I have here and did a quick letter frequency calculation. I found that the Finnish equivalent of English etaoin shrdlu is aitnes lkoäum, with 12% of letters being a, 11% being i, 9% being t, and 8% being n; this makes sense, considering what case endings look like and that the most common word is probably ja, "and"; no real surprises.

What was more surprising was that the dash, used to separate the parts of certain compound words, was slightly more common than the most common nonnative letter, which was c; I write with a lot of English names and loanwords (I have posts on both the Transtech Artic tram and on Discord), so c appears more often in my text than in others'. Also surprising was how little ö gets used; I always suspected and frankly knew that it was the least common vowel, but 0.37% is a lot less than 1.64%, the frequency of y. G is the least-used semi-native letter: it's used in the ng digraph, and in a lot of loanwords, like mega, digi, kg, vege, algoritmi.

A	11.9	J	1.72
I	10.7	Y	1.64
T	8.97	H	1.53
N	8.38	D	1.13
E	7.69	-	0.46
S	7.13	C	0.42
L	5.83	Ö	0.37
K	5.65	G	0.37
O	5.46	B	0.23
Ä	4.74	F	0.21
U	4.63	W	0.09
M	3.35	X	0.08
R	2.74	Z	0.04
V	2.40	Q	0.02
P	2.06	Å	0.01

Letter frequency in percent of the Finnish section of my blog, as of today; n=78950.
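A count like the one in the table can be sketched in a few lines of Python. This is an illustration rather than the exact script behind the numbers above: it assumes that letters are case-folded and that the hyphen is the only non-letter counted, and letter_frequencies is a made-up name.

```python
from collections import Counter

def letter_frequencies(text):
    """Return ([(symbol, percent), ...] most common first, total count)."""
    # Assumption: count letters (including ä, ö, å) plus the hyphen,
    # ignoring case; spaces, digits and other punctuation are dropped.
    counted = [ch for ch in text.upper() if ch.isalpha() or ch == "-"]
    total = len(counted)
    counts = Counter(counted)
    table = [(ch, 100 * n / total) for ch, n in counts.most_common()]
    return table, total

table, n = letter_frequencies("Hyvää päivää, Juice-fanit!")
for symbol, percent in table[:5]:
    print(f"{symbol}\t{percent:.2f}")
print(f"n={n}")
```

Dividing by the number of counted symbols rather than by all characters is what makes the percentages sum to 100.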
