Vulgate character count

TL;DR: The Sixto-Clementine Vulgate of 1592 has 3,312,498 letters in 615,596 words. There are 45,501 distinct words, of which 20,664 occur only once.

I came across someone mentioning that it's hard to find how many words or letters the Latin Vulgate version of the Bible contains. This was in some comment section somewhere; if I remember correctly it was in a discussion of how long it would take a medieval scribe to write a copy of the Bible. So, I decided to solve this issue and provide a character and word count.

There are three main editions of the Vulgate that have been official Bibles of the Catholic Church: the Sixtine, published in 1590; the Clementine, a revision published two years later in 1592; and finally the Nova Vulgata, first published in 1979, with a second edition in 1986. The Nova Vulgata is the current official version, and can be read online on the Vatican's website. The Clementine Vulgate can also be read online, on Wikisource.

None of these would have been copied by hand, though: the printing press had already been a thing for well over a hundred years by the time these official editions were published. They're still easy to count, because they exist digitally.

Then there's the question of what all to count: do I count front matter, like various preambles and introductions? Do I count the table of contents – what about page numbers? What about chapter numbers, what about verse numbers? Do I count the chapter headings (Caput 1 etc.)? Do I count the titles of the books, and if I do, do I count the entirety of Liber Genesis, Hebraice Beresith (four words) or just Genesis? Do I count all six words of Sanctum Jesu Christi, Evangelium Secundum Matthæum, or just Matthaeum? What is scripture and what is just metadata? Do I count Æ or Œ as one letter or two?

I lean to the following answers to the above questions: no front matter, no preambles, no table of contents, no page numbers (which don't even exist in a digital edition). No book titles, no chapter heading titles – these are metadata and could easily change from manuscript to manuscript: one writes caput 18, the other writes XVIII, another doesn't contain them at all; one writes Liber Deuteronomii Hebraice Elle Haddebarim, the other just Deuteronomii. Because the language is Latin, Æ and Œ count as two letters – they're ligatures, not separate letters like in Scandinavian languages – but it's easy enough to count them out separately. I pare the text down to the absolute minimum for this counting project: just the text, no numbers, no titles.

Finally, since the Vulgate is a Catholic bible, it contains all the books of the Apocrypha, as part of the Old Testament, not separated out into their own section like in the King James Version.

Clementine Vulgate (1592)

This was the easiest: I went to Wikisource and clicked the download-as-plain-text button. I got a text file, UTF-8 encoded (since it contains characters like æ and ë), 4522972 bytes in size. It needed some processing, since it contains all chapter titles (but oddly, no book titles) and verse numbers and also some front matter: "Exported from Wikisource on September 6, 2022", for example.

I transformed this text file into a numberless version like so, specified here for reproducibility:

  1. With a liberal mouse sweep all front matter was deleted.
  2. All rear matter was also deleted (the Reliqua quæ continentur section, starting with Errata and Ordo librorum [a counting of errors in previous (before the 1598 publication this digital version was based on) editions, and a listing of what order the books come in], then the Hieronymi præfationes subsection ["Jerome's prefaces", notes from the translator], then the Apocryphi libri [containing a short note of why these non-canonical books are still contained in this publication, then the Prayer of Manasses and the Third and Fourth books of Ezra], then the Appendices [a list of testimonies and a glossary or dictionary of Hebrew, Chaldean and Greek names and words], then a final "about this digital edition" section).
  3. Chapter titles were found and removed with the regular expression /Caput [[:digit:]]+/. There were 1341 such titles removed.
  4. Verse numbers were found and removed with the regular expression /[[:digit:]]+ /. There were 35558 such numbers removed.
  5. The first text in the file was now "In principio creavit Deus cælum et terram. Terra autem erat inanis et vacua, et tenebræ erant super faciem abyssi : et spiritus Dei ferebatur super aquas." and the last text was "Etiam venio cito : amen. Veni, Domine Jesu. Gratia Domini nostri Jesu Christi cum omnibus vobis. Amen."
  6. The numberlessness of the file was verified with a program I wrote that counts how many times each character occurs. There were some occurrences still remaining – 57 ones, 35 twos, 23 each of threes, fours, and fives, 20 sixes, and 12 each of sevens, eights, nines, and zeroes: a distribution consistent with there still being some verse numbers left. 229 digits left in total.
  7. Many of these leftover verse numbers were contained in Lamentationes, attached to the following word. These were removed with the regular expression /[[:digit:]]+/ (no space). The 229 digits were found in 133 total numbers. (In retrospect, I should've used that regex in the first place.)
  8. There were no digits left in the file.
  9. File size at this point is down from 4522972 bytes to 4112125 bytes – a reduction of almost 411 kilobytes, or roughly a quarter of a floppy disk.
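The numbered steps above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the tool actually used: the helper names strip_numbering and digits_left are made up, and since Python's re module has no [[:digit:]] class, [0-9] stands in for it. The sketch uses the "digits with an optional trailing space" pattern from the start, as step 7 suggests in retrospect.

```python
import re

# Hypothetical helpers; the patterns mirror steps 3, 4 and 7 above.
CHAPTER_TITLE = re.compile(r"Caput [0-9]+")
NUMBER = re.compile(r"[0-9]+ ?")  # digits, with or without a trailing space


def strip_numbering(text):
    """Remove chapter titles first, then every remaining run of digits."""
    text, chapter_count = CHAPTER_TITLE.subn("", text)
    text, number_count = NUMBER.subn("", text)
    return text, chapter_count, number_count


def digits_left(text):
    """Count ASCII digits still present (the check in steps 6 and 8)."""
    return sum(1 for ch in text if ch in "0123456789")


# A tiny made-up sample: one chapter title, one spaced verse number,
# and one verse number attached to the following word.
sample = (
    "Caput 1\n"
    "1 In principio creavit Deus cælum et terram.\n"
    "2Terra autem erat inanis et vacua"
)
cleaned, chapters, numbers = strip_numbering(sample)
print(chapters, numbers, digits_left(cleaned))
```

Removing the chapter titles before the bare numbers matters: otherwise the number in "Caput 18" would be stripped first and the title pattern would no longer match.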

This numberless version contains just the text of the scripture itself, with no verse numberings. I ran my character analysis program again (which just counts how many times each Unicode character appears), and I learned the following:

Character counts of this bare version:


With Ë merged into E, Æ expanded into AE and Œ expanded into OE, we get 415563 E's, 266558 A's and 177564 O's, and the total letter count increases from 3288691 to 3312498: each of the 23807 ligatures now counts as two letters.
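A minimal sketch of this kind of letter count, assuming Python and the standard library only. The function names are hypothetical, and folding upper and lower case together is my assumption, not something the original program is known to do.

```python
from collections import Counter
import unicodedata

# Ligatures are expanded to two letters each.
LIGATURES = {"Æ": "AE", "æ": "ae", "Œ": "OE", "œ": "oe"}


def expand_ligatures(text):
    """Replace each Æ/Œ ligature with its two-letter spelling."""
    for ligature, pair in LIGATURES.items():
        text = text.replace(ligature, pair)
    return text


def strip_diaeresis(text):
    """Merge Ë into E; NFC first so a decomposed ë is caught as well."""
    text = unicodedata.normalize("NFC", text)
    return text.replace("Ë", "E").replace("ë", "e")


def letter_counts(text):
    """Count letters only, folding case (an assumption of this sketch)."""
    return Counter(ch.upper() for ch in text if ch.isalpha())


sample = "Cælum, pœna, Israël"
before = letter_counts(sample)
merged = strip_diaeresis(expand_ligatures(sample))
after = letter_counts(merged)

# The total grows by exactly one letter per ligature expanded.
print(sum(before.values()), sum(after.values()))
print(after["E"], after["A"], after["O"])
```

Merging Ë into E leaves the total unchanged, so the whole difference between the two totals is the number of ligatures in the text.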

I then simplified the file further: all letters converted to lowercase, all ligatures expanded, all punctuation removed, all repeated spaces collapsed into one. This allows me to produce a word count: 615596 total words, 45501 distinct words.
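The normalization can be sketched like this in Python. The function name normalize is hypothetical, and treating "punctuation" as anything that is neither a letter nor whitespace is an assumption of the sketch; letters with a diaeresis are left untouched here.

```python
def normalize(text):
    """Lowercase, expand ligatures, drop punctuation, collapse whitespace."""
    text = text.lower()
    text = text.replace("æ", "ae").replace("œ", "oe")
    # Keep only letters and whitespace (assumed definition of "no punctuation").
    text = "".join(ch for ch in text if ch.isalpha() or ch.isspace())
    # split() with no arguments collapses any run of whitespace.
    return " ".join(text.split())


sample = (
    "In principio creavit Deus cælum et terram. Terra autem erat inanis "
    "et vacua, et tenebræ erant super faciem abyssi : et spiritus Dei "
    "ferebatur super aquas."
)
words = normalize(sample).split()
print(len(words), len(set(words)))
```

On the opening two verses this gives 25 words, 21 of them distinct: et appears four times and super twice.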

The most common word is et, with 50969 occurrences (no wonder that it's always written as the ampersand & in the physical book); the second most common is in, with 22937 occurrences; est has 8849, ad has 7695, non has 7363, qui has 6805, ejus has 5190, autem has 5130, de has 4871, ut has 4670, cum has 4183, dominus has 3608. There are 20664 words that occur only once; sorted alphabetically, the first five are aasbai, abaddon, abalienabit, abalienavit, abana, and the last five are zethu, zizaniorum, zoheleth, zomzommim, zuzim. Many of these hapax legomena are just inflected forms of words that also occur elsewhere in other forms: conversabitur, conversare, conversatur, conversentur and converseris all occur only once, but are all inflections of the verb converso, "I turn around".
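Frequency rankings and hapax legomena like these fall out of a plain tally. A hedged sketch in Python, with a hypothetical function name and a made-up word list standing in for the full normalized text:

```python
from collections import Counter


def word_stats(words):
    """Return the frequency table and the alphabetically sorted hapaxes."""
    counts = Counter(words)
    hapaxes = sorted(word for word, n in counts.items() if n == 1)
    return counts, hapaxes


# Stand-in input: a short, already normalized stretch of text.
words = (
    "in principio creavit deus caelum et terram "
    "terra autem erat inanis et vacua"
).split()
counts, hapaxes = word_stats(words)

print(counts.most_common(1))  # the single most frequent word
print(len(hapaxes), hapaxes[:5])
```

Note that the tally works on surface forms: terra and terram are counted as two different words, which is exactly why so many inflected forms end up as hapaxes.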

The letter K is rare in Latin, and in this book it only occurs 49 times, in the names Joakim (46 times) and Eliakim (thrice).
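Tracking down where a rare letter occurs is a small filter over the word list. A sketch in Python, with a hypothetical helper name and a made-up four-word sample:

```python
from collections import Counter


def words_containing(words, letter):
    """Tally only the words that contain the given letter."""
    return Counter(word for word in words if letter in word)


# Stand-in input, not the real text.
words = ["joakim", "rex", "eliakim", "joakim"]
with_k = words_containing(words, "k")

# Letter occurrences = word occurrences times the letter's count per word.
total_k = sum(word.count("k") * n for word, n in with_k.items())
print(with_k, total_k)
```

Because each of the two names contains a single K, the letter total is simply the sum of the two word counts, 46 + 3 = 49.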