Vulgate character count

6 September 2022

TL;DR: The Sixto-Clementine Vulgate of 1592 has 3,312,498 letters in 615,596 words. There are 45,501 distinct words, of which 20,664 occur only once.

I came across someone mentioning that it's hard to find how many words or letters the Latin Vulgate version of the Bible contains. This was in some comment section somewhere; if I remember correctly it was in a discussion of how long it would take a medieval scribe to write a copy of the Bible. So, I decided to solve this issue and provide a character and word count.

There are three main editions of the Vulgate that have been official Bibles of the Catholic Church: the Sixtine, published in 1590; the Clementine, a revision published two years later in 1592; and finally the Nova Vulgata, published in 1986. The Nova Vulgata is the current official version, and can be read online on the Vatican's website. The Clementine Vulgate can also be read online, on Wikisource.

None of these would have been copied by hand, though: the printing press had already been a thing for well over a hundred years by the time the first of these official editions was published. They're still easy to count, though, because they exist digitally.

Then there's the question of what all to count: do I count front matter, like various preambles and introductions? Do I count the table of contents – what about page numbers? What about chapter numbers, what about verse numbers? Do I count the chapter headings (Caput 1 etc.)? Do I count the titles of the books, and if I do, do I count the entirety of Liber Genesis, Hebraice Beresith (four words) or just Genesis? Do I count all six words of Sanctum Jesu Christi, Evangelium Secundum Matthæum, or just Matthæum? What is scripture and what is just metadata? Do I count Æ or Œ as one letter or two?

I lean to the following answers to the above questions: no front matter, no preambles, no table of contents, no page numbers (which don't even exist in a digital edition). No book titles, no chapter heading titles – these are metadata and could easily change from manuscript to manuscript: one writes caput 18, the other writes XVIII, another doesn't contain them at all; one writes Liber Deuteronomii Hebraice Elle Haddebarim, the other just Deuteronomii. Because the language is Latin, Æ and Œ count as two letters – they're ligatures, not separate letters like in Scandinavian languages – but it's easy enough to count them out separately. I pare the text down to the absolute minimum for this counting project: just the text, no numbers, no titles.

Finally, since the Vulgate is a Catholic bible, it contains all the books of the Apocrypha, as part of the Old Testament, not separated out into their own section like in the King James Version.

Clementine Vulgate (1592)

This was the easiest: I went to Wikisource and clicked the download-as-plain-text button. I got a text file, UTF-8-encoded (since it contains characters like æ and ë), 4522972 bytes in size. It needed some processing, since it contains all chapter titles (but oddly, no book titles) and verse numbers and also some front matter: "Exported from Wikisource on September 6, 2022", for example.

I transformed this text file into a numberless version like so, specified here for reproducibility:

With a liberal mouse sweep all front matter was deleted.
All rear matter was also deleted (the Reliqua quæ continentur section, starting with Errata and Ordo librorum [a counting of errors in previous (before the 1598 publication this digital version was based on) editions, and a listing of what order the books come in], then the Hieronymi præfationes subsection ["Jerome's prefaces", notes from the translator], then the Apocryphi libri [containing a short note of why these non-canonical books are still contained in this publication, then the Prayer of Manasses and the Third and Fourth books of Ezra], then the Appendices [a list of testimonies and a glossary or dictionary of Hebrew, Chaldean and Greek names and words], then a final "about this digital edition" section).
Chapter titles were found and removed with the regular expression /Caput [[:digit:]]+/. There were 1341 such titles removed.
Verse numbers were found and removed with the regular expression /[[:digit:]]+ /. There were 35558 such numbers removed.
The first text in the file was now "In principio creavit Deus cælum et terram. Terra autem erat inanis et vacua, et tenebræ erant super faciem abyssi : et spiritus Dei ferebatur super aquas." and the last text was "Etiam venio cito : amen. Veni, Domine Jesu. Gratia Domini nostri Jesu Christi cum omnibus vobis. Amen."
The numberlessness of the file was verified with a program I wrote that would count how many times each character occurred. There were some occurrences still remaining – 57 ones, 35 twos, 23 each of threes, fours, and fives, 20 sixes, and 12 each sevens, eights, nines, and zeroes: a distribution consistent with there still being some verse numbers left. 229 digits left in total.
Many of these leftover verse numbers were contained in Lamentationes, attached to the following word. These were removed with the regular expression /[[:digit:]]+/ (no space). The 229 digits were found in 133 total numbers. (In retrospect, I should've used that regex in the first place.)
There were no digits left in the file.
File size at this point is down from 4.5 megabytes to 4112125 bytes – a reduction of almost 411 kilobytes, or a quarter of a floppy disk.
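The number-stripping steps above can be sketched in Python roughly as follows. This is an illustration and not the procedure actually used, which was carried out with an editor and regular-expression search-and-replace. The function name strip_numbers and the sample string are made up for the example, and since Python's re module does not support the POSIX class [[:digit:]], the sketch uses [0-9] instead.

```python
import re
from collections import Counter

# Python's re has no [[:digit:]], so [0-9] stands in for it here.
CHAPTER_TITLE = re.compile(r"Caput [0-9]+")   # chapter headings
VERSE_SPACED = re.compile(r"[0-9]+ ")         # verse numbers followed by a space
VERSE_ATTACHED = re.compile(r"[0-9]+")        # leftovers glued to the next word


def strip_numbers(text):
    """Remove chapter titles and verse numbers; report how many were removed.

    strip_numbers is a hypothetical helper name, not part of any library.
    """
    text, titles = CHAPTER_TITLE.subn("", text)
    text, spaced = VERSE_SPACED.subn("", text)
    # Census of digits that survived the first two passes.
    leftover = Counter(ch for ch in text if ch in "0123456789")
    text, attached = VERSE_ATTACHED.subn("", text)
    stats = {
        "titles": titles,
        "spaced": spaced,
        "attached": attached,
        "leftover_digits": sum(leftover.values()),
    }
    return text, stats


if __name__ == "__main__":
    # Invented sample, shaped like the export: a heading, two spaced verse
    # numbers, and one number attached to the following word.
    sample = "Caput 1\n1 In principio. 2 Terra autem.\n3Aleph. Quomodo"
    cleaned, stats = strip_numbers(sample)
    print(stats)
    print(cleaned)
```

Running the three patterns in this order mirrors the order of the steps in the list; the digit census in the middle corresponds to the verification step that turned up the 229 stray digits.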

This numberless version contains just the text of the scripture itself, with no verse numberings. I ran my character analysis program again (which just counts how many times each Unicode character appears), and I learned the following:

There were 16506 newlines – there are a lot of them between chapters and between books, and four blank lines before all of the text and one blank line after. Chapters don't often have paragraph breaks in them, so lines are long.
There were 642853 spaces.
In addition to the twenty-five letters A to Z (W doesn't occur, while K does, 49 times) there occur the ligatures Æ (22863 times, of which 878 are uppercase) and Œ (always lowercase, 956 times) and the one letter with diacritics, ë.
As for punctuation: there's the expected set of . , : ; ? !. (Regarding punctuation, the Clementine Vulgate Project itself states "Punctuation, which varies widely between different editions, has been chosen with readability in mind; the text is divided into paragraphs for the same purpose.") In addition to this, there are less-expected symbols:
- The slash / occurs 68 times, first in Liber Joshua (chapter 10, in the paragraph Habitatores autem Gabaon urbis obsessæ miserunt ad Josue..., between verses 12 and 13) where it seems to be a typo: the scan they use doesn't have verse numbers in the text, but rather the verse number is in the margin, and there is a star in the text. It doesn't seem to be a marker for a cross-reference either, as other cross-references don't have a similar slash. It is then used 64 times in chapter 5 of Liber Judicum, to mark different voices and verses of song. The final three times are in chapters 14 and 15, again as a sort of speech delimiter, quite irregularly though, since it's only in these places. I remove all these slashes.
- From my numberless version: Sol, contra Gabaon ne movearis, et luna contra vallem Ajalon./ Steteruntque sol et luna, donec ulcisceretur se gens de inimicis suis.
- The angle brackets, or technically the less-than and greater-than signs < >, both occur 66 times.
  - They're used 22 times in the Liber Psalmorum, psalm 118 (CXVIII), to mark Hebrew letters: <Aleph>, <Beth>, <Ghimel> and so on; in the scan of the 1598 edition, they appear as subheadings, centered and in all-caps; other digital editions represent the Hebrew letters as just a one-word sentence (Aleph.) (and it turns out there are some spelling differences: Gimel vs Ghimel; I'm not going to make judgements here, and will count whatever the file uses); I will consider these scripture.
  - They're used again, 42 times, in Canticum Canticorum to mark out <Sponsa>, <Chorus Adolescentularum>, <Sponsus>, <Chorus Fratrum>; the scanned edition doesn't indicate these at all, so I consider these metadata.
  - They're used once in Prophetia Baruch, chapter 6, in a 23-word parenthetical "Exemplar epistolæ quam misit Jeremias..." (the Epistle of Jeremiah), that in other editions and translations is a proper verse of its own.
  - The final place they're used is at the starting <Prologus> of In Ecclesiasticum Jesu Filii Sirach prologus, a (what I assume to be non-scriptural) one-paragraph text that I thought I got rid of when I deleted the back matter, but which for some reason is in the middle of the text file – the Wikisource edition seems to link the books circularly. This book was placed after the end of Liber Sapientiæ, at the start of Liber Ecclesiasticus – but when I went to check out the scan I discovered that it indeed is present in the original: there, on page 605, the book In Ecclesiasticum Jesu Filii Sirach Prologus starts with Multorum nobis, et magnorum per legem; however, then the Liber Ecclesiasticus proper starts with Omnis sapientia a Domino Deo est. I don't really know what to make of this; Multorum nobis doesn't have verse numbers, Omnis sapientia does, so I decided to delete the Multorum nobis section, from Multorum nobis et magnorum to ...Domini proposuerint vitam agere. (Someone should probably remove this section in Wikisource.)
- Pairs of round brackets ( ) are used 231 times. The first occurrence is in chapter 6, verse 12 of Liber Genesis (Cumque vidisset Deus terram esse corruptam (omnis quippe caro corruperat viam suam super terram),), the last in chapter 18, verse 12–13 of Apocalypsis (merces auri, et argenti, et lapidis pretiosi, et margaritæ, et byssi, et purpuræ, et serici, et cocci (et omne lignum thyinum, et omnia vasa eboris, et omnia vasa de lapide pretioso, et æramento, et ferro, et marmore, et cinnamomum) et odoramentorum ...). The original books contain these, so I will retain them.
- Pairs of square brackets [ ] are used 53 times, all in Prophetia Ezechielis, to mark speech, much like quotation marks would be used. I remove all of these, since they're used only in Ezechiel.
- I get rid of all the slashes and square brackets with a simple search-and-replace. The angle brackets take four cases: the extra metadata in Canticum Canticorum is removed with the regex /<(Spons(a|us)|Chorus( [AF][a-u]+um)?)>/; the parenthetical at the start of the Epistle of Jeremiah in chapter 6 of Baruch is manually de-angle-bracketed; the extra prologue at the start of Ecclesiasticus is removed; and the angle brackets in psalm 118 are removed from around the Hebrew letters with the search-and-replace regex s/<([A-Z][a-z]+)>/\1. /.
After all this processing, the character set is down to the alphabet (sans W), Æ, Œ, Ë, space, newline, and the punctuation !(),.:;?. File size is 4109928 bytes. Chapters and books are separated by eight blank lines, or nine newline characters; there are no empty lines at the start and end, but the file does contain a final newline.
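The symbol clean-up can be sketched in Python along these lines, reusing the two regular expressions quoted above. The helper names clean_symbols and unexpected_characters are invented for this illustration, and the two manual edits (the Baruch parenthetical and the deleted prologue) are deliberately not reproduced. The speaker labels have to be removed before the Hebrew-letter pattern runs, because a label like <Sponsa> would otherwise match that pattern too.

```python
import re

# The two patterns quoted in the text, in Python syntax.
SPEAKER = re.compile(r"<(Spons(a|us)|Chorus( [AF][a-u]+um)?)>")
HEBREW_LETTER = re.compile(r"<([A-Z][a-z]+)>")

# Expected final character set, compared case-insensitively via str.upper().
ALLOWED = set("ABCDEFGHIJKLMNOPQRSTUVXYZÆŒË \n!(),.:;?")


def clean_symbols(text):
    """Drop slashes, square brackets and speaker labels; unwrap Hebrew letters.

    Hypothetical helper; the manual edits described in the text are not covered.
    """
    for symbol in "/[]":
        text = text.replace(symbol, "")
    text = SPEAKER.sub("", text)               # must run before HEBREW_LETTER
    text = HEBREW_LETTER.sub(r"\1. ", text)    # <Aleph> becomes "Aleph. "
    return text


def unexpected_characters(text):
    """Return the set of characters outside the expected final character set."""
    return {ch for ch in text if ch.upper() not in ALLOWED}


if __name__ == "__main__":
    sample = "<Sponsus>Ecce tu pulchra es\n<Aleph>Beati immaculati in via"
    cleaned = clean_symbols(sample)
    print(cleaned)
    print(unexpected_characters(cleaned))  # an empty set means nothing stray is left
```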

Character counts of this bare version:

Letter	Count
E	388739
I	369731
T	296724
U	270861
S	251728
A	243707
N	210574
R	192641
M	187698
O	176608
C	117199
D	111399
L	89526
P	76914
B	51585
V	48758
Q	46102
G	34300
F	30829
H	27014
Æ	22851
J	18622
X	15244
Y	3551
Ë	3017
Z	1764
Œ	956
K	49
Total	3288691
E+Ë+Æ+Œ	415563
I+J	388353
U+V	319619
A+Æ	266558
O+Œ	177564
Symbol	Count
,	73681
.	30527
:	24036
?	3228
;	3097
!	243
)	231
(	231
Total	135274
Spaces	642639
Newlines	16500

With Ë merged into E, Æ expanded into AE and Œ expanded into OE, we get 415563 E's, 266558 A's and 177564 O's, which increases the total letter count from 3288691 to 3312498: each of the 22851 Æ's and 956 Œ's now counts as two letters instead of one.
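The folding arithmetic can be checked with a short Python sketch. The figures are taken from the table above; the helper names letter_counts and expand are invented for the example, and the text is assumed to be in NFC form so that ë is a single character.

```python
import unicodedata
from collections import Counter

# How each special letter is folded: ligatures become two letters, ë becomes e.
EXPANSIONS = {"Æ": "AE", "Œ": "OE", "Ë": "E"}


def letter_counts(text):
    """Count letters case-insensitively (assumes NFC so ë is one code point)."""
    text = unicodedata.normalize("NFC", text).upper()
    return Counter(ch for ch in text if ch.isalpha())


def expand(counts):
    """Merge Ë into E and split Æ and Œ into their component letters."""
    merged = Counter()
    for letter, n in counts.items():
        for part in EXPANSIONS.get(letter, letter):
            merged[part] += n
    return merged


if __name__ == "__main__":
    # Only the rows of the table that are affected by the folding.
    table = {"E": 388739, "Ë": 3017, "Æ": 22851, "Œ": 956,
             "A": 243707, "O": 176608}
    folded = expand(Counter(table))
    print(folded["E"], folded["A"], folded["O"])
    # Each ligature adds one extra letter to the grand total.
    print(3288691 + table["Æ"] + table["Œ"])
```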

I continued further simplification of the file: all letters into lowercase, all ligatures expanded, all punctuation removed, all repeated spaces collapsed into one. This allows me to produce a word count: 615596 total words, 45501 distinct words.

The most common word is et, with 50969 occurrences (no wonder that it's always written as the ampersand & in the physical book); the second most common is in, with 22937 occurrences; est has 8849, ad has 7695, non has 7363, qui has 6805, ejus has 5190, autem has 5130, de has 4871, ut has 4670, cum has 4183, dominus has 3608. There are 20664 words that only occur a single time; sorted alphabetically, the first five are aasbai, abaddon, abalienabit, abalienavit, abana, and the last five are zethu, zizaniorum, zoheleth, zomzommim, zuzim. Many of these hapax legomena are just inflected forms of otherwise also occurring words: conversabitur, conversare, conversatur, conversentur and converseris all occur only once, but are all inflections of the verb converso, "I turn around".
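A word count of this kind can be sketched in Python as below. This is a minimal illustration of the simplification described above, not the program that produced the figures. The names normalize and word_stats are made up, newlines are assumed to count as word separators, and ë is left untouched because the text doesn't say how it was treated for word counting.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase, expand ligatures, drop punctuation, collapse whitespace.

    Assumption: newlines separate words just like spaces; ë is left as-is.
    """
    text = text.lower().replace("æ", "ae").replace("œ", "oe")
    text = re.sub(r"[!(),.:;?]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def word_stats(text):
    """Return (total words, distinct words, sorted hapax legomena, frequencies)."""
    words = normalize(text).split()
    freq = Counter(words)
    hapax = sorted(word for word, n in freq.items() if n == 1)
    return len(words), len(freq), hapax, freq


if __name__ == "__main__":
    sample = ("In principio creavit Deus cælum et terram. Terra autem erat "
              "inanis et vacua, et tenebræ erant super faciem abyssi : et "
              "spiritus Dei ferebatur super aquas.")
    total, distinct, hapax, freq = word_stats(sample)
    print(total, distinct, len(hapax))
    print(freq.most_common(2))
```

Collapsing whitespace after removing punctuation matters here, because the source text puts spaces on both sides of colons.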

The letter K is rare in Latin, and in this book it only occurs 49 times, in the names Joakim (46 times) and Eliakim (thrice).
