Quantifying the Jewish Canon: Computational Analyses of Classical Hebrew Texts

May 23, 2025

The corpus of classical Jewish literature represents one of humanity's most extensive and sustained textual traditions, spanning millennia of continuous intellectual engagement. From the biblical canon to the medieval legal codes, these texts have traditionally been measured by their spiritual and intellectual weight rather than their physical dimensions.

In our age of digital humanities and computational text analysis, understanding the quantitative aspects of these works can provide new insights into their composition, development, and relative significance within the tradition.

This piece provides an overview and synthesis of my research on word counts across the Jewish textual canon, drawing on computational analyses I've conducted over the past two years.1 My work represents an attempt to quantify these texts using modern computational methods.

The quantification of sacred texts may initially seem reductive, but it opens new windows into understanding the Jewish textual tradition. Word counts provide objective metrics for comparing textual corpora, help scholars contextualize the relative size and scope of different works, and offer insights into classic Jewish literature that were not previously available.

Outline

Methodology and Data Sources
Word Counts
1. The Jewish Textual Universe: Macro-Level Findings
2. The Material Context of Textual Length: Scrolls and Word Counts of Biblical Books
3. The Mishnah-Tosefta Relationship: Quantitative Insights
4. Character Counts in Individual Mishnaic Sections (=mishnayot)
5. Analyzing the Talmud Bavli: Micro-Level Insights
6. Word Counts and Authority
7. Talmud Page (=amud) Density and Textual Analysis
Implications for Jewish Scholarship and Digital Humanities
1. For Traditional Scholarship
2. For Digital Humanities
Conclusions and Future Directions
References

Methodology and Data Sources

My research methodology demonstrates the potential of digital humanities approaches to traditional religious texts. My primary data sources include the following major digital repositories of Jewish texts:

Academy of the Hebrew Language's Ma'agarim database
Bar Ilan Responsa Project (BIRP)
Sefaria
Hebrew Wikisource

For the Talmud Bavli chapters specifically, I developed a Python script to analyze transcriptions in Hebrew Wikisource, carefully filtering the content to count only the relevant Talmud text.

The methodological challenges I encountered are instructive for similar computational research. The BIRP word counts are somewhat inflated, because they include tags, labels, and auxiliary texts such as the Mishnah within Talmudic tractates or the glosses of Ra'avad within Mishneh Torah.

By contrast, my custom script for analyzing Talmudic chapters specifically excluded:

Mishnaic text (marked by "מתני'" in the source)
Wikisource's citation markers (starting with "מתוך:")
Text within parentheses (primarily biblical citations)
Wikisource's daf (folio) citations

This filtering ensures that the word counts represent only the Talmudic text itself, rather than the embedded Mishnah, biblical citations, or editorial apparatus. Such precision is essential for meaningful quantitative analysis of historical texts.
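
The exclusions listed above can be sketched in a few lines of Python. This is a minimal illustration and not the script used for the counts reported here. The function name `count_gemara_words`, the assumption that a Mishnah passage runs from "מתני'" to the next "גמ'" marker, and the folio-marker pattern are all assumptions made for the example; the actual Wikisource markup may differ.

```python
import re

MISHNAH_MARK = "מתני'"      # marks the start of Mishnah text (per the source)
GEMARA_MARK = "גמ'"         # ASSUMPTION: marks where the Talmud text resumes
CITATION_PREFIX = "מתוך:"   # Wikisource citation lines start with this

# ASSUMPTION: folio citations look like "דף ב עמוד א"; the real format may differ.
DAF_PATTERN = re.compile(r"(?<!\S)דף \S+ עמוד [אב](?!\S)")
PARENS_PATTERN = re.compile(r"\([^()]*\)")
MISHNAH_SPAN = re.compile(
    re.escape(MISHNAH_MARK) + r".*?(?:" + re.escape(GEMARA_MARK) + r"|\Z)",
    flags=re.DOTALL,
)


def count_gemara_words(raw: str) -> int:
    """Count whitespace-separated words after removing the excluded material."""
    # 1. Drop whole lines that are Wikisource citation markers.
    lines = [ln for ln in raw.splitlines()
             if not ln.strip().startswith(CITATION_PREFIX)]
    text = " ".join(lines)
    # 2. Drop Mishnah passages (from the Mishnah marker to the Gemara marker).
    text = MISHNAH_SPAN.sub(" ", text)
    # 3. Drop parenthesized text (primarily biblical citations).
    text = PARENS_PATTERN.sub(" ", text)
    # 4. Drop folio citations.
    text = DAF_PATTERN.sub(" ", text)
    return len(text.split())
```

Counting on whitespace is the simplest possible definition of a "word"; abbreviations and hyphenated forms would need an explicit policy in a real pipeline.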

The Jewish Textual Universe: Macro-Level Findings

At the highest level, my research provides a comprehensive overview of the relative sizes of major Jewish textual corpora.

Several patterns immediately emerge from this data. Most strikingly, the Babylonian Talmud (Talmud Bavli) dwarfs all other works.

Also notable is the similar length of the three major Jewish legal codes—Mishneh Torah, Tur, and Shulchan Aruch—each containing roughly 700,000-900,000 words. (This consistency may suggest an optimal size for comprehensive legal codes, balancing the need for thoroughness with practical usability.)

The Jerusalem Talmud (Talmud Yerushalmi), while substantial at approximately 815,000 words, is less than half the size of its Babylonian counterpart.

The Hebrew Bible's word count (approximately 307,000 words) is modest compared to later rabbinic works. This is particularly striking, because it highlights how the relatively concise biblical text generated far larger bodies of commentary and legal derivation.

The Mishnah, at roughly 192,000 words, is even more compact, yet it formed the foundation for the massive Talmudic projects that followed.

The Material Context of Textual Length: Scrolls and Word Counts of Biblical Books

My quantitative analysis of Jewish texts also extends to examining the relationship between physical media and textual length in the ancient world. Unlike later Jewish works (from the medieval period onward), which developed after the codex (bound book) became the default medium, the biblical (and Talmudic) texts were first written on scrolls. The material constraints of the scroll, as opposed to the codex, significantly influenced the length and structure of biblical books.

The books of the Torah (Pentateuch) average around 16,000 words each, ranging from approximately 12,000 words in Leviticus to 20,600 words in Genesis. For comparison, this length would translate to roughly 40 pages in a modern academic text format. The size of these books appears calibrated to what could reasonably fit on a standard scroll while remaining physically manageable for regular use in liturgical and study contexts.

Similarly, the books of the Prophets (Nevi'im) range between 9,900 words (Judges) and 22,000 words (Jeremiah). When books became too lengthy for a single scroll, they were divided into parts. This explains the division of Samuel, Kings, and Chronicles into two books each, with their sections averaging 12,100-12,800 words. This practice of dividing texts according to physical constraints rather than purely content-based considerations highlights how material technologies shaped the fundamental structure of these sacred texts.

The Twelve Minor Prophets provide a particularly illuminating example of this phenomenon. These twelve short works were combined into a single scroll (totaling 14,400 words) to ensure their preservation, as suggested by the Talmud. Meanwhile, the five Megillot (scrolls) each remain quite concise, ranging from 1,300 words (Ruth) to 3,100 words (Esther), reflecting their use in specific festival contexts where shorter texts would be more practical for public readings.

The Mishnah-Tosefta Relationship: Quantitative Insights

In a related study, I explored the quantitative relationship between the Mishnah and its companion work, the Tosefta. While scholars have long known that the Tosefta generally “expands” (in some sense, whether synchronically or diachronically) on the Mishnah's content, my word count analysis provides precise measurements of this expansion.

The Mishnah contains 187,875 words across all its chapters, while the Tosefta encompasses 294,241 words, approximately 1.57 times as many. On average, a Mishnah chapter contains 362 words, compared to 699 words for a Tosefta chapter. This nearly 2:1 per-chapter ratio quantifies the extent to which the Tosefta elaborates on Mishnaic material.
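
The two ratios in this paragraph measure different things, and a short calculation makes the difference explicit. The sketch below uses only the figures quoted above; the variable names are illustrative, and the implied chapter counts are approximations because the per-chapter averages are rounded.

```python
# Figures quoted in the text.
mishnah_words = 187_875
tosefta_words = 294_241
mishnah_avg_per_chapter = 362
tosefta_avg_per_chapter = 699

# Ratio of total corpus size (Tosefta : Mishnah).
total_ratio = tosefta_words / mishnah_words

# Ratio of average chapter length (Tosefta : Mishnah).
chapter_ratio = tosefta_avg_per_chapter / mishnah_avg_per_chapter

# Chapter counts implied by total / average (approximate, averages are rounded).
implied_mishnah_chapters = mishnah_words / mishnah_avg_per_chapter
implied_tosefta_chapters = tosefta_words / tosefta_avg_per_chapter

print(f"total ratio:       {total_ratio:.2f}")
print(f"per-chapter ratio: {chapter_ratio:.2f}")
print(f"implied chapters:  Mishnah ~{implied_mishnah_chapters:.0f}, "
      f"Tosefta ~{implied_tosefta_chapters:.0f}")
```

The per-chapter ratio exceeds the total ratio because, on these figures, the Tosefta's words are spread over fewer chapters than the Mishnah's.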

The distribution of chapter lengths also reveals interesting patterns. Mishnah chapters cluster around 300-400 words, with a relatively tight distribution, suggesting a certain standardization in chapter length. In contrast, Tosefta chapters show a much broader distribution, peaking around 500-600 words but extending all the way up to 1,600 words. This greater variability suggests that the Tosefta's elaboration is not uniform across all topics but focuses more extensively on certain areas.

The longest chapter in the Mishnah is Sotah 9 with 873 words, while the Tosefta's longest chapter is Yoma 2 with 1,571 words. At the other extreme, the shortest chapters are Shabbat 4 in the Mishnah (111 words) and Meilah 3 in the Tosefta (104 words). These outliers invite further investigation into why these particular chapters received such divergent treatment in terms of length.

These quantitative findings complement traditional scholarly approaches by providing objective metrics for comparing texts and understanding their structural relationships. They also raise intriguing questions about the editorial processes behind these works and the relative importance assigned to different topics within the rabbinic tradition.

Character Counts in Individual Mishnaic Sections (=mishnayot)

My quantitative research extends beyond aggregate word counts to the micro-level analysis of individual textual units. In another study, I counted the number of characters in each of the approximately 4,100 individual mishnayot (singular sections of the Mishnah), revealing notable patterns about the composition and structure of this foundational text.

This granular analysis identified the longest individual mishnah as Sotah 9:15, containing 1,242 characters, followed closely by Yadayim 4:3 with 1,195 characters. These exceptionally lengthy sections fall into two distinct categories: those containing extended aggadic (narrative or homiletical) material, and those featuring extensive back-and-forth halachic discussion—a format more commonly associated with the later Talmudic literature than with the typically concise Mishnah.

The second-longest individual mishnah (Yadayim 4:3) presents a particularly illuminating case study. It records a relatively complex legal debate regarding agricultural tithes from Ammon and Moab during the sabbatical (seventh) year. The debate follows a relatively sophisticated logical structure, beginning with initial positions, progressing through arguments about the burden of proof, geographical comparisons, pragmatic versus spiritual considerations, and methodological principles about which precedents should apply. The section culminates with a formal vote followed by validation through an appeal to tradition dating back to Moses at Sinai.

This micro-level quantitative analysis reveals important aspects of the Mishnah's composition that might otherwise remain obscure. While the vast majority of mishnayot are quite brief (most containing fewer than 400 characters), these exceptional outliers demonstrate the text's occasional departure from its characteristic brevity, suggesting that the editors were sometimes willing to deviate from the otherwise consistent format.
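
A per-unit character count of this kind can be sketched as follows. This is an illustration under stated assumptions, not the procedure behind the figures above: it counts only Hebrew letters (Unicode U+05D0 to U+05EA) and ignores spaces, punctuation, and vowel points, whereas the original counts may define "character" differently. The function names and the placeholder references are hypothetical.

```python
def count_hebrew_letters(text: str) -> int:
    """Count Hebrew letters only, skipping spaces, punctuation, and niqqud."""
    return sum(1 for ch in text if "\u05d0" <= ch <= "\u05ea")


def longest_units(units: dict[str, str], n: int = 2) -> list[tuple[str, int]]:
    """Return the n longest units as (reference, letter count), longest first."""
    counts = {ref: count_hebrew_letters(body) for ref, body in units.items()}
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]


# Placeholder strings, NOT actual Mishnah text.
toy_units = {
    "Example 1:1": "שלום עולם",
    "Example 1:2": "אב",
    "Example 1:3": "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd",  # vocalized word
}
print(longest_units(toy_units))
```

Whether to count spaces and vowel points is a real methodological choice; whichever convention is used should be applied uniformly across all units so that the rankings are comparable.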

Analyzing the Talmud Bavli: Micro-Level Insights

My analysis of individual chapters within the Talmud Bavli provides particularly valuable insights into the internal composition of this monumental work. My examination of 304 chapters (out of 308 total) reveals:

Median word count per chapter: 4,575
Average word count per chapter: 5,503
Cumulative word count: 1,672,827
Median number of folios per chapter: 7.5
Average number of folios per chapter: 9
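
These summary figures can be cross-checked against one another. The short sketch below recomputes the average from the cumulative count and measures how far the mean sits above the median; it uses only the numbers listed above, and the variable names are illustrative.

```python
# Figures listed above.
total_words = 1_672_827
chapters_examined = 304
median_words = 4_575
median_folios = 7.5
mean_folios = 9

# The average should equal the cumulative count divided by the chapter count.
mean_words = total_words / chapters_examined

# How far the mean exceeds the median, as a fraction of the median.
word_excess = mean_words / median_words - 1
folio_excess = mean_folios / median_folios - 1

print(f"mean words per chapter: {mean_words:.0f}")
print(f"mean exceeds median by {word_excess:.1%} (words), "
      f"{folio_excess:.1%} (folios)")
```

A mean that sits above the median is the usual signature of a right-skewed distribution, in which a minority of very long chapters raises the average.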

The gap between the median and the average indicates that the distribution is right-skewed, with some exceptionally large chapters pulling the average upward. The largest chapters by word count are:

Kiddushin Chapter 1 (האשה נקנית) - 27,694 words
Sanhedrin Chapter 11 (חלק) - 27,151 words
Chullin Chapter 3 (אלו טרפות) - 16,714 words

(Sanhedrin Chapter 11 [traditionally referred to as "Chelek"] contains the Talmud's most extensive discussions of theological matters, including principles of faith, the messianic era, and the afterlife.)

Presumably, as with the Biblical books mentioned earlier, the length of Talmudic chapters was constrained by the physical limits of a scroll. (See Shamma Friedman on the subject of Talmud chapters and their formatting in scrolls.)

At the tractate level, the five largest are:

Shabbat - 112,673 words
Sanhedrin - 100,182 words
Chullin - 88,190 words
Bava Kamma - 83,834 words
Bava Batra - 83,490 words

These tractates address central areas of Jewish law: Sabbath observance, legal procedure and criminal law, dietary laws, civil damages, and property law (respectively).

Conversely, the smallest tractates (excluding Tamid, which is incomplete in the Babylonian Talmud) are:

Meilah - 6,581 words
Horayot - 11,380 words
Makkot - 15,034 words

These tractates address more specialized topics, such as misuse of consecrated property (Meilah) and incorrect rulings by the Sanhedrin (Horayot).

Word Counts and Authority

There is an intriguing inverse relationship between a text's authority and its word count within the Jewish tradition. The Hebrew Bible, as the foundational divine text, is relatively concise at around 307,000 words. The Mishnah, as the first major rabbinic work, is even more compact at approximately 192,000 words. Later works grow far larger as they interpret and elaborate on these foundational texts.

Talmud Page (=amud) Density and Textual Analysis

While total word counts provide valuable macro-level insights into the Jewish textual tradition, my research has also uncovered fascinating patterns at more granular levels. In a recent extension of this work, I examined the word density of individual pages (amudim) within the Talmud Bavli, seeking to identify the "densest daf" - the page containing the most text in the standard printed format that has defined Talmudic pagination for the past 500 years.

This computational analysis revealed that Berakhot 32a contains 883 Hebrew words, making it the wordiest page in the entire Talmud. This is followed closely by other pages in Tractate Berakhot: 7a (858 words), 10a (856 words), and 58a (853 words). In fact, six of the top ten wordiest pages come from this tractate, revealing a striking concentration of text-dense material within Berakhot.

What makes this finding particularly significant is its heuristic potential. Word density serves as a surprisingly effective proxy for identifying aggadic (narrative and homiletical) material within the Talmud. Aggadic pages consistently show higher-than-average word counts due to the relative absence of Tosafot commentary, while halakhic (legal) discussions typically have lower word counts per page, as they are accompanied by more extensive commentaries.

This correlation provides a computational method for distinguishing between different types of Talmudic content without requiring sophisticated natural language processing techniques.

At the other end of the spectrum, the pages with the lowest word counts include Bava Kamma 77a (9 words), Yoma 56a (12 words), and Zevachim 71a (17 words). These extremely sparse pages typically feature extensive commentaries that occupy most of the available space on the page, leaving little room for the primary text.
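
The density heuristic described in this section can be sketched as a simple threshold filter. The page counts below are the ones quoted in the text. The function name `flag_dense_pages` and the cutoff of 700 words are assumptions chosen for illustration, not values drawn from the research, and a flagged page is only a candidate for aggadic content, not a confirmed classification.

```python
# Word counts quoted in the text (a small subset of all pages).
page_word_counts = {
    "Berakhot 32a": 883,
    "Berakhot 7a": 858,
    "Berakhot 10a": 856,
    "Berakhot 58a": 853,
    "Zevachim 71a": 17,
    "Yoma 56a": 12,
    "Bava Kamma 77a": 9,
}


def flag_dense_pages(counts: dict[str, int], threshold: int) -> list[str]:
    """Return pages whose word count exceeds the threshold, densest first."""
    dense = [page for page, words in counts.items() if words > threshold]
    return sorted(dense, key=lambda page: counts[page], reverse=True)


# ASSUMPTION: 700 is an arbitrary illustrative cutoff.
candidates = flag_dense_pages(page_word_counts, threshold=700)
print(candidates)
```

In practice the cutoff would need to be calibrated against pages whose genre is already known, since a fixed number of words will not separate aggadic from halakhic material equally well in every tractate.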

Implications for Jewish Scholarship and Digital Humanities

My research has several important implications for both traditional Jewish scholarship and the emerging field of digital humanities:

For Traditional Scholarship

Curriculum Planning: Understanding the relative size of different texts can help educators plan realistic curricula and study schedules. The significant variation in chapter sizes within the Talmud, for instance, has direct implications for allocating study time.
Historical Insights: The quantitative dimensions of texts provide clues about their historical development and the priorities of their redactors. The extensive treatment of certain topics suggests their centrality to the rabbinic project.
Comparative Analysis: Word counts enable more precise comparisons between different texts and traditions. For example, the similar lengths of medieval legal codes (Mishneh Torah, Tur, and Shulchan Aruch) suggest a practical upper limit for comprehensive halakhic works.

For Digital Humanities

Methodology Development: My work demonstrates how computational methods can be applied to traditional religious texts while respecting their unique features and structures.
Database Refinement: The discrepancies between different databases (BIRP vs. Ma'agarim) highlight the importance of careful metadata design and transparent counting methodologies.
New Research Questions: Quantitative analysis raises new questions about textual patterns that might not emerge from traditional close reading approaches. For example, do word counts correlate with the frequency of certain legal or rhetorical strategies?

Conclusions and Future Directions

This research on word counts in classical Jewish texts fills a surprising gap in our knowledge about these foundational works. By providing concrete quantitative data, this work enables more precise discussions about the relative size and scope of different texts within the tradition.

This research also demonstrates the value of digital humanities approaches to religious texts. While quantitative measures cannot replace qualitative analysis, they complement traditional scholarly methods by revealing patterns and relationships that might otherwise remain obscure.

Several promising avenues for future research emerge from this work:

Chronological Analysis: Examining how word counts and text lengths change over time could illuminate the evolution of Jewish literary forms and genres.
Content Analysis: Combining word counts with analysis of content types (legal, narrative, exegetical) might reveal patterns in how different genres are distributed across texts.
Pedagogical Applications: Developing realistic study schedules and curricula based on quantitative measures of text length rather than traditional units like chapters or folios.
Extended Corpora: Applying similar methodologies to later commentarial traditions would provide a more complete picture of the Jewish textual ecosystem.
Comparative Religious Studies: Comparing word counts across different religious traditions could yield insights into varying approaches to religious textuality.

This research reminds us that even the most traditional texts can benefit from new methodological approaches. By quantifying the Jewish canon, this work opens new windows into understanding one of humanity's most extensive and enduring textual traditions.

References

See the bibliography at the end of this piece.
