Quantifying the Talmud: A Technical Dive into Chapter Word Counts
Continuation of the previous piece:
In this piece, I describe the procedure I used to determine the word counts of the individual chapters of the Talmud Bavli.
Data Acquisition
The source for this analysis was the Hebrew-language Wikisource website, the principal repository of open-access, transcribed Jewish texts. (On this resource, see my “Guide to Online Resources for Scholarly Jewish Study and Research - 2023”, pp. 10-11; Academia.edu, registration required.)
My script targeted the following Hebrew Wikisource category page for data collection (two pages, with a maximum of 200 hyperlinks each):
קטגוריה:פרק בתלמוד הבבלי – ויקיטקסט (“Category: Chapter in the Babylonian Talmud” – Wikisource)
Screenshot of this category page:
The above page lists hyperlinks to all chapters of the Talmud Bavli. My script fetched the full 200 URLs from this list.
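A minimal sketch of this acquisition step, assuming the plain first-page URL of the category (the full script in the appendix reaches the second page of the same category via the pagefrom query parameter):

import requests
from bs4 import BeautifulSoup

# First page of the category listing (URL decoded from the one used in the appendix).
category_url = "https://he.wikisource.org/wiki/קטגוריה:פרק_בתלמוד_הבבלי"

response = requests.get(category_url)
soup = BeautifulSoup(response.content, "html.parser")

# MediaWiki places a category's member links inside the div with id="mw-pages".
links = soup.find("div", {"id": "mw-pages"}).find_all("a")
chapter_urls = [("https://he.wikisource.org" + a["href"], a.text) for a in links[:200]]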
Data Extraction and Parsing
I then used the Beautiful Soup library to parse the HTML content of the fetched webpages. For each URL:
The main textual content was identified and extracted from the <div> element with the attribute id="mw-content-text".
Paragraphs were then filtered: headers and the opening Mishnah were skipped, and collection began only at the term "גמ'", the marker that introduces the Gemara.
Using a regex, parenthetical content was removed so that citations of Biblical verses would not incorrectly add to the word count (see the sketch below).
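A condensed sketch of this parsing step; it mirrors the appendix but omits the extra header-exclusion list used there, and assumes soup already holds the parsed HTML of a single chapter page:

import re

# "soup" is assumed to be the parsed HTML of one chapter page,
# obtained as in the acquisition sketch above.
main_text_div = soup.find("div", {"id": "mw-content-text"})

filtered_paragraphs = []
start_collecting = False
for p in main_text_div.find_all("p"):
    if "גמ'" in p.text:
        # The Gemara begins here; drop the Mishnah text that precedes the marker.
        start_collecting = True
        text = p.text.split("גמ'", 1)[-1]
    elif start_collecting:
        text = p.text
    else:
        # Still before the first "גמ'" (headers, the opening Mishnah, etc.).
        continue
    # Strip parenthetical content so verse citations do not inflate the count.
    filtered_paragraphs.append(re.sub(r"\(.*?\)", "", text))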
Word Count Computation
With the main textual content extracted and refined, the word count was determined by splitting the text on whitespace characters and counting the resulting tokens.
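In code this is a one-liner; str.split() with no arguments splits on any run of whitespace and never yields empty strings, so it is equivalent to the slightly longer generator expression used in the appendix (filtered_paragraphs is the list built in the sketch above):

main_text = "\n".join(filtered_paragraphs)
word_count = len(main_text.split())  # number of whitespace-separated tokens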
Data Compilation
The results (link text, URL, and word count) were compiled into a pandas DataFrame and then converted to Markdown, facilitating further manipulation (such as sorting) and presentation in Google Sheets.
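A sketch of that step: the appendix builds a pipe-delimited table with a small helper, but pandas can also emit Markdown directly via DataFrame.to_markdown(), which relies on the optional tabulate package. Here results stands for the list of (link text, URL, word count) tuples built in the loop:

import pandas as pd

# "results" is assumed to be the list of tuples accumulated while iterating over the chapters.
df = pd.DataFrame(results, columns=["Link Text", "URL", "Word Count"])
print(df.to_markdown(index=False))  # paste-ready Markdown table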
Results
The script successfully calculated the word count for each of the first 200 chapters listed at the aforementioned URL. Each entry in the resulting dataset consists of the chapter's title, its URL, and its word count. The script was then run a second time, on the second page of the category, and the results were appended to the same table.
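If the two runs are kept as separate DataFrames rather than combined by hand, pandas can append them before export; df_page1 and df_page2 below are hypothetical names for the outputs of the two runs, not variables in the published script:

import pandas as pd

# df_page1 and df_page2 are hypothetical stand-ins for the DataFrames produced
# by the two runs; the published script simply prints each table in turn.
combined = pd.concat([df_page1, df_page2], ignore_index=True)
print(combined.to_markdown(index=False))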
Appendix - the Complete Script
Automatically generated by Google Colaboratory
import pandas as pd
import re
import requests
from bs4 import BeautifulSoup
# Function to convert a DataFrame to markdown format
def df_to_markdown(df):
    fmt = ['---' for _ in range(len(df.columns))]
    df_fmt = pd.DataFrame([fmt], columns=df.columns)
    return pd.concat([df_fmt, df]).to_csv(sep="|", index=False)
# Fetch the list of URLs
list_url = "https://he.wikisource.org/w/index.php?title=%D7%A7%D7%98%D7%92%D7%95%D7%A8%D7%99%D7%94:%D7%A4%D7%A8%D7%A7_%D7%91%D7%AA%D7%9C%D7%9E%D7%95%D7%93_%D7%94%D7%91%D7%91%D7%9C%D7%99&pagefrom=%D7%91%D7%91%D7%9C%D7%99+%D7%A1%D7%95%D7%98%D7%94+%D7%A4%D7%A8%D7%A7+%D7%94#mw-pages"
response = requests.get(list_url)
soup = BeautifulSoup(response.content, 'html.parser')
main_content_div = soup.find("div", {"id": "mw-pages"})
links = main_content_div.find_all("a")
# Limit to the first 200 URLs (the maximum shown on one category page)
urls = [("https://he.wikisource.org" + link['href'], link.text) for link in links[:200]]
# Iterate over each URL to fetch the word count
results = []
for url, link_text in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    main_text_div = soup.find("div", {"id": "mw-content-text"})
    paragraphs = main_text_div.find_all("p")

    # Filter out paragraphs and start output after finding "גמ'"
    excluded_starts = ["מתני'", "מתוך:"]
    start_collecting = False
    filtered_paragraphs = []
    for p in paragraphs:
        if "גמ'" in p.text:
            start_collecting = True
            text_after_gimel = p.text.split("גמ'", 1)[-1]
            text_without_parentheses = re.sub(r'\(.*?\)', '', text_after_gimel)
            filtered_paragraphs.append(text_without_parentheses)
            continue
        if start_collecting and not any(p.text.startswith(exclude) for exclude in excluded_starts):
            text_without_parentheses = re.sub(r'\(.*?\)', '', p.text)
            filtered_paragraphs.append(text_without_parentheses)
    main_text = "\n".join(filtered_paragraphs)
    word_count = sum(1 for word in main_text.split() if word)
    results.append((link_text, url, f"{word_count:,}"))
# Convert to DataFrame and output in markdown format
df = pd.DataFrame(results, columns=['Link Text', 'URL', 'Word Count'])
print(df_to_markdown(df))