How To Download Sefaria's Entire Talmud At Once
As part of a project to analyze the Talmud using natural language processing (specifically, the Natural Language Toolkit library, or NLTK), I wanted to extract the Steinsaltz translation of the Talmud, available on Sefaria.[1]
Sefaria allows downloading individual tractates as a text or CSV file, directly from their website.
However, as far as I can tell, there’s no way to download the entire Talmud at once.
The fastest way to get everything is through their GitHub. But even there, you can only download one tractate at a time.
The full text of the tractates is found at Sefaria’s GitHub.
For example, here’s the Steinsaltz translation for tractate Yevamot.
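As a minimal sketch, the raw-file URL for a single tractate can be built from the same path pattern the appendix script uses. The helper name `tractate_url` and its `version` parameter are my own illustrative choices, not part of any Sefaria API, and the repository layout is assumed to match what the appendix script expects.

```python
from urllib.parse import quote

# Base URL taken from the appendix script
BASE_URL = "https://raw.githubusercontent.com/Sefaria/Sefaria-Export/master/txt/Talmud/Bavli"

def tractate_url(seder, tractate, version="William Davidson Edition - English"):
    """Build the raw GitHub URL for one tractate (hypothetical helper)."""
    # Assumed layout: <seder>/<tractate>/English/<version>.txt
    path = f"{seder}/{tractate}/English/{version}.txt"
    # quote() percent-encodes spaces as %20 and leaves "/" and "-" untouched
    return f"{BASE_URL}/{quote(path)}"

print(tractate_url("Seder Nashim", "Yevamot"))
```

Using `quote` on the path only, rather than replacing spaces across the whole URL, keeps the scheme and host untouched and also handles any other characters that need encoding.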
See a discussion about this on Stack Exchange’s Judaism site, from a year ago.
I wrote code to iterate through all the tractates and extract the Steinsaltz translation and commentary for each one. After extracting, I used additional code (specifically, a regex) to keep only the parts in bold; that is, only the actual translation, excluding the commentary.[2]
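The bold-only filtering step might look something like the sketch below. It assumes the exported text marks the literal translation with HTML `<b>...</b>` tags, which is worth checking against the downloaded files. The function name `extract_bold` and the sample string are made up for illustration, and this is not necessarily the regex used for the project.

```python
import re

# Non-greedy match of everything between <b> and </b>, across line breaks
BOLD_RE = re.compile(r"<b>(.*?)</b>", re.DOTALL)
# Any leftover tag inside a bold span, e.g. <i>...</i>
TAG_RE = re.compile(r"<[^>]+>")

def extract_bold(text):
    """Return only the bolded segments of text, joined by single spaces."""
    pieces = BOLD_RE.findall(text)
    cleaned = (TAG_RE.sub("", piece).strip() for piece in pieces)
    return " ".join(piece for piece in cleaned if piece)

# Hypothetical sample in the assumed format: bold = translation, plain = gloss
sample = "<b>The Sages taught:</b> This is an explanatory gloss. <b>One who <i>reads</i> aloud</b> in public."
print(extract_bold(sample))
```

A non-greedy pattern like this does not handle nested `<b>` tags, so a proper HTML parser would be safer if the markup turns out to be less regular than assumed.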
The code for pulling the text from Sefaria is in the appendix.
Appendix - Full Python script
import requests
from google.colab import files  # Colab-only helper; this script is meant to run in Google Colab

# Base URL for raw content from GitHub
base_url = "https://raw.githubusercontent.com/Sefaria/Sefaria-Export/master/txt/Talmud/Bavli"

# Define the seders and their tractates
seders = {
    "Seder Kodashim": ["Arakhin", "Bekhorot", "Chullin", "Keritot", "Meilah", "Menachot", "Tamid", "Temurah", "Zevachim"],
    "Seder Moed": ["Beitzah", "Chagigah", "Eruvin", "Megillah", "Moed Katan", "Pesachim", "Rosh Hashanah", "Shabbat", "Sukkah", "Taanit", "Yoma"],
    "Seder Nashim": ["Gittin", "Ketubot", "Kiddushin", "Nazir", "Nedarim", "Sotah", "Yevamot"],
    "Seder Nezikin": ["Avodah Zarah", "Bava Batra", "Bava Kamma", "Bava Metzia", "Horayot", "Makkot", "Sanhedrin", "Shevuot"],
    "Seder Tahorot": ["Niddah"],
    "Seder Zeraim": ["Berakhot"],
}

# Filename for the master text file
master_filename = "/content/master_text_file.txt"

# Open the master file once in write mode, which also empties any previous copy
with open(master_filename, "w", encoding="utf-8") as master_file:
    # Iterate through each seder and tractate
    for seder, tractates in seders.items():
        for tractate in tractates:
            # Construct the URL for the tractate's English text (spaces become %20)
            url = f"{base_url}/{seder}/{tractate}/English/William Davidson Edition - English.txt".replace(" ", "%20")
            # Fetch the tractate text, with a timeout so a stalled request cannot hang the run
            response = requests.get(url, timeout=60)
            if response.status_code == 200:
                # Append the text to the master file under a tractate header
                master_file.write(f"---{tractate}---\n{response.text}\n\n")
            else:
                print(f"Failed to fetch {tractate} from {seder}. Status Code: {response.status_code}")

# Download the master file to your local machine
files.download(master_filename)
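Because the script writes a `---Tractate---` header before each tractate, the master file can later be split back into per-tractate texts. The following is a rough sketch of one way to do that; `split_master` is a hypothetical helper, and it assumes no line in the text itself has the form `---something---`.

```python
import re

# A header line as written by the script above, e.g. "---Berakhot---"
HEADER_RE = re.compile(r"^---(.+?)---$", re.MULTILINE)

def split_master(master_text):
    """Map each tractate name to its text, based on the ---name--- headers."""
    # With one capture group, split() returns [preamble, name1, body1, name2, body2, ...]
    parts = HEADER_RE.split(master_text)
    names = parts[1::2]
    bodies = parts[2::2]
    return {name: body.strip() for name, body in zip(names, bodies)}

# Tiny made-up example in the same layout as the master file
demo = "---Berakhot---\nfirst text\n\n---Moed Katan---\nsecond text\n\n"
print(split_master(demo))
```

This makes it possible to compute statistics per tractate instead of only for the Talmud as a whole.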
[1] The Steinsaltz translation of the Talmud has previously been used for quantitative research; see Joshua Waxman, “A graph database of scholastic relationships in the Babylonian Talmud”, Digital Scholarship in the Humanities, Volume 36, Issue Supplement_2, October 2021, Pages ii277–ii289.
He explains there why he used the translation, instead of the original Hebrew (the hyperlink is mine):
The challenges of Named Entity Recognition on the Hebrew script text, and the availability of this parallel resource led us to an approach in which we perform Named Entity Recognition first on the English text, and then project those named entities onto the Hebrew script text. It is more straightforward to identify named entities in the English text based on capitalization.
[2] See Waxman, ibid.: “For each Hebrew or Aramaic paragraph, they [=Steinsaltz] provide a somewhat wordy translation, with the literal translation in bold and the gloss text nonbolded.”
The final word count was 3,161,693. Since the original Hebrew of the Talmud is ~1.8 million words, this gives a ratio of about 1.75 words of translation for each word of the original, or very roughly 7:4.
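The arithmetic behind the ratio can be checked in a few lines. The two figures are the ones quoted above; counting words by splitting on whitespace is an assumption on my part, and the original count may have used a different tokenizer (for example, one from NLTK).

```python
def count_words(text):
    """Count whitespace-separated tokens (a simple stand-in for a real tokenizer)."""
    return len(text.split())

translation_words = 3_161_693   # word count reported above
original_words = 1_800_000      # approximate size of the original text

ratio = translation_words / original_words
print(f"Translation-to-original ratio: {ratio:.2f}")
```

Since the figure for the original is itself approximate, the result is best read as an order-of-magnitude comparison rather than a precise measurement.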