Scripting the Talmud: Automated Talmudic Text Extraction and Formatting
Programmatically extracting and formatting Talmudic text from Sefaria using Google Apps Script
Extracting and formatting text from online resources is a repetitive task that lends itself well to automation. In this post, we describe how we automated one such task: extracting Hebrew text from an online Judaic resource, Sefaria.org, and formatting it in a Google Doc using Google Apps Script.
We wanted our Talmudic daf to follow a style similar to that of Prof. Menachem Katz’s Machberot Menachmiyot (Katz, p. 3):
Instead of numbering lines serially for the whole page, we wanted to retain Sefaria’s section numbers (there are generally between 10 and 15 sections per page of the Talmud).
This was our final Talmud daf, from today’s Daf Yomi (https://www.sefaria.org.il/Gittin.7a)[1]:
How I did it:
Sefaria provides an API that enables us to programmatically access its vast content repository. We leveraged this API to extract this specific piece of Hebrew text from the Talmud.
The task was split into two primary operations:
Fetching the desired text via the Sefaria API, and
Inserting this text into an existing Google document.
However, the task had some specific requirements. The text had to be extracted without nikud (diacritical signs used in Hebrew text), split after each period, and labeled with its section numbers.
Fetching the Text:
To fetch the text, we employed the UrlFetchApp.fetch() method provided by Google Apps Script, pointing it towards the appropriate Sefaria API endpoint. The response, in JSON format, contained various details about the text, including its Hebrew version.
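The parsing half of this step can be sketched on its own, as below. The helper name extractHebrewSections and the sample payload are illustrative assumptions; only the 'he' field comes from the script in this post. In Apps Script, the input string would be the result of UrlFetchApp.fetch(url).getContentText().

```javascript
// Hypothetical helper: pull the Hebrew sections out of a Sefaria API response body.
function extractHebrewSections(jsonText) {
  var data = JSON.parse(jsonText);
  // Assumption: 'he' is an array of strings, one per section,
  // which is how the script in this post treats it.
  var he = data['he'];
  return Array.isArray(he) ? he : [];
}

// Illustrative payload only, not real Sefaria output.
var sampleBody = JSON.stringify({ ref: 'Gittin 7a', he: ['first section', 'second section'] });
var sections = extractHebrewSections(sampleBody);
```

Returning an empty array when 'he' is missing lets the later loop run zero times instead of throwing.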
Formatting and Inserting the Text:
To insert the text into the Google document, we used the DocumentApp service of Google Apps Script. This service allows scripts to create, access, and modify Google Docs files. We opened the desired document by its ID and fetched its body.
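As a rough sketch of the insertion step, the helper below appends one paragraph per line. The name appendLines and the stand-in fakeBody object are assumptions made for illustration; in Apps Script, the body argument would be the object returned by DocumentApp.openById(id).getBody().

```javascript
// Hypothetical helper: append each line as its own paragraph.
// 'body' is anything with an appendParagraph(text) method, such as a Google Doc body.
function appendLines(body, lines) {
  for (var i = 0; i < lines.length; i++) {
    body.appendParagraph(lines[i]);
  }
  return lines.length;
}

// Stand-in for a Google Doc body, so the helper can be tried outside Apps Script.
var fakeBody = {
  paragraphs: [],
  appendParagraph: function (text) { this.paragraphs.push(text); }
};
var appended = appendLines(fakeBody, ['line one', 'line two']);
```

Keeping the document handle as a parameter makes it possible to check the formatting logic without writing to a real document.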
The specific formatting requirements posed an interesting challenge.
To remove the nikud from the Hebrew text, we used the JavaScript string replace() method with a regular expression that matches the Unicode range U+0591 to U+05C7 (the Hebrew cantillation marks and vowel points) and replaces every match with an empty string.
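The nikud removal can be tried in isolation with a small wrapper like the one below. The function name stripNikud and the sample word are illustrative assumptions; the regular expression is the one used in the script in this post.

```javascript
// Hypothetical helper wrapping the replace() call described above.
function stripNikud(text) {
  // U+0591 to U+05C7 covers Hebrew cantillation marks and vowel points.
  // The range also contains a few punctuation marks, such as maqaf (U+05BE),
  // so those are removed as well.
  return text.replace(/[\u0591-\u05C7]/g, '');
}

// The word "shalom" written with vowel points, given as escapes to keep the source ASCII-only.
var pointed = '\u05E9\u05C1\u05B8\u05DC\u05D5\u05B9\u05DD';
var plain = stripNikud(pointed);
```

If the maqaf should survive as a word separator, one option would be to replace it with a space before stripping; the script in this post does not do that.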
To split the text after each period, we used the JavaScript split() method, splitting the text into an array of sentences.
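A short sketch of what split('.') returns helps explain why empty entries have to be skipped. The helper name splitSentences is an assumption, and the trimming it performs is an addition for illustration; the script in this post only skips empty strings.

```javascript
// Hypothetical helper: split on periods, then drop empty and whitespace-only pieces.
function splitSentences(text) {
  var pieces = text.split('.');
  var sentences = [];
  for (var i = 0; i < pieces.length; i++) {
    var piece = pieces[i].trim();
    if (piece) {
      sentences.push(piece);
    }
  }
  return sentences;
}

var raw = 'one. two. three.'.split('.');          // ['one', ' two', ' three', '']
var cleaned = splitSentences('one. two. three.'); // ['one', 'two', 'three']
```

Note that split() discards the periods themselves and leaves a trailing empty string when the text ends with one.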
Lastly, to include the section and sentence numbers, we looped through the array of sentences, appending each sentence as a new paragraph to the Google Doc's body, and prefixed each sentence with its corresponding section and sentence number.
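The numbering logic can be sketched separately from the document, as below. The helper name labelSentences and the sample input are illustrative assumptions; the label format mirrors the one used in the script in this post.

```javascript
// Hypothetical helper: build the "§ section # sentence: text" labels without touching a document.
function labelSentences(sections) {
  var lines = [];
  for (var i = 0; i < sections.length; i++) {
    if (!sections[i]) continue; // skip empty sections but keep their number
    var sentences = sections[i].split('.');
    for (var j = 0; j < sentences.length; j++) {
      if (sentences[j]) {
        lines.push('§ ' + (i + 1) + ' # ' + (j + 1) + ': ' + sentences[j]);
      }
    }
  }
  return lines;
}

var labeled = labelSentences(['a.b.', '', 'c.']);
```

Because the section number comes from the array index, an empty section is skipped without shifting the numbers of the sections after it.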
Here’s the final code that I used (you'll need to replace 'your-document-id' with the actual ID of your Google Doc):
function fetchAndInsertHebrewText() {
  // Fetch the page from the Sefaria API and parse the JSON response
  var url = "https://www.sefaria.org/api/texts/Gittin.7a";
  var response = UrlFetchApp.fetch(url);
  var dataAll = JSON.parse(response.getContentText());

  // Open the target Google Doc (replace with your own document ID)
  var doc = DocumentApp.openById('your-document-id');
  var body = doc.getBody();

  // 'he' holds the Hebrew text, one array entry per section
  var dataHebrew = dataAll['he'];
  for (var i = 0; i < dataHebrew.length; i++) {
    if (dataHebrew[i]) {
      // Remove nikud
      var hebrewTextWithoutNikud = dataHebrew[i].replace(/[\u0591-\u05C7]/g, '');
      // Split text after each period
      var sentences = hebrewTextWithoutNikud.split('.');
      for (var j = 0; j < sentences.length; j++) {
        if (sentences[j]) {
          // Prefix each sentence with its section and sentence number
          body.appendParagraph("§ " + (i+1) + " # " + (j+1) + ": " + sentences[j]);
        }
      }
    }
  }
}
[1] The font is Frank Ruhl Libre, Medium.