Pt.3 of Scripting the Talmud: Automated Talmudic Text Extraction and Formatting - Emulating Sefaria’s Bilingual Talmud in a Google Doc with Google Apps Script
Part of series. See previous here: “Scripting the Talmud Part#2: Automated Rashi Text Extraction and digital layout of tzurat daf” (June 2, 2023) and here: “Scripting the Talmud: Automated Talmudic Text Extraction and Formatting” (May 24, 2023). For a recent review of contemporary digital layouts of the Talmudic page, see my Seforim Blog post: “From Print to Pixel: Digital Editions of the Talmud Bavli” (June 5, 2023)
I developed a Google Apps Script that fetches Sefaria’s Talmud, with Steinzaltz translation and running commentary, from the Sefaria API, and then inserts it side by side in a Google Document.
The reason to emulate Sefaria's layout, is that it's actually quite good. A future open-access interface can start with that layout. Compare also Al-Hatorah's layout (discussed in above-mentioned review in Seforim Blog.)
In the next part I plan to discuss building further on this, especially ideas for a future AI powered experience. Using ChatGPT to summarize the daf, and to create flowcharts, seems to be especially promising. Again, see my Seforim Blog post for further discussion, for now.
Here's a brief overview of the functions (full script in appendix at the end):
Sefaria
Yesterday’s daf yomi, Kiddushin 22a:
My automated emulation, in Google Doc
Side by Side
Splitscreen:
Summary of script
fetchAndInsertTexts():
Fetches bilingual data from the Sefaria API for the text Kiddushin.22a.
Parses the response and sends the data to insertSideBySideTexts() for inserting into a Google Document.
insertSideBySideTexts(dataHebrew, dataEnglish):
Inserts the provided Hebrew and English texts side by side in a Google Document.
For each section:
It appends a section number.
Creates a table with one row and two columns.
Inserts the English text in the left column.
Inserts the Hebrew text in the right column, with nikud (vowel markings) removed and aligned to the right.
Calls formatTextBasedOnHtmlTags() to format the text based on HTML tags.
formatTextBasedOnHtmlTags():
Scans the document for text wrapped in <b> (bold) and <i> (italic) tags.
Applies the respective bold or italic formatting.
Removes the HTML tags from the document.
Fonts used
For Hebrew: The typeface "Frank Ruhl Libre" and font size 14.
For English: The typeface "EB Garamond" and font size 12.
Need to fix
Unlike in Sefaria, I split into lines based on period + space (‘. ’). Need to optimize this, or remove this part of the script.
For optimal emulation, the Google Doc should be set to ‘Page Setup’ > ‘Pageless’.
The bolding transformation is not fully correct. Transforming HTML tags to Google Docs formatting is surprisingly difficult.
Appendix - Full Script
function fetchAndInsertTexts() {
var url = "https://www.sefaria.org/api/texts/Kiddushin.22a?lang=bi"; // Fetch bilingual data
var response = UrlFetchApp.fetch(url);
var dataAll = JSON.parse(response.getContentText());
var dataHebrew = dataAll['he'];
var dataEnglish = dataAll['text'];
insertSideBySideTexts(dataHebrew, dataEnglish);
}
function insertSideBySideTexts(dataHebrew, dataEnglish) {
var doc = DocumentApp.getActiveDocument();
var body = doc.getBody();
for (var i = 0; i < dataHebrew.length; i++) {
// Insert section number
body.appendParagraph("§ " + (i+1)).setAlignment(DocumentApp.HorizontalAlignment.CENTER);
// Create a table with one row and two columns
var table = body.appendTable([[ '', '' ]]);
// Set table width and make the borders invisible
var cols = table.getRow(0).getNumCells();
for (var col = 0; col < cols; col++) {
table.setColumnWidth(col, 200); // Setting each column width to 200, adjust as needed
}
table.setBorderWidth(0); // Making table borders invisible
// Insert English
var englishCell = table.getCell(0, 0);
if (dataEnglish[i]) {
var sentencesEnglish = dataEnglish[i].split('. ').map(s => s.trim());
for (var j = 0; j < sentencesEnglish.length; j++) {
if (sentencesEnglish[j]) {
var para = englishCell.appendParagraph(sentencesEnglish[j]);
para.setFontFamily("EB Garamond").setFontSize(12);
}
}
}
// Insert Hebrew
var hebrewCell = table.getCell(0, 1);
if (dataHebrew[i]) {
var hebrewTextWithoutNikud = dataHebrew[i].replace(/[\u0591-\u05C7]/g, '');
var sentencesHebrew = hebrewTextWithoutNikud.split('. ').map(s => s.trim());
for (var j = 0; j < sentencesHebrew.length; j++) {
if (sentencesHebrew[j]) {
var para = hebrewCell.appendParagraph(sentencesHebrew[j]);
para.setFontFamily("Frank Ruhl Libre").setFontSize(14).setAlignment(DocumentApp.HorizontalAlignment.RIGHT); // Align right for Hebrew
}
}
}
}
// Format text based on HTML tags
formatTextBasedOnHtmlTags();
}
function formatTextBasedOnHtmlTags() {
var body = DocumentApp.getActiveDocument().getBody();
// Find all the <b> tags and apply bold formatting
var boldPattern = /<b>(.*?)<\/b>/g;
var match;
while (match = boldPattern.exec(body.getText())) {
var rangeElement = body.findText(match[0]);
if (rangeElement) {
var startOffset = rangeElement.getStartOffset() + 3; // Adjusting for <b>
var endOffset = startOffset + match[1].length - 1;
rangeElement.getElement().asText().setBold(startOffset, endOffset, true);
}
}
// Find all the <i> tags and apply italic formatting
var italicPattern = /<i>(.*?)<\/i>/g;
while (match = italicPattern.exec(body.getText())) {
var rangeElement = body.findText(match[0]);
if (rangeElement) {
var startOffset = rangeElement.getStartOffset() + 3; // Adjusting for <i>
var endOffset = startOffset + match[1].length - 1;
rangeElement.getElement().asText().setItalic(startOffset, endOffset, true);
}
}
// Remove the <b>, </b>, <i>, and </i> tags from the document
body.replaceText("<b>", "");
body.replaceText("</b>", "");
body.replaceText("<i>", "");
body.replaceText("</i>", "");
}
fetchAndInsertTexts();
I am unlikely to use these tools for my own learning, but the fact that they exist now is potentially revolutionary. kudos for innovating their possible uses and popularizing them. exciting times!