Pt.3 of Scripting the Talmud: Automated Talmudic Text Extraction and Formatting - Emulating Sefaria’s Bilingual Talmud in a Google Doc with Google Apps Script

Sep 05, 2023

Part of series. See previous here: “Scripting the Talmud Part#2: Automated Rashi Text Extraction and digital layout of tzurat daf” (June 2, 2023) and here: “Scripting the Talmud: Automated Talmudic Text Extraction and Formatting” (May 24, 2023). For a recent review of contemporary digital layouts of the Talmudic page, see my Seforim Blog post: “From Print to Pixel: Digital Editions of the Talmud Bavli” (June 5, 2023)

I developed a Google Apps Script that fetches Sefaria’s Talmud, with Steinzaltz translation and running commentary, from the Sefaria API, and then inserts it side by side in a Google Document.

The reason to emulate Sefaria's layout, is that it's actually quite good. A future open-access interface can start with that layout. Compare also Al-Hatorah's layout (discussed in above-mentioned review in Seforim Blog.)

In the next part I plan to discuss building further on this, especially ideas for a future AI powered experience. Using ChatGPT to summarize the daf, and to create flowcharts, seems to be especially promising. Again, see my Seforim Blog post for further discussion, for now.

Here's a brief overview of the functions (full script in appendix at the end):

Sefaria

Yesterday’s daf yomi, Kiddushin 22a:

My automated emulation, in Google Doc

Side by Side

Splitscreen:

Summary of script

fetchAndInsertTexts():
1. Fetches bilingual data from the Sefaria API for the text Kiddushin.22a.
2. Parses the response and sends the data to insertSideBySideTexts() for inserting into a Google Document.
insertSideBySideTexts(dataHebrew, dataEnglish):
1. Inserts the provided Hebrew and English texts side by side in a Google Document.
2. For each section:
  1. It appends a section number.
  2. Creates a table with one row and two columns.
  3. Inserts the English text in the left column.
  4. Inserts the Hebrew text in the right column, with nikud (vowel markings) removed and aligned to the right.
3. Calls formatTextBasedOnHtmlTags() to format the text based on HTML tags.
formatTextBasedOnHtmlTags():
1. Scans the document for text wrapped in <b> (bold) and <i> (italic) tags.
2. Applies the respective bold or italic formatting.
3. Removes the HTML tags from the document.

Fonts used

For Hebrew: The typeface "Frank Ruhl Libre" and font size 14.

For English: The typeface "EB Garamond" and font size 12.

Need to fix

Unlike in Sefaria, I split into lines based on period + space (‘. ’). Need to optimize this, or remove this part of the script.
For optimal emulation, the Google Doc should be set to ‘Page Setup’ > ‘Pageless’.
The bolding transformation is not fully correct. Transforming HTML tags to Google Docs formatting is surprisingly difficult.

Appendix - Full Script


function fetchAndInsertTexts() {
  var url = "https://www.sefaria.org/api/texts/Kiddushin.22a?lang=bi";  // Fetch bilingual data
 
  var response = UrlFetchApp.fetch(url);
  var dataAll = JSON.parse(response.getContentText());


  var dataHebrew = dataAll['he'];
  var dataEnglish = dataAll['text'];


  insertSideBySideTexts(dataHebrew, dataEnglish);
}


function insertSideBySideTexts(dataHebrew, dataEnglish) {
  var doc = DocumentApp.getActiveDocument();
  var body = doc.getBody();


  for (var i = 0; i < dataHebrew.length; i++) {
    // Insert section number
    body.appendParagraph("§ " + (i+1)).setAlignment(DocumentApp.HorizontalAlignment.CENTER);


    // Create a table with one row and two columns
    var table = body.appendTable([[ '', '' ]]);


    // Set table width and make the borders invisible
    var cols = table.getRow(0).getNumCells();
    for (var col = 0; col < cols; col++) {
      table.setColumnWidth(col, 200);  // Setting each column width to 200, adjust as needed
    }
    table.setBorderWidth(0);  // Making table borders invisible


    // Insert English
    var englishCell = table.getCell(0, 0);
    if (dataEnglish[i]) {
      var sentencesEnglish = dataEnglish[i].split('. ').map(s => s.trim());
      for (var j = 0; j < sentencesEnglish.length; j++) {
        if (sentencesEnglish[j]) {
          var para = englishCell.appendParagraph(sentencesEnglish[j]);
          para.setFontFamily("EB Garamond").setFontSize(12);
        }
      }
    }


    // Insert Hebrew
    var hebrewCell = table.getCell(0, 1);
    if (dataHebrew[i]) {
      var hebrewTextWithoutNikud = dataHebrew[i].replace(/[\u0591-\u05C7]/g, '');
      var sentencesHebrew = hebrewTextWithoutNikud.split('. ').map(s => s.trim());
      for (var j = 0; j < sentencesHebrew.length; j++) {
        if (sentencesHebrew[j]) {
          var para = hebrewCell.appendParagraph(sentencesHebrew[j]);
          para.setFontFamily("Frank Ruhl Libre").setFontSize(14).setAlignment(DocumentApp.HorizontalAlignment.RIGHT); // Align right for Hebrew
        }
      }
    }
  }


  // Format text based on HTML tags
  formatTextBasedOnHtmlTags();
}


function formatTextBasedOnHtmlTags() {
  var body = DocumentApp.getActiveDocument().getBody();
 
  // Find all the <b> tags and apply bold formatting
  var boldPattern = /<b>(.*?)<\/b>/g;
  var match;
  while (match = boldPattern.exec(body.getText())) {
    var rangeElement = body.findText(match[0]);
    if (rangeElement) {
      var startOffset = rangeElement.getStartOffset() + 3; // Adjusting for <b>
      var endOffset = startOffset + match[1].length - 1;
      rangeElement.getElement().asText().setBold(startOffset, endOffset, true);
    }
  }


  // Find all the <i> tags and apply italic formatting
  var italicPattern = /<i>(.*?)<\/i>/g;
  while (match = italicPattern.exec(body.getText())) {
    var rangeElement = body.findText(match[0]);
    if (rangeElement) {
      var startOffset = rangeElement.getStartOffset() + 3; // Adjusting for <i>
      var endOffset = startOffset + match[1].length - 1;
      rangeElement.getElement().asText().setItalic(startOffset, endOffset, true);
    }
  }


  // Remove the <b>, </b>, <i>, and </i> tags from the document
  body.replaceText("<b>", "");
  body.replaceText("</b>", "");
  body.replaceText("<i>", "");
  body.replaceText("</i>", "");
}

fetchAndInsertTexts();