Automated Aggada formatting - splitting into sections and lines, bolding verses, and underlining sages (sampling the beginning of Sanhedrin Perek Chelek)

Jan 24, 2024

A continuation of my recent piece, “Identifying the Most Quoted Sages in the Talmud's Aggada: A Programmatic and Quantitative Study”, using the same base text: the Sefaria edition of Ein Yaakov. See also my previous pieces on formatting the Talmud, indexed here, under the section “Digital Layout of the Talmud and other Rabbinic Texts”.

Here I discuss a script that applies a small set of formatting rules to improve readability and structure. It breaks large blocks of Talmudic text into manageable sections and paragraphs, making the sugyot easier to read and navigate.

I used the python-docx library (imported in Python as docx). See Manoj Das, “Python-docx: A Comprehensive Guide to Creating and Manipulating Word Documents in Python”:

The docx Python library is a popular tool used for working with Microsoft Word files in the .docx format. It allows you to create, modify, and extract information from Word documents programmatically using Python code.

Key Functionalities of the algorithm

Section and paragraph splitting: Each row of the input CSV becomes its own section, and within each section the text is split into paragraphs after each period or question mark.
Headers: The first paragraph in each section is set as a header.
Bolding Biblical verses (quoted text): The script detects and bolds Biblical verses, identified as text within quotation marks. This gives visual emphasis to key parts of the text.
Underlining specific patterns: The script underlines many instances of sages making a statement, based on text matching a particular pattern (אמר רבי..., see my previous piece on this), adding another layer of readability.
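
The rules above can be tried in isolation before any Word document is involved. The following is a minimal sketch that applies the three regular expressions from the appendix to short sample strings. The sample strings and the variable names `sample`, `hebrew_sample`, `paragraphs`, `quotes`, and `sage` are my own illustrations, not part of the original script, and splitting on a zero-width match assumes Python 3.7 or later.

```python
import re

# The three patterns, copied from the script in the appendix
split_pattern = r'(?<=\. )|(?<=\? )'   # zero-width split point after ". " or "? "
quote_pattern = r'\"(.*?)\"'           # shortest text between two double quotes
underline_pattern = r'אמר ר[^:]*:'     # from "אמר ר" up to the next colon

# Hypothetical English sample, used only to make the splitting easy to follow
sample = 'First sentence. Is this a question? It is written: "a quoted verse" and so on.'

# Split into paragraphs, dropping empty pieces and surrounding whitespace
paragraphs = [p.strip() for p in re.split(split_pattern, sample) if p.strip()]

# Collect every quoted span, including its quotation marks
quotes = [m.group(0) for m in re.finditer(quote_pattern, sample)]

# Hypothetical Hebrew sample for the sage pattern
hebrew_sample = 'אמר רבי יוחנן: מניין לתחיית המתים מן התורה'
sage = re.search(underline_pattern, hebrew_sample).group(0)

print(paragraphs)
print(quotes)
print(sage)
```

Note that the final period of the sample does not trigger a split, because the split pattern requires a space after the punctuation mark.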

Beyond the considerable help that punctuation provides, bolding Biblical verses is especially useful in aggadic passages, where the Talmud is constantly doing close readings of Biblical verses.

Comparison - screenshots

The base text and punctuation that I used, Sefaria presentation (Ein Yaakov, Sanhedrin.11.2, showing punctuation, without nikud; Sefaria doesn't have an option for this text to further split into paragraphs):

In regular Sefaria ed. and presentation (Sanhedrin.90b.2, split into paragraphs; Sefaria doesn't have punctuation for this tractate):

My presentation, based on algorithmic formatting (font: Georgia):

General Application and Utility

This script, built on the docx Python library, converts textual data into a structured Word document. It splits text into sections and paragraphs and applies targeted formatting such as bolding and underlining.

It is useful wherever punctuated text needs to be presented in a well-structured, easily navigable Word or PDF document. It is particularly helpful for those who regularly work with text-heavy data and want to automate the formatting step, saving time and effort.

Future improvements

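Headers are duplicated; this needs to be fixed.
Fix acronyms that break the bolding, because the double quotation mark inside the acronym is read as the start of a quoted verse: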
1. מ"א=מלכים א'
2. מ"ב=מלכים ב'
3. ש"א=שמואל א'
4. ש"ב=שמואל ב'
5. דה"א=דברי הימים א'
6. דה"ב=דברי הימים ב'
Expand the pattern matching for names of sages (see my previous piece on this)
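
One possible way to handle the acronym issue listed above is to rewrite acronym marks before the quote pattern runs. The sketch below is not part of the original script: the helper `normalize_acronyms`, the constants `HEBREW_LETTER` and `ACRONYM_QUOTE`, and the sample string are hypothetical. It assumes that a double quote with a Hebrew letter directly on both sides is always an acronym mark, and replaces it with the Hebrew gershayim character (U+05F4), so that only genuine quotation marks remain for the bolding step.

```python
import re

HEBREW_LETTER = '[א-ת]'
# Assumption: a double quote flanked by Hebrew letters is an acronym mark, not a quotation mark
ACRONYM_QUOTE = re.compile(f'(?<={HEBREW_LETTER})"(?={HEBREW_LETTER})')

# Quote pattern copied from the script in the appendix
quote_pattern = r'\"(.*?)\"'

def normalize_acronyms(text):
    """Replace acronym quote marks with the gershayim character (U+05F4)."""
    return ACRONYM_QUOTE.sub('\u05F4', text)

# Hypothetical sample: a book acronym followed by a quoted verse
sample = 'כתיב (מ"ב ה) "וישב בשרו" וגו'

# Without normalization, the acronym's quote mark pairs with the verse's opening mark
before = [m.group(0) for m in re.finditer(quote_pattern, sample)]

# With normalization, only the verse itself is matched
cleaned = normalize_acronyms(sample)
after = [m.group(0) for m in re.finditer(quote_pattern, cleaned)]

print(before)
print(after)
```

The same assumption would misfire on a verse whose quotation marks touch Hebrew letters on both sides, so the result should be spot-checked against the source text.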

Appendix - Final code

# Import required libraries
import pandas as pd
import re
from docx import Document
from docx.enum.section import WD_SECTION

# Function to create a Word document from CSV
def csv_to_word(csv_file, word_file):
    # Read the CSV file
    df = pd.read_csv(csv_file)

    # Create a new Word document
    doc = Document()

    # Regular expression pattern for splitting by period or question mark, keeping them
    split_pattern = r'(?<=\. )|(?<=\? )'

    # Pattern for text within quotation marks
    quote_pattern = r'\"(.*?)\"'

    # Pattern for the specific format to underline
    underline_pattern = r'אמר ר[^:]*:'

    # Initially set the first_row flag to True
    first_row = True

    # Loop through each row in the DataFrame
    for index, row in df.iterrows():
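        # Split the text in the first column by period or question mark, keeping them
        # (row.iloc[0] is explicit positional access; plain row[0] is deprecated in recent pandas)
        paragraphs = re.split(split_pattern, row.iloc[0])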

        # Flag to identify the first paragraph in each row
        first_paragraph = True

        # Add a section break before the first paragraph of each row, except the very first row
        if not first_row:
            doc.add_section(WD_SECTION.NEW_PAGE)
        first_row = False

        # Add each split segment as a new paragraph in the document
        for para in paragraphs:
            # Check if the segment is not empty
            if para.strip():
                if first_paragraph:
                    # Add the first paragraph as a heading
                    p = doc.add_heading(para.strip(), level=2)
                    first_paragraph = False
                else:
                    # Add subsequent paragraphs as normal paragraphs
                    p = doc.add_paragraph()

                # Process and add text segments for quotes
                last_idx = 0
                for match in re.finditer(quote_pattern, para):
                    # Add text before the quote
                    p.add_run(para[last_idx:match.start()])
                    # Add quoted text in bold
                    p.add_run(match.group(0)).bold = True
                    # Update the last index
                    last_idx = match.end()

                # Add remaining part of the paragraph after quotes
                temp_para = para[last_idx:]

                # Process and add text segments for underline pattern
                last_idx = 0
                for match in re.finditer(underline_pattern, temp_para):
                    # Add text before the pattern
                    p.add_run(temp_para[last_idx:match.start()])
                    # Add pattern text with underline
                    p.add_run(match.group(0)).underline = True
                    # Update the last index
                    last_idx = match.end()

                # Add remaining part of the paragraph after underline pattern
                p.add_run(temp_para[last_idx:])

    # Save the document
    doc.save(word_file)

# Replace 'testsheet5.csv' with the path to your CSV file
# Replace 'output.docx' with your desired Word document name
csv_to_word('testsheet5.csv', 'output.docx')
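
Two behaviors of the script above are worth noting. First, the heading text is passed to add_heading and then added again as runs, which is the header duplication mentioned under future improvements. Second, the underline loop only scans the text after the last quotation, so a sage statement that precedes a quoted verse is left without underlining. What follows is a minimal sketch of one way to address both, not the method used in the script: the functions `segment` and `write_runs` and the sample text are hypothetical. It collects bold and underline spans in a single pass and writes each piece of text exactly once, to any object that exposes an add_run() method as python-docx paragraphs do.

```python
import re

# Patterns copied from the script above
QUOTE = re.compile(r'\"(.*?)\"')
SAGE = re.compile(r'אמר ר[^:]*:')

def segment(text):
    """Return (chunk, bold, underline) tuples that cover text in order."""
    quote_spans = [(m.start(), m.end(), 'bold') for m in QUOTE.finditer(text)]
    # Keep a sage match only if it does not overlap any quoted span
    sage_spans = [
        (m.start(), m.end(), 'underline')
        for m in SAGE.finditer(text)
        if all(m.end() <= start or m.start() >= end for start, end, _ in quote_spans)
    ]
    segments, pos = [], 0
    for start, end, kind in sorted(quote_spans + sage_spans):
        if start > pos:
            segments.append((text[pos:start], False, False))
        segments.append((text[start:end], kind == 'bold', kind == 'underline'))
        pos = end
    if pos < len(text):
        segments.append((text[pos:], False, False))
    return segments

def write_runs(paragraph, text):
    """Add each segment once as a run, setting bold or underline where needed."""
    for chunk, bold, underline in segment(text):
        run = paragraph.add_run(chunk)
        if bold:
            run.bold = True
        if underline:
            run.underline = True

# Hypothetical sample: a sage statement followed by a quoted verse
sample_text = 'אמר רבי יוחנן: שנאמר "ונתתם ממנu" וגו'.replace('u', 'ו')
segments = segment(sample_text)

# Usage with python-docx (not run here):
# from docx import Document
# doc = Document()
# write_runs(doc.add_heading(level=2), header_text)  # empty heading, text added once
# write_runs(doc.add_paragraph(), body_text)
```

Creating the heading empty and filling it through write_runs means the header text appears a single time, with the same bolding and underlining rules as the body paragraphs.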
