Jastrow's Greek Gems: Or, How I Extracted and Processed All 1000+ Greek loanwords Defined in Jastrow's Dictionary - Pt.1

Finding the text; extracting it; filtering for Greek; transliterating the Greek characters to Latin characters

Sep 24, 2023

Pt.2 will cover: extracting the Greek word and the Jastrow definition using regex formulas; and a formula to convert traditional Hebrew numerals to modern (Arabic) numerals

I recently extracted and processed all 1158 Greek loanwords defined in Jastrow's Dictionary.

The final result is here:

“A Lexicon of Greek Loanwords in the Talmud and Midrash” (requires registration to view)

Here, I’d like to describe some of the tools, scripts, and Google Sheets formulas I used to accomplish this.

I used Hebrew Wikisource’s transcription of Jastrow’s dictionary for my previous two extractions from Jastrow’s Dictionary (“A Lexicon of Place Names in the Talmud and Midrash” and “A Lexicon of Personal Names in the Talmud and Midrash” - both require registration to view). However, Wikisource’s transcriptions are missing most Greek words (instead, the string ‘~~~~’ appears).

And unlike Sefaria’s edition of the Talmud (Hebrew and English), Sefaria’s transcription of the dictionary can’t be downloaded from the site, and it isn’t available via the API or in the Sefaria data dump on their GitHub.

I also tried scraping it directly from their website with Selenium, but couldn't get past the difficulty of handling JavaScript-driven scrolling.

Only after trying all those did I discover that the full text of the dictionary is available in Sefaria’s Sefaria-Data repository on GitHub, at this path: [1]

dictionaries/Jastrow/data/01-Merged XML/Jastrow-full.xml

The text is in XML format.

I wrote code in Google Colab (a Jupyter notebook environment) that parses the XML and splits it into ‘headwords’ and entries. (A “headword” is the word being defined. Wiktionary defines “headword” as: “A word used as the title of a section, particularly in a dictionary, encyclopedia, or thesaurus.”)
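As a rough illustration of the headword/entry split, here is a minimal sketch that runs on a tiny inline sample rather than on the real file. The tag names entry, head-word, and definition are the ones used in the full code in Appendix #1; the sample entries themselves, the wrapper tags root and senses, and the function name split_entries are invented for this example and are not taken from the dictionary.

```python
import xml.etree.ElementTree as ET

# Invented sample: only the tag names entry/head-word/definition follow the appendix code.
SAMPLE_XML = """
<root>
  <entry>
    <head-word>HEADWORD-A</head-word>
    <senses><definition>first <i>sample</i> definition</definition></senses>
  </entry>
  <entry>
    <head-word>HEADWORD-B</head-word>
    <senses><definition>second sample definition</definition></senses>
  </entry>
</root>
"""


def split_entries(xml_text):
    """Return a list of (headword, definition text) pairs, one per <entry>."""
    root = ET.fromstring(xml_text.strip())
    pairs = []
    for entry in root.findall(".//entry"):
        headword = entry.findtext("head-word", default="")
        definition = entry.find(".//definition")
        # itertext() yields the text of the element and of all nested tags, in order
        text = "".join(definition.itertext()) if definition is not None else ""
        # Collapse runs of whitespace into single spaces
        pairs.append((headword, " ".join(text.split())))
    return pairs


pairs = split_entries(SAMPLE_XML)
for headword, text in pairs:
    print(headword, "->", text)
```

This sketch uses itertext() for brevity; the appendix code uses its own recursive extractor so that it can mark superscripts.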

The final code is at the end of this piece.

Further cleaning and processing the data, in Google Sheets

A basic mapping table for Greek-to-Latin transliteration, combined with a Google Sheets formula, transliterates the text automatically. The formula can be used for any automated transliteration: you just need to provide a mapping table!

The Google Sheets formula:

=JOIN("", ARRAYFORMULA(XLOOKUP(SPLIT(REGEXREPLACE(A1, "(.)", "$1,"), ","), C:C, D:D, "", 1)))

This is by far the simplest way I found to transliterate automatically. It uses the powerful regex functions built into Google Sheets (which Excel did not offer natively at the time of writing), together with ‘SPLIT’ and the relatively new ‘XLOOKUP’ function. Getting ChatGPT4 to build this formula took many steps, as it’s mostly trained on older data, which didn’t include XLOOKUP.

See the full info in the appendices to this piece:

Appendix #2 - Greek-to-Latin characters mapping table
Appendix #3 - Transliteration mapping formula, for Google Sheets

Appendix #1 - final complete code for extracting entries with Greek, from Sefaria’s Jastrow’s Dictionary

Summary

This code downloads an XML file containing dictionary entries, parses the XML, and then extracts entries that contain Greek letters. It then displays these entries in a Markdown table format. Here’s the breakdown:

Imports Necessary Modules:
1. xml.etree.ElementTree: To parse and navigate XML data.
2. pandas: A library for data analysis and manipulation. It is imported, but the code below does not actually use it.
3. re: Used for working with regular expressions.
Data Download:
1. It creates a directory named jastrow_data.
2. It then downloads an XML file (named Jastrow-full.xml) from a specified URL and saves it in the jastrow_data directory. (Note: The actual download commands !mkdir and !wget would only work in environments like Jupyter or Colab.)
Function Definitions:
1. xml_content_extractor(element): Extracts text content from an XML element and its child elements. Special handling is included for <sup> tags by adding a '^' before its content.
2. limit_words(text, max_words=30): Truncates the text to the specified number of words (default is 30).
3. contains_greek_letters(text): Checks if the text contains Greek letters using a regular expression.
Parse the XML Content:
1. The XML file is read into a string named xml_content.
2. This string is then parsed to create an XML tree, with root being its root element.
Data Extraction:
1. The code navigates through each <entry> tag in the XML content.
2. For each entry:
  1. It extracts the head word (found in the head-word tag).
  2. It extracts the definition text and notes from their respective tags.
  3. The definition and notes texts are combined and newlines are replaced with spaces.
  4. The combined text is truncated to the first 30 words.
  5. If the truncated text contains Greek letters, the head word and the truncated text are added as a pair to a list named data.
Display Data:
1. It prints the extracted data in a Markdown table format.
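One detail of the Greek check is worth illustrating. The range U+0370 to U+03FF is the Unicode "Greek and Coptic" block, while polytonic letters (those carrying breathings and similar marks) sit in a separate "Greek Extended" block, U+1F00 to U+1FFF. The sketch below compares the check as described above with a hypothetical wider variant. The wider pattern and the names BASIC_GREEK, GREEK_WITH_EXTENDED, and contains_greek are additions for illustration only, not part of the original code, and whether the difference matters in practice depends on how the Greek is encoded in the XML, which is not checked here.

```python
import re

# Greek and Coptic block only, as in the appendix code
BASIC_GREEK = re.compile("[\u0370-\u03FF]")
# Hypothetical wider variant that also covers the Greek Extended block
GREEK_WITH_EXTENDED = re.compile("[\u0370-\u03FF\u1F00-\u1FFF]")


def contains_greek(text, pattern=BASIC_GREEK):
    """Return True if the pattern finds at least one character in text."""
    return bool(pattern.search(text))


plain_word = "\u03bb\u03bf\u03b3\u03bf\u03c2"  # five basic Greek letters
polytonic_letter = "\u1f00"                    # alpha with smooth breathing (Greek Extended)

print(contains_greek(plain_word))
print(contains_greek(polytonic_letter))
print(contains_greek(polytonic_letter, GREEK_WITH_EXTENDED))
```

Since a Greek word almost always contains at least one letter from the basic block, the narrower range would miss an entry only if every Greek character in the first 30 words came from the extended block.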

Code

import xml.etree.ElementTree as ET
import pandas as pd  # imported in the original notebook; not used in the code below
import re

# Notebook shell commands (Jupyter/Colab only): create the folder and download the XML
!mkdir -p jastrow_data
!wget https://raw.githubusercontent.com/Sefaria/Sefaria-Data/947c1b91684df9f8b92f14cf0d281b5d4f29bfc7/dictionaries/Jastrow/data/01-Merged%20XML/Jastrow-full.xml -P jastrow_data/


def xml_content_extractor(element):
    """Extract text content from an element and its children, while minimizing whitespace."""
    if element is None:
        return ""

    texts = [element.text or ""]

    for child in element:
        child_text = xml_content_extractor(child)

        # Special handling for <sup> tags: mark superscripts with a leading '^'
        if child.tag == "sup":
            child_text = f"^{child_text}"

        texts.append(child_text)

        # Text that follows the child's closing tag. Note that strip() also removes
        # the space separating it from the child's text, so the two can run together.
        if child.tail:
            texts.append(child.tail.strip())

    return "".join(texts).strip()


def limit_words(text, max_words=30):
    """Limit the text to the first max_words words."""
    words = text.split()
    return ' '.join(words[:max_words])


def contains_greek_letters(text):
    """Check if the text contains Greek letters (Unicode block U+0370 to U+03FF)."""
    return bool(re.search("[\u0370-\u03FF]+", text))


# Load the XML content from the file (/content is Colab's default working directory)
with open('/content/jastrow_data/Jastrow-full.xml', 'r', encoding='utf-8') as file:
    xml_content = file.read()

# Parse the XML content
root = ET.fromstring(xml_content)

# Extract the content for each <entry> in the XML content
data = []
for entry in root.findall(".//entry"):
    # findtext avoids an AttributeError if an entry has no <head-word>
    head_word = entry.findtext("head-word", default="")
    definition_text = xml_content_extractor(entry.find(".//definition"))
    notes_text = xml_content_extractor(entry.find(".//notes"))

    combined_text = f"{definition_text} {notes_text}".replace("\n", " ").strip()
    limited_text = limit_words(combined_text)

    # Keep the entry only if Greek appears within the first 30 words
    if contains_greek_letters(limited_text):
        data.append([head_word, limited_text])

# Display the extracted data in Markdown format (output capped at 2000 rows)
print("|Head Word|Definition|")
print("|---------|----------|")
for item in data[:2000]:
    print(f"|{item[0]}|{item[1]}|")
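Since the !mkdir and !wget lines only work inside a notebook, here is a minimal sketch of the same download step in plain Python, using only the standard library. The URL is the one from the code above; the function name download_jastrow and the skip-if-already-present behavior are assumptions added for this example.

```python
import os
import urllib.request

# Same URL as in the !wget command above
JASTROW_URL = (
    "https://raw.githubusercontent.com/Sefaria/Sefaria-Data/"
    "947c1b91684df9f8b92f14cf0d281b5d4f29bfc7/dictionaries/Jastrow/"
    "data/01-Merged%20XML/Jastrow-full.xml"
)


def download_jastrow(dest_dir="jastrow_data", url=JASTROW_URL):
    """Download the XML into dest_dir unless it is already there; return its path."""
    os.makedirs(dest_dir, exist_ok=True)
    dest_path = os.path.join(dest_dir, "Jastrow-full.xml")
    if not os.path.exists(dest_path):
        # Network access happens only here
        urllib.request.urlretrieve(url, dest_path)
    return dest_path
```

Calling download_jastrow() returns the path of the local file, which can then be passed to open() in place of the hard-coded Colab path.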

Appendix #2 - Greek-to-Latin characters mapping table

Screenshot of table, showing lowercase (= minuscule) letters; the full table also includes uppercase letters.
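For readers who prefer a script to a spreadsheet, here is a minimal sketch of what such a mapping table could look like in Python. The letter values below follow one common romanization and are illustrative only; they are NOT copied from the table in the screenshot. The names LOWER, MAPPING, strip_diacritics, and transliterate are hypothetical, and stripping accents before the lookup is an added assumption that keeps the table limited to unaccented letters.

```python
import unicodedata

# Illustrative lowercase mapping (one common romanization); not the author's table.
LOWER = {
    "α": "a", "β": "b", "γ": "g", "δ": "d", "ε": "e", "ζ": "z",
    "η": "e", "θ": "th", "ι": "i", "κ": "k", "λ": "l", "μ": "m",
    "ν": "n", "ξ": "x", "ο": "o", "π": "p", "ρ": "r", "σ": "s",
    "ς": "s", "τ": "t", "υ": "y", "φ": "ph", "χ": "ch", "ψ": "ps",
    "ω": "o",
}

# Derive the uppercase rows from the lowercase ones, e.g. "θ" -> "th" gives "Θ" -> "Th".
MAPPING = dict(LOWER)
MAPPING.update({greek.upper(): latin.capitalize() for greek, latin in LOWER.items()})


def strip_diacritics(text):
    """Decompose accented letters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def transliterate(text, mapping=MAPPING):
    """Replace each mapped character; characters not in the table pass through unchanged."""
    return "".join(mapping.get(ch, ch) for ch in strip_diacritics(text))


print(transliterate("λόγος"))
print(transliterate("Θεός"))
```

Unlike the Sheets formula, this sketch leaves unmapped characters (spaces, punctuation) in place instead of dropping them.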

Appendix #3 - Transliteration mapping formula, for Google Sheets

The Google Sheets formula:

=JOIN("", ARRAYFORMULA(XLOOKUP(SPLIT(REGEXREPLACE(A1, "(.)", "$1,"), ","), C:C, D:D, "", 1)))

This function takes the content of cell A1, splits it into individual characters, then looks up each character in column C and replaces it with the corresponding value from column D. Finally, it concatenates the results into a single string. It's essentially performing a character-by-character translation based on the mapping in columns C and D.

Step-by-step breakdown of the formula:

REGEXREPLACE(A1, "(.)", "$1,"):
1. This function uses a regular expression to manipulate the content of cell A1.
2. The pattern (.) matches any single character.
3. "$1," replaces each matched character with itself followed by a comma.
4. For instance, if A1 contains the string "ABC", this function will transform it to "A,B,C,".
SPLIT(..., ","):
1. Takes the output from the previous step and splits it at each comma.
2. Continuing with the "ABC" example, the output of this function will be an array: ["A", "B", "C"].
XLOOKUP(..., C:C, D:D, "", 1):
1. This function performs a lookup for each value in the array generated in the previous step.
2. It searches for the value in column C and returns the corresponding value from column D.
3. If a match isn't found, it returns an empty string ("").
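4. The 1 at the end is the match_mode argument. Note that in Google Sheets, 0 (the default) means an exact match, while 1 means an exact match or, failing that, the next value greater than the search key. So with 1, a character missing from column C may pick up another row's value instead of the empty string; use 0 for a strict exact match.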
ARRAYFORMULA(...):
1. This function ensures that the operations inside it (like XLOOKUP) are applied to each element of the array individually.
2. So, if the array from the SPLIT function was ["A", "B", "C"], and column C had "A", "B", "C" with corresponding values "X", "Y", "Z" in column D, then the output of this function will be an array: ["X", "Y", "Z"].
JOIN("", ...):
1. This function joins the elements of the array generated in the previous step into a single string without any delimiter.
2. Continuing with the example, the final output would be "XYZ".
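The same stages can be sketched in Python, one line per spreadsheet function, using the "ABC" to "XYZ" example from the breakdown. The function name sheet_style_transliterate and its parameters are hypothetical. The lookup here is a strict exact match with an empty-string fallback, which is an assumption about the intended behavior rather than a claim about how XLOOKUP treats the final argument.

```python
import re


def sheet_style_transliterate(text, keys, values):
    """Mimic the formula stage by stage: REGEXREPLACE, SPLIT, XLOOKUP, JOIN."""
    # REGEXREPLACE(A1, "(.)", "$1,"): follow every character with a comma
    with_commas = re.sub(r"(.)", r"\1,", text)
    # SPLIT(..., ","): split at commas; Sheets drops empty pieces by default
    chars = [piece for piece in with_commas.split(",") if piece != ""]
    # XLOOKUP(..., C:C, D:D, ""): exact lookup, empty string when there is no match
    table = dict(zip(keys, values))
    looked_up = [table.get(ch, "") for ch in chars]
    # JOIN("", ...): concatenate without a delimiter
    return "".join(looked_up)


column_c = ["A", "B", "C"]
column_d = ["X", "Y", "Z"]
result = sheet_style_transliterate("ABC", column_c, column_d)
print(result)
```

One side effect of using a comma as the temporary separator, in this sketch as in the formula, is that commas in the input text are consumed by the split step.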

[1] Thanks to CM from the ‘Ask the Beit Midrash’ private Facebook group for pointing out this source, in a comment to my post there asking about this.